Commit Graph

31150 Commits

Author SHA1 Message Date
Pablo Neira Ayuso
cf4dfa8539 netfilter: nf_tables: fix error path in the init functions
We have to unregister chain type if this fails to register netns.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 23:25:48 +01:00
James Chapman
74f77a6b2b netfilter: introduce l2tp match extension
Introduce an xtables add-on for matching L2TP packets. Supports L2TPv2
and L2TPv3 over IPv4 and IPv6. As well as filtering on L2TP tunnel-id
and session-id, the filtering decision can also include the L2TP
packet type (control or data), protocol version (2 or 3) and
encapsulation type (UDP or IP).

The most common use for this will likely be to filter L2TP data
packets of individual L2TP tunnels or sessions. While a u32 match can
be used, the L2TP protocol headers are such that field offsets differ
depending on bits set in the header, making rules for matching generic
L2TP connections cumbersome. This match extension takes care of all
that.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 21:36:39 +01:00
Wei Yongjun
d0eb1f7e66 ip_tunnel: fix sparse non static symbol warning
Fixes the following sparse warning:

net/ipv4/ip_tunnel.c:116:18: warning:
 symbol 'tunnel_dst_check' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-09 14:31:47 -05:00
Wei Yongjun
ece37c87ab openvswitch: Use kmem_cache_free() instead of kfree()
memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().

Fixes: e298e50570 ('openvswitch: Per cpu flow stats.')
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-09 14:26:39 -05:00
Patrick McHardy
3876d22dba netfilter: nf_tables: rename nft_do_chain_pktinfo() to nft_do_chain()
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:16 +01:00
Patrick McHardy
44a6f0df03 netfilter: nf_tables: prohibit deletion of a table with existing sets
We currently leak the set memory when deleting a table that still has
sets in it. Return EBUSY when attempting to delete a table with sets.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:16 +01:00
Patrick McHardy
7047f9d052 netfilter: nf_tables: take AF module reference when creating a table
The table refers to data of the AF module, so we need to make sure the
module isn't unloaded while the table exists.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:16 +01:00
Patrick McHardy
c5c1f975ad netfilter: nf_tables: perform flags validation before table allocation
Simplifies error handling. Additionally use the correct type u32 for the
host byte order flags value.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:15 +01:00
Patrick McHardy
fa2c1de0bb netfilter: nf_tables: minor nf_chain_type cleanups
Minor nf_chain_type cleanups:

- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:15 +01:00
Patrick McHardy
2a37d755b8 netfilter: nf_tables: constify chain type definitions and pointers
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:15 +01:00
Patrick McHardy
93b0806f00 netfilter: nf_tables: replay request after dropping locks to load chain type
To avoid races, we need to replay to request after dropping the nfnl_mutex
to auto-load the chain type module.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:14 +01:00
Patrick McHardy
88ce65a71c netfilter: nf_tables: add missing module references to chain types
In some cases we neither take a reference to the AF info nor to the
chain type, allowing the module to be unloaded while in use.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:14 +01:00
Patrick McHardy
baae3e62f3 netfilter: nf_tables: fix chain type module reference handling
The chain type module reference handling makes no sense at all: we take
a reference immediately when the module is registered, preventing the
module from ever being unloaded.

Fix by taking a reference when we're actually creating a chain of the
chain type and release the reference when destroying the chain.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:14 +01:00
Patrick McHardy
758206760c netfilter: nf_tables: fix check for table overflow
The table use counter is only increased for new chains, so move the check
to the correct position.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:13 +01:00
Patrick McHardy
4401a86200 netfilter: nf_tables: restore chain change atomicity
Chain counter validation is performed after the chain policy has
potentially been changed. Move counter validation/setting before
changing of the chain policy to fix this.

Additionally fix a memory leak if chain counter allocation fails
for new chains, remove an unnecessary free_percpu() and move
counter allocation for new chains

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:13 +01:00
Patrick McHardy
57de2a0cd9 netfilter: nf_tables: split chain policy validation from actually setting it
Currently nf_tables_newchain() atomicity is broken because of having
validation of some netlink attributes performed after changing attributes
of the chain. The chain policy is (currently) fine, but split it up as
preparation for the following fixes and to avoid future mistakes.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:17:13 +01:00
Pablo Neira Ayuso
b38895c577 netfilter: nft_meta: fix lack of validation of the input register
We have to validate that the input register is in the range of
allowed registers, otherwise we can take a incorrect register
value as input that may lead us to a crash.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 20:04:16 +01:00
Kristian Evensen
c4ede3d382 netfilter: nft_ct: Add support to set the connmark
This patch adds kernel support for setting properties of tracked
connections. Currently, only connmark is supported. One use-case
for this feature is to provide the same functionality as
-j CONNMARK --save-mark in iptables.

Some restructuring was needed to implement the set op. The new
structure follows that of nft_meta.

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09 19:07:44 +01:00
John W. Linville
0f74d82d80 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2014-01-09 10:19:01 -05:00
David S. Miller
54b553e2c1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains three Netfilter updates, they are:

* Fix wrong usage of skb_header_pointer in the DCCP protocol helper that
  has been there for quite some time. It was resulting in copying the dccp
  header to a pointer allocated in the stack. Fortunately, this pointer
  provides room for the dccp header is 4 bytes long, so no crashes have been
  reported so far. From Daniel Borkmann.

* Use format string to print in the invocation of nf_log_packet(), again
  in the DCCP helper. Also from Daniel Borkmann.

* Revert "netfilter: avoid get_random_bytes call" as prandom32 does not
  guarantee enough entropy when being calling this at boot time, that may
  happen when reloading the rule.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-08 15:04:56 -05:00
Antonio Quartulli
42cb0bef01 batman-adv: set the isolation mark in the skb if needed
If a broadcast packet is coming from a client marked as
isolated, then mark the skb using the isolation mark so
that netfilter (or any other application) can recognise
them.

The mark is written in the skb based on the mask value:
only bits set in the mask are substitued by those in the
mark value

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:46 +01:00
Antonio Quartulli
eceb22ae0b batman-adv: create helper function to get AP isolation status
The AP isolation status may be evaluated in different spots.
Create an helper function to avoid code duplication.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:45 +01:00
Antonio Quartulli
2d2fcc2a3f batman-adv: extend the ap_isolation mechanism
Change the AP isolation mechanism to not only "isolate" WIFI
clients but also all those marked with the more generic
"isolation flag" (BATADV_TT_CLIENT_ISOLA).

The result is that when AP isolation is on any unicast
packet originated by an "isolated" client and directed to
another "isolated" client is dropped at the source node.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:44 +01:00
Antonio Quartulli
dd24ddb265 batman-adv: print the new BATADV_TT_CLIENT_ISOLA flag
Print the new BATADV_TT_CLIENT_ISOLA flag properly in the
Local and Global Translation Table output.

The character 'I' is used in the flags column to indicate
that the entry is marked as isolated.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:44 +01:00
Antonio Quartulli
9464d07188 batman-adv: mark a local client as isolated when needed
A client sending packets which mark matches the value
configured via sysfs has to be identified as isolated using
the TT_CLIENT_ISOLA flag.

The match is mask based, meaning that only bits set in the
mask are compared with those in the mark value.

If the configured mask is equal to 0 no operation is
performed.

Such flag is then advertised within the classic client
announcement mechanism.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:43 +01:00
Antonio Quartulli
c42edfe382 batman-adv: add isolation_mark sysfs attribute
This attribute can be used to set and read the value and the
mask of the skb mark which will be used to classify the
source non-mesh client as ISOLATED. In this way a client can
be advertised as such and the mark can potentially be
restored at the receiving node before delivering the skb.

This can be helpful for creating network wide netfilter
policies.

This sysfs file expects a string of the shape "$mark/$mask".
Where $mark has to be a 32-bit number in any base, while
$mask must be a 32bit mask expressed in hex base. Only bits
in $mark covered by the bitmask are really stored.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:42 +01:00
Antonio Quartulli
6c413b1c22 batman-adv: send every DHCP packet as bat-unicast
In different situations it is possible that the DHCP server
or client uses broadcast Ethernet frames to send messages
to each other. The GW component in batman-adv takes care of
using bat-unicast packets to bring broadcast DHCP
Discover/Requests to the "best" server.

On the way back the DHCP server usually sends unicasts,
but upon client request it may decide to use broadcasts as
well.

This patch improves the GW component so that it now snoops
and sends as unicast all the DHCP packets, no matter if they
were generated by a DHCP server or client.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:42 +01:00
Antonio Quartulli
36484f84d5 batman-adv: remove parenthesis from return statements
Remove parenthesis around return expression as suggested by
checkpatch.

Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:41 +01:00
Antonio Quartulli
4e820e72db batman-adv: rename gw_deselect() to gw_reselect()
The function batadv_gw_deselect() is actually not deselecting
anything. It is just informing the GW code to perform a
re-election procedure when possible.
The current gateway is not being touched at all and therefore
the name of this function is rather misleading.

Rename it to batadv_gw_reselect() to batadv_gw_reselect()
to make its behaviour easier to grasp.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:41 +01:00
Antonio Quartulli
f316318157 batman-adv: deselect current GW on client mode switch off
When switching from gw_mode client to either off or server
the current selected gateway has to be deselected.
In this way when client mode is enabled again a gateway
re-election is forced and a GW_ADD event is consequently
sent.

The current behaviour instead is to keep the current gateway
leading to no GW_ADD event when gw_mode client is selected
for a second time

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2014-01-08 20:49:40 +01:00
Antonio Quartulli
ebf38fb7ab batman-adv: remove FSF address from GPL disclaimer
As suggested by checkpatch, remove all the references to the
FSF address since the kernel already has one reference in
its documentation.

In this way it is easier to update it in case of future
changes.

Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:39 +01:00
Antonio Quartulli
3fba7325bb batman-adv: don't switch byte order too often if not needed
If possible, operations like ntohs/ntohl should not be
performed too often. Use a variable to locally store the
converted value and then use it.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2014-01-08 20:49:39 +01:00
Antonio Quartulli
a48bcacdb3 batman-adv: properly rename define in distributed arp table header file
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-08 20:49:38 +01:00
John W. Linville
300e5fd160 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next 2014-01-08 13:44:29 -05:00
John W. Linville
2eff7c791a This is the first NFC fixes pull request for 3.13.
It only contains one fix for a regression introduced with commit
 e29a9e2ae1. Without this fix, we can not establish a p2p link in
 target mode. Only initiator mode works.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSywyaAAoJEIqAPN1PVmxK3eMP/2XKZa2qzcMZfWZBLEMXQHce
 UO36BqZF0nre0ZUZQmaXzM5L0PM/UNxvhf2DsLx3s54/Tk/o0wHYSM/7GfX58dkX
 YAYSbG3viQM9L1dRhgfGQwRxXUcd8M85fLUH9SdJeUzCIgxXEDFnlpkeaKPE+UaM
 PgCHRXbW+cDje4DSO5JXVzIRFzsCtAwgd5bx6u/5bM5PLzxGwHJHkHe+lZ9b4Hbe
 ZLsqbU0HRy0rB8hmpro6NcrIoNHEXYupdd1gwslb88jdA+BDOUAZj7htcBBcC9Lw
 7D8RQTI6YNDsOzcLJyWmmfqTKd1j3RWPcTSuWkAGTvy04VOhGEc1LcgOOVLvhsLZ
 Hw412d0lYOiIwtjeIwS1etY42+f7tPOHOuhWFO3EQX0/1fIQ2H9V18DkM2qFPCWT
 L5GwV70YWbYfeRF1i3kGWKN15qLh9/toxB6rgE8eM2vCCGJXbt52yz9oSNM3QvZf
 2rHnzAHCEKGv4xS7oHhJSnkwH3Nd7WR6gjWatbfswrjtaDU2PKlsL703Uxk1gIPz
 o3itmJv/Ej1j6TqUpUz2Vs/sr4dltCDiD7tiaiv8zprHn0LZcNyp1/E+p31pMHBP
 8IW3BXxpBh3S8gpJ932LVAYe40ymQ0m1ZhssddbH1CC90jfs6m/HfaUC+GsgBX/x
 0iaxEH1jwwRIZC9p1Qso
 =YkKJ
 -----END PGP SIGNATURE-----

Merge tag 'nfc-fixes-3.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-fixes

Samuel Ortiz <sameo@linux.intel.com> says:

"This is the first NFC fixes pull request for 3.13.

It only contains one fix for a regression introduced with commit
e29a9e2ae1. Without this fix, we can not establish a p2p link in
target mode. Only initiator mode works."

Signed-off-by: John W. Linville <linville@tuxdriver.com>
2014-01-08 13:36:17 -05:00
Ying Xue
da7c224b1b net: xfrm: xfrm_policy: silence compiler warning
Fix below compiler warning:

net/xfrm/xfrm_policy.c:1644:12: warning: ‘xfrm_dst_alloc_copy’ defined but not used [-Wunused-function]

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 22:45:26 -05:00
Jon Paul Maloy
581465fa28 tipc: make link start event synchronous
When a link is created we delay the start event by launching it
to be executed later in a tasklet. As we hold all the
necessary locks at the moment of creation, and there is no risk
of deadlock or contention, this delay serves no purpose in the
current code.

We remove this obsolete indirection step, and the associated function
link_start(). At the same time, we rename the function tipc_link_stop()
to the more appropriate tipc_link_purge_queues().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 18:44:26 -05:00
Ying Xue
f9a2c80b8b tipc: introduce new spinlock to protect struct link_req
Currently, only 'bearer_lock' is used to protect struct link_req in
the function disc_timeout(). This is unsafe, since the member fields
'num_nodes' and 'timer_intv' might be accessed by below three different
threads simultaneously, none of them grabbing bearer_lock in the
critical region:

link_activate()
  tipc_bearer_add_dest()
    tipc_disc_add_dest()
      req->num_nodes++;

tipc_link_reset()
  tipc_bearer_remove_dest()
    tipc_disc_remove_dest()
      req->num_nodes--
      disc_update()
        read req->num_nodes
	write req->timer_intv

disc_timeout()
  read req->num_nodes
  read/write req->timer_intv

Without lock protection, the only symptom of a race is that discovery
messages occasionally may not be sent out. This is not fatal, since such
messages are best-effort anyway. On the other hand, since discovery
messages are not time critical, adding a protecting lock brings no
serious overhead either. So we add a new, dedicated spinlock in
order to guarantee absolute data consistency in link_req objects.
This also helps reduce the overall role of the bearer_lock, which
we want to remove completely in a later commit series.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 18:44:25 -05:00
Jon Paul Maloy
b9d4c33935 tipc: remove 'has_redundant_link' flag from STATE link protocol messages
The flag 'has_redundant_link' is defined only in RESET and ACTIVATE
protocol messages. Due to an ambiguity in the protocol specification it
is currently also transferred in STATE messages. Its value is used to
initialize a link state variable, 'permit_changeover', which is used
to inhibit futile link failover attempts when it is known that the
peer node has no working links at the moment, although the local node
may still think it has one.

The fact that 'has_redundant_link' incorrectly is read from STATE
messages has the effect that 'permit_changeover' sometimes gets a wrong
value, and permanently blocks any links from being re-established. Such
failures can only occur in in dual-link systems, and are extremely rare.
This bug seems to have always been present in the code.

Furthermore, since commit b4b5610223
("tipc: Ensure both nodes recognize loss of contact between them"),
the 'permit_changeover' field serves no purpose any more. The task of
enforcing 'lost contact' cycles at both peer endpoints is now taken
by a new mechanism, using the flags WAIT_NODE_DOWN and WAIT_PEER_DOWN
in struct tipc_node to abort unnecessary failover attempts.

We therefore remove the 'has_redundant_link' flag from STATE messages,
as well as the now redundant 'permit_changeover' variable.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 18:44:25 -05:00
Jon Paul Maloy
170b3927b4 tipc: rename functions related to link failover and improve comments
The functionality related to link addition and failover is unnecessarily
hard to understand and maintain. We try to improve this by renaming
some of the functions, at the same time adding or improving the
explanatory comments around them. Names such as "tipc_rcv()" etc. also
align better with what is used in other networking components.

The changes in this commit are purely cosmetic, no functional changes
are made.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 18:44:25 -05:00
David S. Miller
a04c0e2c0d Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains two patches:

* fix the IRC NAT helper which was broken when adding (incomplete) IPv6
  support, from Daniel Borkmann.

* Refine the previous bugtrap that Jesper added to catch problems for the
  usage of the sequence adjustment extension in IPVs in Dec 16th, it may
  spam messages in case of finding a real bug.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 18:38:17 -05:00
Daniel Borkmann
be7928d20b net: xfrm: xfrm_policy: fix inline not at beginning of declaration
Fix three warnings related to:

  net/xfrm/xfrm_policy.c:1644:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
  net/xfrm/xfrm_policy.c:1656:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
  net/xfrm/xfrm_policy.c:1668:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]

Just removing the inline keyword is sufficient as the compiler will
decide on its own about inlining or not.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 18:34:00 -05:00
Patrick McHardy
9638f33ecf netfilter: nft_ct: load both IPv4 and IPv6 conntrack modules for NFPROTO_INET
The ct expression can currently not be used in the inet family since
we don't have a conntrack module for NFPROTO_INET, so
nf_ct_l3proto_try_module_get() fails. Add some manual handling to
load the modules for both NFPROTO_IPV4 and NFPROTO_IPV6 if the
ct expression is used in the inet family.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:57:32 +01:00
Patrick McHardy
4566bf2706 netfilter: nft_meta: add l4proto support
For L3-proto independant rules we need to get at the L4 protocol value
directly. Add it to the nft_pktinfo struct and use the meta expression
to retrieve it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:57:31 +01:00
Patrick McHardy
124edfa9e0 netfilter: nf_tables: add nfproto support to meta expression
Needed by multi-family tables to distinguish IPv4 and IPv6 packets.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:57:30 +01:00
Patrick McHardy
1d49144c0a netfilter: nf_tables: add "inet" table for IPv4/IPv6
This patch adds a new table family and a new filter chain that you can
use to attach IPv4 and IPv6 rules. This should help to simplify
rule-set maintainance in dual-stack setups.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:57:25 +01:00
Patrick McHardy
115a60b173 netfilter: nf_tables: add support for multi family tables
Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2014-01-07 23:55:46 +01:00
Patrick McHardy
c9484874e7 netfilter: nf_tables: add hook ops to struct nft_pktinfo
Multi-family tables need the AF from the hook ops. Add a pointer to the
hook ops and replace usage of the hooknum member in struct nft_pktinfo.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:50:43 +01:00
Patrick McHardy
3b088c4bc0 netfilter: nf_tables: make chain types override the default AF functions
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.

Make the AF-specific hook functions the default and make the optional
chain type hooks override them.

As a side effect, the necessary code restructuring reduces the code size,
f.i. in case of nf_tables_ipv4.o:

  nf_tables_ipv4_init_net   |  -24
  nft_do_chain_ipv4         | -113
 2 functions changed, 137 bytes removed, diff: -137

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:50:43 +01:00
Pablo Neira Ayuso
688d18636f netfilter: nft_reject: fix compilation warning if NF_TABLES_IPV6 is disabled
net/netfilter/nft_reject.c: In function 'nft_reject_eval':
net/netfilter/nft_reject.c:37:14: warning: unused variable 'net' [-Wunused-variable]

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:50:43 +01:00
Jerry Chu
bf5a755f5e net-gre-gro: Add GRE support to the GRO stack
This patch built on top of Commit 299603e837
("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
stack. It also serves as an example for supporting other encapsulation
protocols in the GRO stack in the future.

The patch supports version 0 and all the flags (key, csum, seq#) but
will flush any pkt with the S (seq#) flag. This is because the S flag
is not support by GSO, and a GRO pkt may end up in the forwarding path,
thus requiring GSO support to break it up correctly.

Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
be easily added, so is the support for GRE variations like NVGRE.

The patch also support csum offload. Specifically if the csum flag is on
and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
the code will take advantage of the csum computed by the h/w when
validating the GRE csum.

Note that commit 60769a5dcd "ipv4: gre:
add GRO capability" already introduces GRO capability to IPv4 GRE
tunnels, using the gro_cells infrastructure. But GRO is done after
GRE hdr has been removed (i.e., decapped). The following patch applies
GRO when pkts first come in (before hitting the GRE tunnel code). There
is some performance advantage for applying GRO as early as possible.
Also this approach is transparent to other subsystem like Open vSwitch
where GRE decap is handled outside of the IP stack hence making it
harder for the gro_cells stuff to apply. On the other hand, some NICs
are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
In that case the GRO processing of pkts from the same remote host will
all happen on the same CPU and the performance may be suboptimal.

I'm including some rough preliminary performance numbers below. Note
that the performance will be highly dependent on traffic load, mix as
usual. Moreover it also depends on NIC offload features hence the
following is by no means a comprehesive study. Local testing and tuning
will be needed to decide the best setting.

All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
(super_netperf 50 -H 192.168.1.18 -l 30)

An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
is configured.

The GRO support for pkts AFTER decap are controlled through the device
feature of the GRE device (e.g., ethtool -K gre1 gro on/off).

1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
thruput: 9.16Gbps
CPU utilization: 19%

1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
thruput: 5.9Gbps
CPU utilization: 15%

1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 12-13%

1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 10%

The following tests were performed on a different NIC that is capable of
csum offload. I.e., the h/w is capable of computing IP payload csum
(CHECKSUM_COMPLETE).

2.1 ethtool -K gre1 gro on (hence will use gro_cells)

2.1.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 8.53Gbps
CPU utilization: 9%

2.1.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 8.97Gbps
CPU utilization: 7-8%

2.1.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 8.83Gbps
CPU utilization: 5-6%

2.1.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.98Gbps
CPU utilization: 5%

2.2 ethtool -K gre1 gro off

2.2.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 5.93Gbps
CPU utilization: 9%

2.2.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 5.62Gbps
CPU utilization: 8%

2.2.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 7.69Gbps
CPU utilization: 8%

2.2.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.96Gbps
CPU utilization: 5-6%

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 16:21:31 -05:00
Benjamin Poirier
cdb3f4a31b net: Do not enable tx-nocache-copy by default
There are many cases where this feature does not improve performance or even
reduces it.

For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.

1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)

200x netperf -r 1400,1
tx-nocache-copy off
        692000±1000 tps
        50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
        693000±1000 tps
        50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

200x netperf -r 14000,14000
tx-nocache-copy off
        86450±80 tps
        50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
        86110±60 tps
        50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

2) single stream throughput tests
tx-nocache-copy leads to higher service demand

                        throughput  cpu0        cpu1        demand
                        (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)

nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)

tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004

nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)

tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002

As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand

(cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)

lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       4282.648396 task-clock                #    0.423 CPUs utilized
             9,348 context-switches          #    0.002 M/sec
                88 CPU-migrations            #    0.021 K/sec
               355 page-faults               #    0.083 K/sec
    11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
     9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
     4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
     6,053,172,792 instructions              #    0.51  insns per cycle
                                             #    1.49  stalled cycles per insn [83.64%]
       597,275,583 branches                  #  139.464 M/sec                   [83.70%]
         8,960,541 branch-misses             #    1.50% of all branches         [83.65%]

      10.128990264 seconds time elapsed

lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       2847.375441 task-clock                #    0.281 CPUs utilized
            11,632 context-switches          #    0.004 M/sec
                49 CPU-migrations            #    0.017 K/sec
               354 page-faults               #    0.124 K/sec
     7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
     6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
     1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
     2,079,702,453 instructions              #    0.27  insns per cycle
                                             #    2.94  stalled cycles per insn [83.22%]
       363,773,213 branches                  #  127.757 M/sec                   [83.29%]
         4,242,732 branch-misses             #    1.17% of all branches         [83.51%]

      10.128449949 seconds time elapsed

CC: Tom Herbert <therbert@google.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 16:20:19 -05:00
Erik Hugne
732256b933 tipc: correctly unlink packets from deferred packet queue
When we pull a received packet from a link's 'deferred packets' queue
for processing, its 'next' pointer is not cleared, and still refers to
the next packet in that queue, if any. This is incorrect, but caused
no harm before commit 40ba3cdf54 ("tipc:
message reassembly using fragment chain") was introduced. After that
commit, it may sometimes lead to the following oops:

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: tipc
CPU: 4 PID: 0 Comm: swapper/4 Tainted: G        W 3.13.0-rc2+ #6
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
task: ffff880017af4880 ti: ffff880017aee000 task.ti: ffff880017aee000
RIP: 0010:[<ffffffff81710694>]  [<ffffffff81710694>] skb_try_coalesce+0x44/0x3d0
RSP: 0018:ffff880016603a78  EFLAGS: 00010212
RAX: 6b6b6b6bd6d6d6d6 RBX: ffff880013106ac0 RCX: ffff880016603ad0
RDX: ffff880016603ad7 RSI: ffff88001223ed00 RDI: ffff880013106ac0
RBP: ffff880016603ab8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88001223ed00
R13: ffff880016603ad0 R14: 000000000000058c R15: ffff880012297650
FS:  0000000000000000(0000) GS:ffff880016600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000805b000 CR3: 0000000011f5d000 CR4: 00000000000006e0
Stack:
 ffff880016603a88 ffffffff810a38ed ffff880016603aa8 ffff88001223ed00
 0000000000000001 ffff880012297648 ffff880016603b68 ffff880012297650
 ffff880016603b08 ffffffffa0006c51 ffff880016603b08 00ffffffa00005fc
Call Trace:
 <IRQ>
 [<ffffffff810a38ed>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa0006c51>] tipc_link_recv_fragment+0xd1/0x1b0 [tipc]
 [<ffffffffa0007214>] tipc_recv_msg+0x4e4/0x920 [tipc]
 [<ffffffffa00016f0>] ? tipc_l2_rcv_msg+0x40/0x250 [tipc]
 [<ffffffffa000177c>] tipc_l2_rcv_msg+0xcc/0x250 [tipc]
 [<ffffffffa00016f0>] ? tipc_l2_rcv_msg+0x40/0x250 [tipc]
 [<ffffffff8171e65b>] __netif_receive_skb_core+0x80b/0xd00
 [<ffffffff8171df94>] ? __netif_receive_skb_core+0x144/0xd00
 [<ffffffff8171eb76>] __netif_receive_skb+0x26/0x70
 [<ffffffff8171ed6d>] netif_receive_skb+0x2d/0x200
 [<ffffffff8171fe70>] napi_gro_receive+0xb0/0x130
 [<ffffffff815647c2>] e1000_clean_rx_irq+0x2c2/0x530
 [<ffffffff81565986>] e1000_clean+0x266/0x9c0
 [<ffffffff81985f7b>] ? notifier_call_chain+0x2b/0x160
 [<ffffffff8171f971>] net_rx_action+0x141/0x310
 [<ffffffff81051c1b>] __do_softirq+0xeb/0x480
 [<ffffffff819817bb>] ? _raw_spin_unlock+0x2b/0x40
 [<ffffffff810b8c42>] ? handle_fasteoi_irq+0x72/0x100
 [<ffffffff81052346>] irq_exit+0x96/0xc0
 [<ffffffff8198cbc3>] do_IRQ+0x63/0xe0
 [<ffffffff81981def>] common_interrupt+0x6f/0x6f
 <EOI>

This happens when the last fragment of a message has passed through the
the receiving link's 'deferred packets' queue, and at least one other
packet was added to that queue while it was there. After the fragment
chain with the complete message has been successfully delivered to the
receiving socket, it is released. Since 'next' pointer of the last
fragment in the released chain now is non-NULL, we get the crash shown
above.

We fix this by clearing the 'next' pointer of all received packets,
including those being pulled from the 'deferred' queue, before they
undergo any further processing.

Fixes: 40ba3cdf54 ("tipc: message reassembly using fragment chain")
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 16:15:24 -05:00
Jiri Pirko
dfd1582d1e ipv4: loopback device: ignore value changes after device is upped
When lo is brought up, new ifa is created. Then, devconf and neigh values
bitfield should be set so later changes of default values would not
affect lo values.

Note that the same behaviour is in ipv6. Also note that this is likely
not an issue in many distros (for example Fedora 19) because userspace
sets address to lo manually before bringing it up.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 15:55:17 -05:00
FX Le Bail
509aba3b0d IPv6: add the option to use anycast addresses as source addresses in echo reply
This change allows to follow a recommandation of RFC4942.

- Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
  as source addresses for ICMPv6 echo reply. This sysctl is false by default
  to preserve existing behavior.
- Add inline check ipv6_anycast_destination().
- Use them in icmpv6_echo_reply().

Reference:
RFC4942 - IPv6 Transition/Coexistence Security Considerations
   (http://tools.ietf.org/html/rfc4942#section-2.1.6)

2.1.6. Anycast Traffic Identification and Security

   [...]
   To avoid exposing knowledge about the internal structure of the
   network, it is recommended that anycast servers now take advantage of
   the ability to return responses with the anycast address as the
   source address if possible.

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 15:51:39 -05:00
Li RongQing
657e5d1965 ipv6: pcpu_tstats.syncp should be initialised in ip6_vti.c
initialise pcpu_tstats.syncp to kill the calltrace
[   11.973950] Call Trace:
[   11.973950]  [<819bbaff>] dump_stack+0x48/0x60
[   11.973950]  [<819bbaff>] dump_stack+0x48/0x60
[   11.973950]  [<81078dcf>] __lock_acquire.isra.22+0x1bf/0xc10
[   11.973950]  [<81078dcf>] __lock_acquire.isra.22+0x1bf/0xc10
[   11.973950]  [<81079fa7>] lock_acquire+0x77/0xa0
[   11.973950]  [<81079fa7>] lock_acquire+0x77/0xa0
[   11.973950]  [<817ca7ab>] ? dev_get_stats+0xcb/0x130
[   11.973950]  [<817ca7ab>] ? dev_get_stats+0xcb/0x130
[   11.973950]  [<8183862d>] ip_tunnel_get_stats64+0x6d/0x230
[   11.973950]  [<8183862d>] ip_tunnel_get_stats64+0x6d/0x230
[   11.973950]  [<817ca7ab>] ? dev_get_stats+0xcb/0x130
[   11.973950]  [<817ca7ab>] ? dev_get_stats+0xcb/0x130
[   11.973950]  [<811cf8c1>] ? __nla_reserve+0x21/0xd0
[   11.973950]  [<811cf8c1>] ? __nla_reserve+0x21/0xd0
[   11.973950]  [<817ca7ab>] dev_get_stats+0xcb/0x130
[   11.973950]  [<817ca7ab>] dev_get_stats+0xcb/0x130
[   11.973950]  [<817d5409>] rtnl_fill_ifinfo+0x569/0xe20
[   11.973950]  [<817d5409>] rtnl_fill_ifinfo+0x569/0xe20
[   11.973950]  [<810352e0>] ? kvm_clock_read+0x20/0x30
[   11.973950]  [<810352e0>] ? kvm_clock_read+0x20/0x30
[   11.973950]  [<81008e38>] ? sched_clock+0x8/0x10
[   11.973950]  [<81008e38>] ? sched_clock+0x8/0x10
[   11.973950]  [<8106ba45>] ? sched_clock_local+0x25/0x170
[   11.973950]  [<8106ba45>] ? sched_clock_local+0x25/0x170
[   11.973950]  [<810da6bd>] ? __kmalloc+0x3d/0x90
[   11.973950]  [<810da6bd>] ? __kmalloc+0x3d/0x90
[   11.973950]  [<817b8c10>] ? __kmalloc_reserve.isra.41+0x20/0x70
[   11.973950]  [<817b8c10>] ? __kmalloc_reserve.isra.41+0x20/0x70
[   11.973950]  [<810da81a>] ? slob_alloc_node+0x2a/0x60
[   11.973950]  [<810da81a>] ? slob_alloc_node+0x2a/0x60
[   11.973950]  [<817b919a>] ? __alloc_skb+0x6a/0x2b0
[   11.973950]  [<817b919a>] ? __alloc_skb+0x6a/0x2b0
[   11.973950]  [<817d8795>] rtmsg_ifinfo+0x65/0xe0
[   11.973950]  [<817d8795>] rtmsg_ifinfo+0x65/0xe0
[   11.973950]  [<817cbd31>] register_netdevice+0x531/0x5a0
[   11.973950]  [<817cbd31>] register_netdevice+0x531/0x5a0
[   11.973950]  [<81892b87>] ? ip6_tnl_get_cap+0x27/0x90
[   11.973950]  [<81892b87>] ? ip6_tnl_get_cap+0x27/0x90
[   11.973950]  [<817cbdb6>] register_netdev+0x16/0x30
[   11.973950]  [<817cbdb6>] register_netdev+0x16/0x30
[   11.973950]  [<81f574a6>] vti6_init_net+0x1c4/0x1d4
[   11.973950]  [<81f574a6>] vti6_init_net+0x1c4/0x1d4
[   11.973950]  [<81f573af>] ? vti6_init_net+0xcd/0x1d4
[   11.973950]  [<81f573af>] ? vti6_init_net+0xcd/0x1d4
[   11.973950]  [<817c16df>] ops_init.constprop.11+0x17f/0x1c0
[   11.973950]  [<817c16df>] ops_init.constprop.11+0x17f/0x1c0
[   11.973950]  [<817c1779>] register_pernet_operations.isra.9+0x59/0x90
[   11.973950]  [<817c1779>] register_pernet_operations.isra.9+0x59/0x90
[   11.973950]  [<817c18d1>] register_pernet_device+0x21/0x60
[   11.973950]  [<817c18d1>] register_pernet_device+0x21/0x60
[   11.973950]  [<81f574b6>] ? vti6_init_net+0x1d4/0x1d4
[   11.973950]  [<81f574b6>] ? vti6_init_net+0x1d4/0x1d4
[   11.973950]  [<81f574c7>] vti6_tunnel_init+0x11/0x68
[   11.973950]  [<81f574c7>] vti6_tunnel_init+0x11/0x68
[   11.973950]  [<81f572a1>] ? mip6_init+0x73/0xb4
[   11.973950]  [<81f572a1>] ? mip6_init+0x73/0xb4
[   11.973950]  [<81f0cba4>] do_one_initcall+0xbb/0x15b
[   11.973950]  [<81f0cba4>] do_one_initcall+0xbb/0x15b
[   11.973950]  [<811a00d8>] ? sha_transform+0x528/0x1150
[   11.973950]  [<811a00d8>] ? sha_transform+0x528/0x1150
[   11.973950]  [<81f0c544>] ? repair_env_string+0x12/0x51
[   11.973950]  [<81f0c544>] ? repair_env_string+0x12/0x51
[   11.973950]  [<8105c30d>] ? parse_args+0x2ad/0x440
[   11.973950]  [<8105c30d>] ? parse_args+0x2ad/0x440
[   11.973950]  [<810546be>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[   11.973950]  [<810546be>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[   11.973950]  [<81f0cd27>] kernel_init_freeable+0xe3/0x182
[   11.973950]  [<81f0cd27>] kernel_init_freeable+0xe3/0x182
[   11.973950]  [<81f0c532>] ? do_early_param+0x7a/0x7a
[   11.973950]  [<81f0c532>] ? do_early_param+0x7a/0x7a
[   11.973950]  [<819b5b1b>] kernel_init+0xb/0x100
[   11.973950]  [<819b5b1b>] kernel_init+0xb/0x100
[   11.973950]  [<819cebf7>] ret_from_kernel_thread+0x1b/0x28
[   11.973950]  [<819cebf7>] ret_from_kernel_thread+0x1b/0x28
[   11.973950]  [<819b5b10>] ? rest_init+0xc0/0xc0
[   11.973950]  [<819b5b10>] ? rest_init+0xc0/0xc0

Before 469bdcefdc ("ipv6: fix the use of pcpu_tstats in ip6_vti.c"),
the pcpu_tstats.syncp is not used to pretect the 64bit elements of
pcpu_tstats, so not appear this calltrace.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 14:12:46 -05:00
Claudio Takahasi
e825eb1d7e Bluetooth: Fix 6loWPAN peer lookup
This patch fixes peer address lookup for 6loWPAN over Bluetooth Low
Energy links.

ADDR_LE_DEV_PUBLIC, and ADDR_LE_DEV_RANDOM are the values allowed for
"dst_type" field in the hci_conn struct for LE links.

Signed-off-by: Claudio Takahasi <claudio.takahasi@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
2014-01-07 11:32:15 -02:00
Claudio Takahasi
b071a62099 Bluetooth: Fix setting Universal/Local bit
This patch fixes the Bluetooth Low Energy Address type checking when
setting Universal/Local bit for the 6loWPAN network device or for the
peer device connection.

ADDR_LE_DEV_PUBLIC or ADDR_LE_DEV_RANDOM are the values allowed for
"src_type" and "dst_type" in the hci_conn struct. The Bluetooth link
type can be obtainned reading the "type" field in the same struct.

Signed-off-by: Claudio Takahasi <claudio.takahasi@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
2014-01-07 11:32:11 -02:00
Eric Dumazet
438e38fadc gre_offload: statically build GRE offloading support
GRO/GSO layers can be enabled on a node, even if said
node is only forwarding packets.

This patch permits GSO (and upcoming GRO) support for GRE
encapsulated packets, even if the host has no GRE tunnel setup.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 20:28:34 -05:00
David S. Miller
39b6b2992f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:

====================
[GIT net-next] Open vSwitch

Open vSwitch changes for net-next/3.14. Highlights are:
 * Performance improvements in the mechanism to get packets to userspace
   using memory mapped netlink and skb zero copy where appropriate.
 * Per-cpu flow stats in situations where flows are likely to be shared
   across CPUs. Standard flow stats are used in other situations to save
   memory and allocation time.
 * A handful of code cleanups and rationalization.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 19:48:38 -05:00
Stephen Hemminger
443cd88c8a ovs: make functions local
Several functions and datastructures could be local
Found with 'make namespacecheck'

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:54:39 -08:00
Thomas Graf
09c5e6054e openvswitch: Compute checksum in skb_gso_segment() if needed
The copy & csum optimization is no longer present with zerocopy
enabled. Compute the checksum in skb_gso_segment() directly by
dropping the HW CSUM capability from the features passed in.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:24 -08:00
Thomas Graf
bda56f143c openvswitch: Use skb_zerocopy() for upcall
Use of skb_zerocopy() can avoid the expensive call to memcpy()
when copying the packet data into the Netlink skb. Completes
checksum through skb_checksum_help() if not already done in
GSO segmentation.

Zerocopy is only performed if user space supported unaligned
Netlink messages. memory mapped netlink i/o is preferred over
zerocopy if it is set up.

Cost of upcall is significantly reduced from:
+   7.48%       vhost-8471  [k] memcpy
+   5.57%     ovs-vswitchd  [k] memcpy
+   2.81%       vhost-8471  [k] csum_partial_copy_generic

to:
+   5.72%     ovs-vswitchd  [k] memcpy
+   3.32%       vhost-5153  [k] memcpy
+   0.68%       vhost-5153  [k] skb_zerocopy

(megaflows disabled)

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:17 -08:00
Thomas Graf
8055a89cfa openvswitch: Pass datapath into userspace queue functions
Allows removing the net and dp_ifindex argument and simplify the
code.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:07 -08:00
Thomas Graf
44da5ae5fb openvswitch: Drop user features if old user space attempted to create datapath
Drop user features if an outdated user space instance that does not
understand the concept of user_features attempted to create a new
datapath.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:00 -08:00
Thomas Graf
43d4be9cb5 openvswitch: Allow user space to announce ability to accept unaligned Netlink messages
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:53 -08:00
Thomas Graf
af2806f8f9 net: Export skb_zerocopy() to zerocopy from one skb to another
Make the skb zerocopy logic written for nfnetlink queue available for
use by other modules.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:42 -08:00
Wei Yongjun
5f03f47c9c openvswitch: remove duplicated include from flow_table.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:35 -08:00
Daniel Borkmann
11d6c461b3 net: ovs: use kfree_rcu instead of rcu_free_{sw_flow_mask_cb,acts_callback}
As we're only doing a kfree() anyway in the RCU callback, we can
simply use kfree_rcu, which does the same job, and remove the
function rcu_free_sw_flow_mask_cb() and rcu_free_acts_callback().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:30 -08:00
Pravin B Shelar
e298e50570 openvswitch: Per cpu flow stats.
With mega flow implementation ovs flow can be shared between
multiple CPUs which makes stats updates highly contended
operation. This patch uses per-CPU stats in cases where a flow
is likely to be shared (if there is a wildcard in the 5-tuple
and therefore likely to be spread by RSS). In other situations,
it uses the current strategy, saving memory and allocation time.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:24 -08:00
Thomas Graf
795449d8b8 openvswitch: Enable memory mapped Netlink i/o
Use memory mapped Netlink i/o for all unicast openvswitch
communication if a ring has been set up.

Benchmark
  * pktgen -> ovs internal port
  * 5M pkts, 5M flows
  * 4 threads, 8 cores

Before:
Result: OK: 67418743(c67108212+d310530) usec, 5000000 (9000byte,0frags)
  74163pps 5339Mb/sec (5339736000bps) errors: 0
	+   2.98%     ovs-vswitchd  [k] copy_user_generic_string
	+   2.49%     ovs-vswitchd  [k] memcpy
	+   1.84%       kpktgend_2  [k] memcpy
	+   1.81%       kpktgend_1  [k] memcpy
	+   1.81%       kpktgend_3  [k] memcpy
	+   1.78%       kpktgend_0  [k] memcpy

After:
Result: OK: 24229690(c24127165+d102524) usec, 5000000 (9000byte,0frags)
  206358pps 14857Mb/sec (14857776000bps) errors: 0
	+   2.80%     ovs-vswitchd  [k] memcpy
	+   1.31%       kpktgend_2  [k] memcpy
	+   1.23%       kpktgend_0  [k] memcpy
	+   1.09%       kpktgend_1  [k] memcpy
	+   1.04%       kpktgend_3  [k] memcpy
	+   0.96%     ovs-vswitchd  [k] copy_user_generic_string

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:12 -08:00
Thomas Graf
aae9f0e22c netlink: Avoid netlink mmap alloc if msg size exceeds frame size
An insufficent ring frame size configuration can lead to an
unnecessary skb allocation for every Netlink message. Check frame
size before taking the queue lock and allocating the skb and
re-check with lock to be safe.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:06 -08:00
Thomas Graf
bb9b18fb55 genl: Add genlmsg_new_unicast() for unicast message allocation
Allocates a new sk_buff large enough to cover the specified payload
plus required Netlink headers. Will check receiving socket for
memory mapped i/o capability and use it if enabled. Will fall back
to non-mapped skb if message size exceeds the frame size of the ring.

Signed-of-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:53 -08:00
Jesse Gross
663efa3696 openvswitch: Silence RCU lockdep checks from flow lookup.
Flow lookup can happen either in packet processing context or userspace
context but it was annotated as requiring RCU read lock to be held. This
also allows OVS mutex to be held without causing warnings.

Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
2014-01-06 15:51:48 -08:00
Andy Zhou
5bb506324d openvswitch: Change ovs_flow_tbl_lookup_xx() APIs
API changes only for code readability. No functional chnages.

This patch removes the underscored version. Added a new API
ovs_flow_tbl_lookup_stats() that returns the n_mask_hits.

Reported by: Ben Pfaff <blp@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:41 -08:00
Ben Pfaff
8f49ce1135 openvswitch: Shrink sw_flow_mask by 8 bytes (64-bit) or 4 bytes (32-bit).
We won't normally have a ton of flow masks but using a size_t to store
values no bigger than sizeof(struct sw_flow_key) seems excessive.

This reduces sw_flow_key_range and sw_flow_mask by 4 bytes on 32-bit
systems.  On 64-bit systems it shrinks sw_flow_key_range by 12 bytes but
sw_flow_mask only by 8 bytes due to padding.

Compile tested only.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:27 -08:00
Ben Pfaff
d1211908b9 openvswitch: Correct comment.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:21 -08:00
David S. Miller
56a4342dfe Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
	net/ipv6/ip6_tunnel.c
	net/ipv6/ip6_vti.c

ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.

qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 17:37:45 -05:00
Gianluca Anzolin
f86772af6a Bluetooth: Remove rfcomm_carrier_raised()
Remove the rfcomm_carrier_raised() definition as that function isn't
used anymore.

Signed-off-by: Gianluca Anzolin <gianluca@sottospazio.it>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-01-06 13:51:45 -08:00
Gianluca Anzolin
4a2fb3ecc7 Bluetooth: Always wait for a connection on RFCOMM open()
This patch fixes two regressions introduced with the recent rfcomm tty
rework.

The current code uses the carrier_raised() method to wait for the
bluetooth connection when a process opens the tty.

However processes may open the port with the O_NONBLOCK flag or set the
CLOCAL termios flag: in these cases the open() syscall returns
immediately without waiting for the bluetooth connection to
complete.

This behaviour confuses userspace which expects an established bluetooth
connection.

The patch restores the old behaviour by waiting for the connection in
rfcomm_dev_activate() and removes carrier_raised() from the tty_port ops.

As a side effect the new code also fixes the case in which the rfcomm
tty device is created with the flag RFCOMM_REUSE_DLC: the old code
didn't call device_move() and ModemManager skipped the detection
probe. Now device_move() is always called inside rfcomm_dev_activate().

Signed-off-by: Gianluca Anzolin <gianluca@sottospazio.it>
Reported-by: Andrey Vihrov <andrey.vihrov@gmail.com>
Reported-by: Beson Chow <blc+bluez@mail.vanade.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-01-06 13:51:45 -08:00
Gianluca Anzolin
e228b63390 Bluetooth: Move rfcomm_get_device() before rfcomm_dev_activate()
This is a preparatory patch which moves the rfcomm_get_device()
definition before rfcomm_dev_activate() where it will be used.

Signed-off-by: Gianluca Anzolin <gianluca@sottospazio.it>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-01-06 13:51:45 -08:00
Gianluca Anzolin
5b89924187 Bluetooth: Release RFCOMM port when the last user closes the TTY
This patch fixes a userspace regression introduced by the commit
29cd718b.

If the rfcomm device was created with the flag RFCOMM_RELEASE_ONHUP the
user space expects that the tty_port is released as soon as the last
process closes the tty.

The current code attempts to release the port in the function
rfcomm_dev_state_change(). However it won't get a reference to the
relevant tty to send a HUP: at that point the tty is already destroyed
and therefore NULL.

This patch fixes the regression by taking over the tty refcount in the
tty install method(). This way the tty_port is automatically released as
soon as the tty is destroyed.

As a consequence the check for RFCOMM_RELEASE_ONHUP flag in the hangup()
method is now redundant. Instead we have to be careful with the reference
counting in the rfcomm_release_dev() function.

Signed-off-by: Gianluca Anzolin <gianluca@sottospazio.it>
Reported-by: Alexander Holler <holler@ahsoftware.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-01-06 13:51:45 -08:00
Jamal Hadi Salim
805c1f4aed net_sched: act: action flushing missaccounting
action flushing missaccounting
Account only for deleted actions

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:46:56 -05:00
Jamal Hadi Salim
63acd6807c net_sched: Remove unnecessary checks for act->ops
Remove unnecessary checks for act->ops
(suggested by Eric Dumazet).

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:46:32 -05:00
sfeldma@cumulusnetworks.com
fbf2671bb8 bridge: use DEVICE_ATTR_xx macros
Use DEVICE_ATTR_RO/RW macros to simplify bridge sysfs attribute definitions.

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:40:46 -05:00
Curt Brune
fe0d692bbc bridge: use spin_lock_bh() in br_multicast_set_hash_max
br_multicast_set_hash_max() is called from process context in
net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function.

br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock),
which can deadlock the CPU if a softirq that also tries to take the
same lock interrupts br_multicast_set_hash_max() while the lock is
held .  This can happen quite easily when any of the bridge multicast
timers expire, which try to take the same lock.

The fix here is to use spin_lock_bh(), preventing other softirqs from
executing on this CPU.

Steps to reproduce:

1. Create a bridge with several interfaces (I used 4).
2. Set the "multicast query interval" to a low number, like 2.
3. Enable the bridge as a multicast querier.
4. Repeatedly set the bridge hash_max parameter via sysfs.

  # brctl addbr br0
  # brctl addif br0 eth1 eth2 eth3 eth4
  # brctl setmcqi br0 2
  # brctl setmcquerier br0 1

  # while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done

Signed-off-by: Curt Brune <curt@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:39:47 -05:00
Eric Dumazet
996b175e39 tcp: out_of_order_queue do not use its lock
TCP out_of_order_queue lock is not used, as queue manipulation
happens with socket lock held and we therefore use the lockless
skb queue routines (as __skb_queue_head())

We can use __skb_queue_head_init() instead of skb_queue_head_init()
to make this more consistent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:34:34 -05:00
Hannes Frederic Sowa
88ad31491e ipv6: don't install anycast address for /128 addresses on routers
It does not make sense to create an anycast address for an /128-prefix.
Suppress it.

As 32019e651c ("ipv6: Do not leave router anycast address for /127
prefixes.") shows we also may not leave them, because we could accidentally
remove an anycast address the user has allocated or got added via another
prefix.

Cc: François-Xavier Le Bail <fx.lebail@yahoo.com>
Cc: Thomas Haller <thaller@redhat.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:32:43 -05:00
Vijay Subramanian
d4b36210c2 net: pkt_sched: PIE AQM scheme
Proportional Integral controller Enhanced (PIE) is a scheduler to address the
bufferbloat problem.

>From the IETF draft below:
" Bufferbloat is a phenomenon where excess buffers in the network cause high
latency and jitter. As more and more interactive applications (e.g. voice over
IP, real time video streaming and financial transactions) run in the Internet,
high latency and jitter degrade application performance. There is a pressing
need to design intelligent queue management schemes that can control latency and
jitter; and hence provide desirable quality of service to users.

We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software.  "

Many thanks to Dave Taht for extensive feedback, reviews, testing and
suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
suggestions.  Naeem Khademi and Dave Taht independently contributed to ECN
support.

For more information, please see technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.

Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00

All relevant code, documents and test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.

For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu
(mysuryan@cisco.com)

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 15:13:01 -05:00
John W. Linville
a50f9d5ee0 Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 2014-01-06 14:19:18 -05:00
John W. Linville
9d1cd503c7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless 2014-01-06 14:08:41 -05:00
David S. Miller
83111e7fe8 netfilter: Fix build failure in nfnetlink_queue_core.c.
net/netfilter/nfnetlink_queue_core.c: In function 'nfqnl_put_sk_uidgid':
net/netfilter/nfnetlink_queue_core.c:304:35: error: 'TCP_TIME_WAIT' undeclared (first use in this function)
net/netfilter/nfnetlink_queue_core.c:304:35: note: each undeclared identifier is reported only once for each function it appears in
make[3]: *** [net/netfilter/nfnetlink_queue_core.o] Error 1

Just a missing include of net/tcp_states.h

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:36:06 -05:00
David S. Miller
9aa28f2b71 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables
Pablo Neira Ayuso says: <pablo@netfilter.org>

====================
nftables updates for net-next

The following patchset contains nftables updates for your net-next tree,
they are:

* Add set operation to the meta expression by means of the select_ops()
  infrastructure, this allows us to set the packet mark among other things.
  From Arturo Borrero Gonzalez.

* Fix wrong format in sscanf in nf_tables_set_alloc_name(), from Daniel
  Borkmann.

* Add new queue expression to nf_tables. These comes with two previous patches
  to prepare this new feature, one to add mask in nf_tables_core to
  evaluate the queue verdict appropriately and another to refactor common
  code with xt_NFQUEUE, from Eric Leblond.

* Do not hide nftables from Kconfig if nfnetlink is not enabled, also from
  Eric Leblond.

* Add the reject expression to nf_tables, this adds the missing TCP RST
  support. It comes with an initial patch to refactor common code with
  xt_NFQUEUE, again from Eric Leblond.

* Remove an unused variable assignment in nf_tables_dump_set(), from Michal
  Nazarewicz.

* Remove the nft_meta_target code, now that Arturo added the set operation
  to the meta expression, from me.

* Add help information for nf_tables to Kconfig, also from me.

* Allow to dump all sets by specifying NFPROTO_UNSPEC, similar feature is
  available to other nf_tables objects, requested by Arturo, from me.

* Expose the table usage counter, so we can know how many chains are using
  this table without dumping the list of chains, from Tomasz Bursztyka.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:29:30 -05:00
Johan Hedberg
cb6ca8e1ed Bluetooth: Default to no security with L2CAP RAW sockets
L2CAP RAW sockets can be used for things which do not involve
establishing actual connection oriented L2CAP channels. One example of
such usage is the l2ping tool. The default security level for L2CAP
sockets is LOW, which implies that for SSP based connection
authentication is still requested (although with no MITM requirement),
which is not what we want (or need) for things like l2ping. Therefore,
default to one lower level, i.e. BT_SECURITY_SDP, for L2CAP RAW sockets
in order not to trigger unwanted authentication requests.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-01-06 09:26:23 -08:00
Johan Hedberg
8cef8f50d4 Bluetooth: Fix NULL pointer dereference when disconnecting
When disconnecting it is possible that the l2cap_conn pointer is already
NULL when bt_6lowpan_del_conn() is entered. Looking at l2cap_conn_del
also verifies this as there's a NULL check there too. This patch adds
the missing NULL check without which the following bug may occur:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<c131e9c7>] bt_6lowpan_del_conn+0x19/0x12a
*pde = 00000000
Oops: 0000 [#1] SMP
CPU: 1 PID: 52 Comm: kworker/u5:1 Not tainted 3.12.0+ #196
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: hci0 hci_rx_work
task: f6259b00 ti: f48c0000 task.ti: f48c0000
EIP: 0060:[<c131e9c7>] EFLAGS: 00010282 CPU: 1
EIP is at bt_6lowpan_del_conn+0x19/0x12a
EAX: 00000000 EBX: ef094e10 ECX: 00000000 EDX: 00000016
ESI: 00000000 EDI: f48c1e60 EBP: f48c1e50 ESP: f48c1e34
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 00000000 CR3: 30c65000 CR4: 00000690
Stack:
 f4d38000 00000000 f4d38000 00000002 ef094e10 00000016 f48c1e60 f48c1e70
 c1316bed f48c1e84 c1316bed 00000000 00000001 ef094e10 f48c1e84 f48c1ed0
 c1303cc6 c1303c7b f31f331a c1303cc6 f6e7d1c0 f3f8ea16 f3f8f380 f4d38008
Call Trace:
 [<c1316bed>] l2cap_disconn_cfm+0x3f/0x5b
 [<c1316bed>] ? l2cap_disconn_cfm+0x3f/0x5b
 [<c1303cc6>] hci_event_packet+0x645/0x2117
 [<c1303c7b>] ? hci_event_packet+0x5fa/0x2117
 [<c1303cc6>] ? hci_event_packet+0x645/0x2117
 [<c12681bd>] ? __kfree_skb+0x65/0x68
 [<c12681eb>] ? kfree_skb+0x2b/0x2e
 [<c130d3fb>] ? hci_send_to_sock+0x18d/0x199
 [<c12fa327>] hci_rx_work+0xf9/0x295
 [<c12fa327>] ? hci_rx_work+0xf9/0x295
 [<c1036d25>] process_one_work+0x128/0x1df
 [<c1346a39>] ? _raw_spin_unlock_irq+0x8/0x12
 [<c1036d25>] ? process_one_work+0x128/0x1df
 [<c103713a>] worker_thread+0x127/0x1c4
 [<c1037013>] ? rescuer_thread+0x216/0x216
 [<c103aec6>] kthread+0x88/0x8d
 [<c1040000>] ? task_rq_lock+0x37/0x6e
 [<c13474b7>] ret_from_kernel_thread+0x1b/0x28
 [<c103ae3e>] ? __kthread_parkme+0x50/0x50
Code: 05 b8 f4 ff ff ff 8d 65 f4 5b 5e 5f 5d 8d 67 f8 5f c3 57 8d 7c 24 08 83 e4 f8 ff 77 fc 55 89 e5 57 56f
EIP: [<c131e9c7>] bt_6lowpan_del_conn+0x19/0x12a SS:ESP 0068:f48c1e34
CR2: 0000000000000000

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-01-06 09:26:23 -08:00
Daniel Borkmann
b22f5126a2 netfilter: nf_conntrack_dccp: fix skb_header_pointer API usages
Some occurences in the netfilter tree use skb_header_pointer() in
the following way ...

  struct dccp_hdr _dh, *dh;
  ...
  skb_header_pointer(skb, dataoff, sizeof(_dh), &dh);

... where dh itself is a pointer that is being passed as the copy
buffer. Instead, we need to use &_dh as the forth argument so that
we're copying the data into an actual buffer that sits on the stack.

Currently, we probably could overwrite memory on the stack (e.g.
with a possibly mal-formed DCCP packet), but unintentionally, as
we only want the buffer to be placed into _dh variable.

Fixes: 2bc780499a ("[NETFILTER]: nf_conntrack: add DCCP protocol support")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-06 17:40:02 +01:00
Jesper Dangaard Brouer
f2661adc0c netfilter: only warn once on wrong seqadj usage
Avoid potentially spamming the kernel log with WARN splash messages
when catching wrong usage of seqadj, by simply using WARN_ONCE.

This is a followup to commit db12cf2743 (netfilter: WARN about
wrong usage of sequence number adjustments)

Suggested-by: Flavio Leitner <fbl@redhat.com>
Suggested-by: Daniel Borkmann <dborkman@redhat.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-06 14:23:17 +01:00
Daniel Borkmann
138aef7dca netfilter: nf_conntrack_dccp: use %s format string for buffer
Some invocations of nf_log_packet() use arg buffer directly instead of
"%s" format string with follow-up buffer pointer. Currently, these two
usages are not really critical, but we should fix this up nevertheless
so that we don't run into trouble if that changes one day.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-06 14:19:16 +01:00
Daniel Borkmann
2690d97ade netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
Commit 5901b6be88 attempted to introduce IPv6 support into
IRC NAT helper. By doing so, the following code seemed to be removed
by accident:

  ip = ntohl(exp->master->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u3.ip);
  sprintf(buffer, "%u %u", ip, port);
  pr_debug("nf_nat_irc: inserting '%s' == %pI4, port %u\n", buffer, &ip, port);

This leads to the fact that buffer[] was left uninitialized and
contained some stack value. When we call nf_nat_mangle_tcp_packet(),
we call strlen(buffer) on excatly this uninitialized buffer. If we
are unlucky and the skb has enough tailroom, we overwrite resp. leak
contents with values that sit on our stack into the packet and send
that out to the receiver.

Since the rather informal DCC spec [1] does not seem to specify
IPv6 support right now, we log such occurences so that admins can
act accordingly, and drop the packet. I've looked into XChat source,
and IPv6 is not supported there: addresses are in u32 and print
via %u format string.

Therefore, restore old behaviour as in IPv4, use snprintf(). The
IRC helper does not support IPv6 by now. By this, we can safely use
strlen(buffer) in nf_nat_mangle_tcp_packet() and prevent a buffer
overflow. Also simplify some code as we now have ct variable anyway.

  [1] http://www.irchelp.org/irchelp/rfc/ctcpspec.html

Fixes: 5901b6be88 ("netfilter: nf_nat: support IPv6 in IRC NAT helper")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Harald Welte <laforge@gnumonks.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-06 14:17:17 +01:00
Pablo Neira Ayuso
2a50d805e5 Revert "netfilter: avoid get_random_bytes calls"
This reverts commit a42b99a6e3.

Hannes Frederic Sowa reported some problems with this patch, more specifically
that prandom_u32() may not be ready at boot time, see:

http://marc.info/?l=linux-netdev&m=138896532403533&w=2

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-06 14:00:55 +01:00
Fengguang Wu
5537a0557c pktgen_dst_metrics[] can be static
CC: Fan Du <fan.du@windriver.com>
CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-06 11:00:07 +01:00
Daniel Borkmann
a48d4bb0b0 net: netdev_kobject_init: annotate with __init
netdev_kobject_init() is only being called from __init context,
that is, net_dev_init(), so annotate it with __init as well, thus
the kernel can take this as a hint that the function is used only
during the initialization phase and free up used memory resources
after its invocation.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05 20:27:54 -05:00
Daniel Borkmann
965801e1eb net: 6lowpan: fix lowpan_header_create non-compression memcpy call
In function lowpan_header_create(), we invoke the following code
construct:

  struct ipv6hdr *hdr;
  ...
  hdr = ipv6_hdr(skb);
  ...
  if (...)
    memcpy(hc06_ptr + 1, &hdr->flow_lbl[1], 2);
  else
    memcpy(hc06_ptr, &hdr, 4);

Where the else path of the condition, that is, non-compression
path, calls memcpy() with a pointer to struct ipv6hdr *hdr as
source, thus two levels of indirection. This cannot be correct,
and likely only one level of pointer was intended as source
buffer for memcpy() here.

Fixes: 44331fe2aa ("IEEE802.15.4: 6LoWPAN basic support")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05 20:25:24 -05:00
David S. Miller
855404efae Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next tree,
they are:

* Add full port randomization support. Some crazy researchers found a way
  to reconstruct the secure ephemeral ports that are allocated in random mode
  by sending off-path bursts of UDP packets to overrun the socket buffer of
  the DNS resolver to trigger retransmissions, then if the timing for the
  DNS resolution done by a client is larger than usual, then they conclude
  that the port that received the burst of UDP packets is the one that was
  opened. It seems a bit aggressive method to me but it seems to work for
  them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a
  new NAT mode to fully randomize ports using prandom.

* Add a new classifier to x_tables based on the socket net_cls set via
  cgroups. These includes two patches to prepare the field as requested by
  Zefan Li. Also from Daniel Borkmann.

* Use prandom instead of get_random_bytes in several locations of the
  netfilter code, from Florian Westphal.

* Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack
  mark, also from Florian Westphal.

* Fix compilation warning due to unused variable in IPVS, from Geert
  Uytterhoeven.

* Add support for UID/GID via nfnetlink_queue, from Valentina Giusti.

* Add IPComp extension to x_tables, from Fan Du.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05 20:18:50 -05:00
stephen hemminger
eec73f1c96 tipc: remove unused code
Remove dead code;
       tipc_bearer_find_interface
       tipc_node_redundant_links

This may break out of tree version of TIPC if there still is one.
But that maybe a good thing :-)

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 20:18:50 -05:00
stephen hemminger
9805696399 tipc: make local function static
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 20:18:50 -05:00
stephen hemminger
6aee49c558 dccp: make local variable static
Make DCCP module config variable static, only used in one file.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 20:18:50 -05:00
stephen hemminger
fd34d627ae dccp: remove obsolete code
This function is defined but not used.
Remove it now, can be resurrected if ever needed.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 20:18:49 -05:00
Li RongQing
8f84985fec net: unify the pcpu_tstats and br_cpu_netstats as one
They are same, so unify them as one, pcpu_sw_netstats.

Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 20:10:24 -05:00
Marcel Holtmann
f9f462faa0 Bluetooth: Add quirk for disabling Delete Stored Link Key command
Some controller pretend they support the Delete Stored Link Key command,
but in reality they really don't support it.

  < HCI Command: Delete Stored Link Key (0x03|0x0012) plen 7
      bdaddr 00:00:00:00:00:00 all 1
  > HCI Event: Command Complete (0x0e) plen 4
      Delete Stored Link Key (0x03|0x0012) ncmd 1
      status 0x11 deleted 0
      Error: Unsupported Feature or Parameter Value

Not correctly supporting this command causes the controller setup to
fail and will make a device not work. However sending the command for
controller that handle stored link keys is important. This quirk
allows a driver to disable the command if it knows that this command
handling is broken.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2014-01-04 20:10:40 +02:00
Arron Wang
d31652a26b NFC: Fix target mode p2p link establishment
With commit e29a9e2ae1, we set the active_target pointer from
nfc_dep_link_is_up() in order to support the case where the target
detection and the DEP link setting are done atomically by the driver.
That can only happen in initiator mode, so we need to check for that
otherwise we fail to bring a p2p link in target mode.

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2014-01-04 03:31:32 +01:00
stephen hemminger
5e419e68a6 llc: make lock static
The llc_sap_list_lock does not need to be global, only acquired
in core.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03 20:56:48 -05:00
stephen hemminger
8f09898bf0 socket: cleanups
Namespace related cleaning

 * make cred_to_ucred static
 * remove unused sock_rmalloc function

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03 20:55:58 -05:00
Tom Herbert
9a4aa9af44 ipv4: Use percpu Cache route in IP tunnels
percpu route cache eliminates share of dst refcnt between CPUs.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03 19:40:57 -05:00
Tom Herbert
7d442fab0a ipv4: Cache dst in tunnels
Avoid doing a route lookup on every packet being tunneled.

In ip_tunnel.c cache the route returned from ip_route_output if
the tunnel is "connected" so that all the rouitng parameters are
taken from tunnel parms for a packet. Specifically, not NBMA tunnel
and tos is from tunnel parms (not inner packet).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03 19:38:45 -05:00
Neil Horman
f916ec9608 sctp: Add process name and pid to deprecation warnings
Recently I updated the sctp socket option deprecation warnings to be both a bit
more clear and ratelimited to prevent user processes from spamming the log file.
Ben Hutchings suggested that I add the process name and pid to these warnings so
that users can tell who is responsible for using the deprecated apis.  This
patch accomplishes that.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03 19:36:46 -05:00
Pablo Neira Ayuso
c9c8e48597 netfilter: nf_tables: dump sets in all existing families
This patch allows you to dump all sets available in all of
the registered families. This allows you to use NFPROTO_UNSPEC
to dump all existing sets, similarly to other existing table,
chain and rule operations.

This patch is based on original patch from Arturo Borrero
González.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-04 00:23:11 +01:00
Daniel Borkmann
82a37132f3 netfilter: x_tables: lightweight process control group matching
It would be useful e.g. in a server or desktop environment to have
a facility in the notion of fine-grained "per application" or "per
application group" firewall policies. Probably, users in the mobile,
embedded area (e.g. Android based) with different security policy
requirements for application groups could have great benefit from
that as well. For example, with a little bit of configuration effort,
an admin could whitelist well-known applications, and thus block
otherwise unwanted "hard-to-track" applications like [1] from a
user's machine. Blocking is just one example, but it is not limited
to that, meaning we can have much different scenarios/policies that
netfilter allows us than just blocking, e.g. fine grained settings
where applications are allowed to connect/send traffic to, application
traffic marking/conntracking, application-specific packet mangling,
and so on.

Implementation of PID-based matching would not be appropriate
as they frequently change, and child tracking would make that
even more complex and ugly. Cgroups would be a perfect candidate
for accomplishing that as they associate a set of tasks with a
set of parameters for one or more subsystems, in our case the
netfilter subsystem, which, of course, can be combined with other
cgroup subsystems into something more complex if needed.

As mentioned, to overcome this constraint, such processes could
be placed into one or multiple cgroups where different fine-grained
rules can be defined depending on the application scenario, while
e.g. everything else that is not part of that could be dropped (or
vice versa), thus making life harder for unwanted processes to
communicate to the outside world. So, we make use of cgroups here
to track jobs and limit their resources in terms of iptables
policies; in other words, limiting, tracking, etc what they are
allowed to communicate.

In our case we're working on outgoing traffic based on which local
socket that originated from. Also, one doesn't even need to have
an a-prio knowledge of the application internals regarding their
particular use of ports or protocols. Matching is *extremly*
lightweight as we just test for the sk_classid marker of sockets,
originating from net_cls. net_cls and netfilter do not contradict
each other; in fact, each construct can live as standalone or they
can be used in combination with each other, which is perfectly fine,
plus it serves Tejun's requirement to not introduce a new cgroups
subsystem. Through this, we result in a very minimal and efficient
module, and don't add anything except netfilter code.

One possible, minimal usage example (many other iptables options
can be applied obviously):

 1) Configuring cgroups if not already done, e.g.:

  mkdir /sys/fs/cgroup/net_cls
  mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
  mkdir /sys/fs/cgroup/net_cls/0
  echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid
  (resp. a real flow handle id for tc)

 2) Configuring netfilter (iptables-nftables), e.g.:

  iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP

 3) Running applications, e.g.:

  ping 208.67.222.222  <pid:1799>
  echo 1799 > /sys/fs/cgroup/net_cls/0/tasks
  64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
  [...]
  ping 208.67.220.220  <pid:1804>
  ping: sendmsg: Operation not permitted
  [...]
  echo 1804 > /sys/fs/cgroup/net_cls/0/tasks
  64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
  [...]

Of course, real-world deployments would make use of cgroups user
space toolsuite, or own custom policy daemons dynamically moving
applications from/to various cgroups.

  [1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:44 +01:00
Daniel Borkmann
86f8515f97 net: netprio: rename config to be more consistent with cgroup configs
While we're at it and introduced CGROUP_NET_CLASSID, lets also make
NETPRIO_CGROUP more consistent with the rest of cgroups and rename it
into CONFIG_CGROUP_NET_PRIO so that for networking, we now have
CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG
option consistent among networking cgroups, but also among cgroups
CONFIG conventions in general as the vast majority has a prefix of
CONFIG_CGROUP_<SUBSYS>.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:42 +01:00
Daniel Borkmann
fe1217c4f3 net: net_cls: move cgroupfs classid handling into core
Zefan Li requested [1] to perform the following cleanup/refactoring:

- Split cgroupfs classid handling into net core to better express a
  possible more generic use.

- Disable module support for cgroupfs bits as the majority of other
  cgroupfs subsystems do not have that, and seems to be not wished
  from cgroup side. Zefan probably might want to follow-up for netprio
  later on.

- By this, code can be further reduced which previously took care of
  functionality built when compiled as module.

cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
that we are consistent with {netclassid,netprio}_cgroup naming that is
under net/core/ as suggested by Zefan.

No change in functionality, but only code refactoring that is being
done here.

 [1] http://patchwork.ozlabs.org/patch/304825/

Suggested-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:41 +01:00
Eric Leblond
14abfa161d netfilter: xt_CT: fix error value in xt_ct_tg_check()
If setting event mask fails then we were returning 0 for success.
This patch updates return code to -EINVAL in case of problem.

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:39 +01:00
stephen hemminger
dcd93ed4cd netfilter: nf_conntrack: remove dead code
The following code is not used in current upstream code.
Some of this seems to be old hooks, other might be used by some
out of tree module (which I don't care about breaking), and
the need_ipv4_conntrack was used by old NAT code but no longer
called.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:37 +01:00
stephen hemminger
02eca9d2cc netfilter: ipset: remove unused code
Function never used in current upstream code.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:35 +01:00
Daniel Borkmann
34ce324019 netfilter: nf_nat: add full port randomization support
We currently use prandom_u32() for allocation of ports in tcp bind(0)
and udp code. In case of plain SNAT we try to keep the ports as is
or increment on collision.

SNAT --random mode does use per-destination incrementing port
allocation. As a recent paper pointed out in [1] that this mode of
port allocation makes it possible to an attacker to find the randomly
allocated ports through a timing side-channel in a socket overloading
attack conducted through an off-path attacker.

So, NF_NAT_RANGE_PROTO_RANDOM actually weakens the port randomization
in regard to the attack described in this paper. As we need to keep
compatibility, add another flag called NF_NAT_RANGE_PROTO_RANDOM_FULLY
that would replace the NF_NAT_RANGE_PROTO_RANDOM hash-based port
selection algorithm with a simple prandom_u32() in order to mitigate
this attack vector. Note that the lfsr113's internal state is
periodically reseeded by the kernel through a local secure entropy
source.

More details can be found in [1], the basic idea is to send bursts
of packets to a socket to overflow its receive queue and measure
the latency to detect a possible retransmit when the port is found.
Because of increasing ports to given destination and port, further
allocations can be predicted. This information could then be used by
an attacker for e.g. for cache-poisoning, NS pinning, and degradation
of service attacks against DNS servers [1]:

  The best defense against the poisoning attacks is to properly
  deploy and validate DNSSEC; DNSSEC provides security not only
  against off-path attacker but even against MitM attacker. We hope
  that our results will help motivate administrators to adopt DNSSEC.
  However, full DNSSEC deployment make take significant time, and
  until that happens, we recommend short-term, non-cryptographic
  defenses. We recommend to support full port randomisation,
  according to practices recommended in [2], and to avoid
  per-destination sequential port allocation, which we show may be
  vulnerable to derandomisation attacks.

Joint work between Hannes Frederic Sowa and Daniel Borkmann.

 [1] https://sites.google.com/site/hayashulman/files/NIC-derandomisation.pdf
 [2] http://arxiv.org/pdf/1205.5190v1.pdf

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:26 +01:00
Michal Nazarewicz
720e0dfa3a netfilter: nf_tables: remove unused variable in nf_tables_dump_set()
The nfmsg variable is not used (except in sizeof operator which does
not care about its value) between the first and second time it is
assigned the value.  Furthermore, nlmsg_data has no side effects, so
the assignment can be safely removed.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 22:51:14 +01:00
Daniel Borkmann
1466291790 netfilter: nf_tables: fix type in parsing in nf_tables_set_alloc_name()
In nf_tables_set_alloc_name(), we are trying to find a new, unused
name for our new set and interate through the list of present sets.
As far as I can see, we're using format string %d to parse already
present names in order to mark their presence in a bitmap, so that
we can later on find the first 0 in that map to assign the new set
name to. We should rather use a temporary variable of type int to
store the result of sscanf() to, and for making sanity checks on.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 22:50:59 +01:00
Fan Du
8101328b79 {pktgen, xfrm} Show spi value properly when ipsec turned on
If user run pktgen plus ipsec by using spi, show spi value
properly when cat /proc/net/pktgen/ethX

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:12 +01:00
Fan Du
c454997e68 {pktgen, xfrm} Introduce xfrm_state_lookup_byspi for pktgen
Introduce xfrm_state_lookup_byspi to find user specified by custom
from "pgset spi xxx". Using this scheme, any flow regardless its
saddr/daddr could be transform by SA specified with configurable
spi.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:12 +01:00
Fan Du
cf93d47ed4 {pktgen, xfrm} Construct skb dst for tunnel mode transformation
IPsec tunnel mode encapuslation needs to set outter ip header
with right protocol/ttl/id value with regard to skb->dst->child.

Looking up a rt in a standard way is absolutely wrong for every
packet transmission. In a simple way, construct a dst by setting
neccessary information to make tunnel mode encapuslation working.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:11 +01:00
Fan Du
de4aee7d69 {pktgen, xfrm} Using "pgset spi xxx" to spedifiy SA for a given flow
User could set specific SPI value to arm pktgen flow with IPsec
transformation, instead of looking up SA by sadr/daddr. The reaseon
to do so is because current state lookup scheme is both slow and, most
important of all, in fact pktgen doesn't need to match any SA state
addresses information, all it needs is the SA transfromation shell to
do the encapuslation.

And this option also provide user an alternative to using pktgen
test existing SA without creating new ones.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:11 +01:00
Fan Du
4ae770bf58 {pktgen, xfrm} Correct xfrm_state_lock usage in xfrm_stateonly_find
Acquiring xfrm_state_lock in process context is expected to turn BH off,
as this lock is also used in BH context, namely xfrm state timer handler.
Otherwise it surprises LOCKDEP with below messages.

[   81.422781] pktgen: Packet Generator for packet performance testing. Version: 2.74
[   81.725194]
[   81.725211] =========================================================
[   81.725212] [ INFO: possible irq lock inversion dependency detected ]
[   81.725215] 3.13.0-rc2+ #92 Not tainted
[   81.725216] ---------------------------------------------------------
[   81.725218] kpktgend_0/2780 just changed the state of lock:
[   81.725220]  (xfrm_state_lock){+.+...}, at: [<ffffffff816dd751>] xfrm_stateonly_find+0x41/0x1f0
[   81.725231] but this lock was taken by another, SOFTIRQ-safe lock in the past:
[   81.725232]  (&(&x->lock)->rlock){+.-...}
[   81.725232]
[   81.725232] and interrupts could create inverse lock ordering between them.
[   81.725232]
[   81.725235]
[   81.725235] other info that might help us debug this:
[   81.725237]  Possible interrupt unsafe locking scenario:
[   81.725237]
[   81.725238]        CPU0                    CPU1
[   81.725240]        ----                    ----
[   81.725241]   lock(xfrm_state_lock);
[   81.725243]                                local_irq_disable();
[   81.725244]                                lock(&(&x->lock)->rlock);
[   81.725246]                                lock(xfrm_state_lock);
[   81.725248]   <Interrupt>
[   81.725249]     lock(&(&x->lock)->rlock);
[   81.725251]
[   81.725251]  *** DEADLOCK ***
[   81.725251]
[   81.725254] no locks held by kpktgend_0/2780.
[   81.725255]
[   81.725255] the shortest dependencies between 2nd lock and 1st lock:
[   81.725269]  -> (&(&x->lock)->rlock){+.-...} ops: 8 {
[   81.725274]     HARDIRQ-ON-W at:
[   81.725276]                       [<ffffffff8109a64b>] __lock_acquire+0x65b/0x1d70
[   81.725282]                       [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725284]                       [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   81.725289]                       [<ffffffff816dc3a3>] xfrm_timer_handler+0x43/0x290
[   81.725292]                       [<ffffffff81059437>] __tasklet_hrtimer_trampoline+0x17/0x40
[   81.725300]                       [<ffffffff8105a1b7>] tasklet_hi_action+0xd7/0xf0
[   81.725303]                       [<ffffffff81059ac6>] __do_softirq+0xe6/0x2d0
[   81.725305]                       [<ffffffff8105a026>] irq_exit+0x96/0xc0
[   81.725308]                       [<ffffffff8177fd0a>] smp_apic_timer_interrupt+0x4a/0x60
[   81.725313]                       [<ffffffff8177e96f>] apic_timer_interrupt+0x6f/0x80
[   81.725316]                       [<ffffffff8100b7c6>] arch_cpu_idle+0x26/0x30
[   81.725329]                       [<ffffffff810ace28>] cpu_startup_entry+0x88/0x2b0
[   81.725333]                       [<ffffffff8102e5b0>] start_secondary+0x190/0x1f0
[   81.725338]     IN-SOFTIRQ-W at:
[   81.725340]                       [<ffffffff8109a61d>] __lock_acquire+0x62d/0x1d70
[   81.725342]                       [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725344]                       [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   81.725347]                       [<ffffffff816dc3a3>] xfrm_timer_handler+0x43/0x290
[   81.725349]                       [<ffffffff81059437>] __tasklet_hrtimer_trampoline+0x17/0x40
[   81.725352]                       [<ffffffff8105a1b7>] tasklet_hi_action+0xd7/0xf0
[   81.725355]                       [<ffffffff81059ac6>] __do_softirq+0xe6/0x2d0
[   81.725358]                       [<ffffffff8105a026>] irq_exit+0x96/0xc0
[   81.725360]                       [<ffffffff8177fd0a>] smp_apic_timer_interrupt+0x4a/0x60
[   81.725363]                       [<ffffffff8177e96f>] apic_timer_interrupt+0x6f/0x80
[   81.725365]                       [<ffffffff8100b7c6>] arch_cpu_idle+0x26/0x30
[   81.725368]                       [<ffffffff810ace28>] cpu_startup_entry+0x88/0x2b0
[   81.725370]                       [<ffffffff8102e5b0>] start_secondary+0x190/0x1f0
[   81.725373]     INITIAL USE at:
[   81.725375]                      [<ffffffff8109a31a>] __lock_acquire+0x32a/0x1d70
[   81.725385]                      [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725388]                      [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   81.725390]                      [<ffffffff816dc3a3>] xfrm_timer_handler+0x43/0x290
[   81.725394]                      [<ffffffff81059437>] __tasklet_hrtimer_trampoline+0x17/0x40
[   81.725398]                      [<ffffffff8105a1b7>] tasklet_hi_action+0xd7/0xf0
[   81.725401]                      [<ffffffff81059ac6>] __do_softirq+0xe6/0x2d0
[   81.725404]                      [<ffffffff8105a026>] irq_exit+0x96/0xc0
[   81.725407]                      [<ffffffff8177fd0a>] smp_apic_timer_interrupt+0x4a/0x60
[   81.725409]                      [<ffffffff8177e96f>] apic_timer_interrupt+0x6f/0x80
[   81.725412]                      [<ffffffff8100b7c6>] arch_cpu_idle+0x26/0x30
[   81.725415]                      [<ffffffff810ace28>] cpu_startup_entry+0x88/0x2b0
[   81.725417]                      [<ffffffff8102e5b0>] start_secondary+0x190/0x1f0
[   81.725420]   }
[   81.725421]   ... key      at: [<ffffffff8295b9c8>] __key.46349+0x0/0x8
[   81.725445]   ... acquired at:
[   81.725446]    [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725449]    [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   81.725452]    [<ffffffff816dc057>] __xfrm_state_delete+0x37/0x140
[   81.725454]    [<ffffffff816dc18c>] xfrm_state_delete+0x2c/0x50
[   81.725456]    [<ffffffff816dc277>] xfrm_state_flush+0xc7/0x1b0
[   81.725458]    [<ffffffffa005f6cc>] pfkey_flush+0x7c/0x100 [af_key]
[   81.725465]    [<ffffffffa005efb7>] pfkey_process+0x1c7/0x1f0 [af_key]
[   81.725468]    [<ffffffffa005f139>] pfkey_sendmsg+0x159/0x260 [af_key]
[   81.725471]    [<ffffffff8162c16f>] sock_sendmsg+0xaf/0xc0
[   81.725476]    [<ffffffff8162c99c>] SYSC_sendto+0xfc/0x130
[   81.725479]    [<ffffffff8162cf3e>] SyS_sendto+0xe/0x10
[   81.725482]    [<ffffffff8177dd12>] system_call_fastpath+0x16/0x1b
[   81.725484]
[   81.725486] -> (xfrm_state_lock){+.+...} ops: 11 {
[   81.725490]    HARDIRQ-ON-W at:
[   81.725493]                     [<ffffffff8109a64b>] __lock_acquire+0x65b/0x1d70
[   81.725504]                     [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725507]                     [<ffffffff81774e4b>] _raw_spin_lock_bh+0x3b/0x70
[   81.725510]                     [<ffffffff816dc1df>] xfrm_state_flush+0x2f/0x1b0
[   81.725513]                     [<ffffffffa005f6cc>] pfkey_flush+0x7c/0x100 [af_key]
[   81.725516]                     [<ffffffffa005efb7>] pfkey_process+0x1c7/0x1f0 [af_key]
[   81.725519]                     [<ffffffffa005f139>] pfkey_sendmsg+0x159/0x260 [af_key]
[   81.725522]                     [<ffffffff8162c16f>] sock_sendmsg+0xaf/0xc0
[   81.725525]                     [<ffffffff8162c99c>] SYSC_sendto+0xfc/0x130
[   81.725527]                     [<ffffffff8162cf3e>] SyS_sendto+0xe/0x10
[   81.725530]                     [<ffffffff8177dd12>] system_call_fastpath+0x16/0x1b
[   81.725533]    SOFTIRQ-ON-W at:
[   81.725534]                     [<ffffffff8109a67a>] __lock_acquire+0x68a/0x1d70
[   81.725537]                     [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725539]                     [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   81.725541]                     [<ffffffff816dd751>] xfrm_stateonly_find+0x41/0x1f0
[   81.725544]                     [<ffffffffa008af03>] mod_cur_headers+0x793/0x7f0 [pktgen]
[   81.725547]                     [<ffffffffa008bca2>] pktgen_thread_worker+0xd42/0x1880 [pktgen]
[   81.725550]                     [<ffffffff81078f84>] kthread+0xe4/0x100
[   81.725555]                     [<ffffffff8177dc6c>] ret_from_fork+0x7c/0xb0
[   81.725565]    INITIAL USE at:
[   81.725567]                    [<ffffffff8109a31a>] __lock_acquire+0x32a/0x1d70
[   81.725569]                    [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725572]                    [<ffffffff81774e4b>] _raw_spin_lock_bh+0x3b/0x70
[   81.725574]                    [<ffffffff816dc1df>] xfrm_state_flush+0x2f/0x1b0
[   81.725576]                    [<ffffffffa005f6cc>] pfkey_flush+0x7c/0x100 [af_key]
[   81.725580]                    [<ffffffffa005efb7>] pfkey_process+0x1c7/0x1f0 [af_key]
[   81.725583]                    [<ffffffffa005f139>] pfkey_sendmsg+0x159/0x260 [af_key]
[   81.725586]                    [<ffffffff8162c16f>] sock_sendmsg+0xaf/0xc0
[   81.725589]                    [<ffffffff8162c99c>] SYSC_sendto+0xfc/0x130
[   81.725594]                    [<ffffffff8162cf3e>] SyS_sendto+0xe/0x10
[   81.725597]                    [<ffffffff8177dd12>] system_call_fastpath+0x16/0x1b
[   81.725599]  }
[   81.725600]  ... key      at: [<ffffffff81cadef8>] xfrm_state_lock+0x18/0x50
[   81.725606]  ... acquired at:
[   81.725607]    [<ffffffff810995c0>] check_usage_backwards+0x110/0x150
[   81.725609]    [<ffffffff81099e96>] mark_lock+0x196/0x2f0
[   81.725611]    [<ffffffff8109a67a>] __lock_acquire+0x68a/0x1d70
[   81.725614]    [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725616]    [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   81.725627]    [<ffffffff816dd751>] xfrm_stateonly_find+0x41/0x1f0
[   81.725629]    [<ffffffffa008af03>] mod_cur_headers+0x793/0x7f0 [pktgen]
[   81.725632]    [<ffffffffa008bca2>] pktgen_thread_worker+0xd42/0x1880 [pktgen]
[   81.725635]    [<ffffffff81078f84>] kthread+0xe4/0x100
[   81.725637]    [<ffffffff8177dc6c>] ret_from_fork+0x7c/0xb0
[   81.725640]
[   81.725641]
[   81.725641] stack backtrace:
[   81.725645] CPU: 0 PID: 2780 Comm: kpktgend_0 Not tainted 3.13.0-rc2+ #92
[   81.725647] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[   81.725649]  ffffffff82537b80 ffff880018199988 ffffffff8176af37 0000000000000007
[   81.725652]  ffff8800181999f0 ffff8800181999d8 ffffffff81099358 ffffffff82537b80
[   81.725655]  ffffffff81a32def ffff8800181999f4 0000000000000000 ffff880002cbeaa8
[   81.725659] Call Trace:
[   81.725664]  [<ffffffff8176af37>] dump_stack+0x46/0x58
[   81.725667]  [<ffffffff81099358>] print_irq_inversion_bug.part.42+0x1e8/0x1f0
[   81.725670]  [<ffffffff810995c0>] check_usage_backwards+0x110/0x150
[   81.725672]  [<ffffffff81099e96>] mark_lock+0x196/0x2f0
[   81.725675]  [<ffffffff810994b0>] ? check_usage_forwards+0x150/0x150
[   81.725685]  [<ffffffff8109a67a>] __lock_acquire+0x68a/0x1d70
[   81.725691]  [<ffffffff810899a5>] ? sched_clock_local+0x25/0x90
[   81.725694]  [<ffffffff81089b38>] ? sched_clock_cpu+0xa8/0x120
[   81.725697]  [<ffffffff8109a31a>] ? __lock_acquire+0x32a/0x1d70
[   81.725699]  [<ffffffff816dd751>] ? xfrm_stateonly_find+0x41/0x1f0
[   81.725702]  [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   81.725704]  [<ffffffff816dd751>] ? xfrm_stateonly_find+0x41/0x1f0
[   81.725707]  [<ffffffff810899a5>] ? sched_clock_local+0x25/0x90
[   81.725710]  [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   81.725712]  [<ffffffff816dd751>] ? xfrm_stateonly_find+0x41/0x1f0
[   81.725715]  [<ffffffff810971ec>] ? lock_release_holdtime.part.26+0x1c/0x1a0
[   81.725717]  [<ffffffff816dd751>] xfrm_stateonly_find+0x41/0x1f0
[   81.725721]  [<ffffffffa008af03>] mod_cur_headers+0x793/0x7f0 [pktgen]
[   81.725724]  [<ffffffffa008bca2>] pktgen_thread_worker+0xd42/0x1880 [pktgen]
[   81.725727]  [<ffffffffa008ba71>] ? pktgen_thread_worker+0xb11/0x1880 [pktgen]
[   81.725729]  [<ffffffff8109cf9d>] ? trace_hardirqs_on+0xd/0x10
[   81.725733]  [<ffffffff81775410>] ? _raw_spin_unlock_irq+0x30/0x40
[   81.725745]  [<ffffffff8151faa0>] ? e1000_clean+0x9d0/0x9d0
[   81.725751]  [<ffffffff81094310>] ? __init_waitqueue_head+0x60/0x60
[   81.725753]  [<ffffffff81094310>] ? __init_waitqueue_head+0x60/0x60
[   81.725757]  [<ffffffffa008af60>] ? mod_cur_headers+0x7f0/0x7f0 [pktgen]
[   81.725759]  [<ffffffff81078f84>] kthread+0xe4/0x100
[   81.725762]  [<ffffffff81078ea0>] ? flush_kthread_worker+0x170/0x170
[   81.725765]  [<ffffffff8177dc6c>] ret_from_fork+0x7c/0xb0
[   81.725768]  [<ffffffff81078ea0>] ? flush_kthread_worker+0x170/0x170

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:11 +01:00
Fan Du
6de9ace4ae {pktgen, xfrm} Add statistics counting when transforming
so /proc/net/xfrm_stat could give user clue about what's
wrong in this process.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:11 +01:00
Fan Du
0af0a4136b {pktgen, xfrm} Correct xfrm state lock usage when transforming
xfrm_state lock protects its state, i.e., VALID/DEAD and statistics,
not the transforming procedure, as both mode/type output functions
are reentrant.

Another issue is state lock can be used in BH context when state timer
alarmed, after transformation in pktgen, update state statistics acquiring
state lock should disabled BH context for a moment. Otherwise LOCKDEP
critisize this:

[   62.354339] pktgen: Packet Generator for packet performance testing. Version: 2.74
[   62.655444]
[   62.655448] =================================
[   62.655451] [ INFO: inconsistent lock state ]
[   62.655455] 3.13.0-rc2+ #70 Not tainted
[   62.655457] ---------------------------------
[   62.655459] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[   62.655463] kpktgend_0/2764 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   62.655466]  (&(&x->lock)->rlock){+.?...}, at: [<ffffffffa00886f6>] pktgen_thread_worker+0x1796/0x1860 [pktgen]
[   62.655479] {IN-SOFTIRQ-W} state was registered at:
[   62.655484]   [<ffffffff8109a61d>] __lock_acquire+0x62d/0x1d70
[   62.655492]   [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   62.655498]   [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   62.655505]   [<ffffffff816dc3a3>] xfrm_timer_handler+0x43/0x290
[   62.655511]   [<ffffffff81059437>] __tasklet_hrtimer_trampoline+0x17/0x40
[   62.655519]   [<ffffffff8105a1b7>] tasklet_hi_action+0xd7/0xf0
[   62.655523]   [<ffffffff81059ac6>] __do_softirq+0xe6/0x2d0
[   62.655526]   [<ffffffff8105a026>] irq_exit+0x96/0xc0
[   62.655530]   [<ffffffff8177fd0a>] smp_apic_timer_interrupt+0x4a/0x60
[   62.655537]   [<ffffffff8177e96f>] apic_timer_interrupt+0x6f/0x80
[   62.655541]   [<ffffffff8100b7c6>] arch_cpu_idle+0x26/0x30
[   62.655547]   [<ffffffff810ace28>] cpu_startup_entry+0x88/0x2b0
[   62.655552]   [<ffffffff81761c3c>] rest_init+0xbc/0xd0
[   62.655557]   [<ffffffff81ea5e5e>] start_kernel+0x3c4/0x3d1
[   62.655583]   [<ffffffff81ea55a8>] x86_64_start_reservations+0x2a/0x2c
[   62.655588]   [<ffffffff81ea569f>] x86_64_start_kernel+0xf5/0xfc
[   62.655592] irq event stamp: 77
[   62.655594] hardirqs last  enabled at (77): [<ffffffff810ab7f2>] vprintk_emit+0x1b2/0x520
[   62.655597] hardirqs last disabled at (76): [<ffffffff810ab684>] vprintk_emit+0x44/0x520
[   62.655601] softirqs last  enabled at (22): [<ffffffff81059b57>] __do_softirq+0x177/0x2d0
[   62.655605] softirqs last disabled at (15): [<ffffffff8105a026>] irq_exit+0x96/0xc0
[   62.655609]
[   62.655609] other info that might help us debug this:
[   62.655613]  Possible unsafe locking scenario:
[   62.655613]
[   62.655616]        CPU0
[   62.655617]        ----
[   62.655618]   lock(&(&x->lock)->rlock);
[   62.655622]   <Interrupt>
[   62.655623]     lock(&(&x->lock)->rlock);
[   62.655626]
[   62.655626]  *** DEADLOCK ***
[   62.655626]
[   62.655629] no locks held by kpktgend_0/2764.
[   62.655631]
[   62.655631] stack backtrace:
[   62.655636] CPU: 0 PID: 2764 Comm: kpktgend_0 Not tainted 3.13.0-rc2+ #70
[   62.655638] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[   62.655642]  ffffffff8216b7b0 ffff88001be43ab8 ffffffff8176af37 0000000000000007
[   62.655652]  ffff88001c8d4fc0 ffff88001be43b18 ffffffff81766d78 0000000000000000
[   62.655663]  ffff880000000001 ffff880000000001 ffffffff8101025f ffff88001be43b18
[   62.655671] Call Trace:
[   62.655680]  [<ffffffff8176af37>] dump_stack+0x46/0x58
[   62.655685]  [<ffffffff81766d78>] print_usage_bug+0x1f1/0x202
[   62.655691]  [<ffffffff8101025f>] ? save_stack_trace+0x2f/0x50
[   62.655696]  [<ffffffff81099f8c>] mark_lock+0x28c/0x2f0
[   62.655700]  [<ffffffff810994b0>] ? check_usage_forwards+0x150/0x150
[   62.655704]  [<ffffffff8109a67a>] __lock_acquire+0x68a/0x1d70
[   62.655712]  [<ffffffff81115b09>] ? irq_work_queue+0x69/0xb0
[   62.655717]  [<ffffffff810ab7f2>] ? vprintk_emit+0x1b2/0x520
[   62.655722]  [<ffffffff8109cec5>] ? trace_hardirqs_on_caller+0x105/0x1d0
[   62.655730]  [<ffffffffa00886f6>] ? pktgen_thread_worker+0x1796/0x1860 [pktgen]
[   62.655734]  [<ffffffff8109c3c7>] lock_acquire+0x97/0x130
[   62.655741]  [<ffffffffa00886f6>] ? pktgen_thread_worker+0x1796/0x1860 [pktgen]
[   62.655745]  [<ffffffff81774af6>] _raw_spin_lock+0x36/0x70
[   62.655752]  [<ffffffffa00886f6>] ? pktgen_thread_worker+0x1796/0x1860 [pktgen]
[   62.655758]  [<ffffffffa00886f6>] pktgen_thread_worker+0x1796/0x1860 [pktgen]
[   62.655766]  [<ffffffffa0087a79>] ? pktgen_thread_worker+0xb19/0x1860 [pktgen]
[   62.655771]  [<ffffffff8109cf9d>] ? trace_hardirqs_on+0xd/0x10
[   62.655777]  [<ffffffff81775410>] ? _raw_spin_unlock_irq+0x30/0x40
[   62.655785]  [<ffffffff8151faa0>] ? e1000_clean+0x9d0/0x9d0
[   62.655791]  [<ffffffff81094310>] ? __init_waitqueue_head+0x60/0x60
[   62.655795]  [<ffffffff81094310>] ? __init_waitqueue_head+0x60/0x60
[   62.655800]  [<ffffffffa0086f60>] ? mod_cur_headers+0x7f0/0x7f0 [pktgen]
[   62.655806]  [<ffffffff81078f84>] kthread+0xe4/0x100
[   62.655813]  [<ffffffff81078ea0>] ? flush_kthread_worker+0x170/0x170
[   62.655819]  [<ffffffff8177dc6c>] ret_from_fork+0x7c/0xb0
[   62.655824]  [<ffffffff81078ea0>] ? flush_kthread_worker+0x170/0x170

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:10 +01:00
David S. Miller
aca5f58f9b netpoll: Fix missing TXQ unlock and and OOPS.
The VLAN tag handling code in netpoll_send_skb_on_dev() has two problems.

1) It exits without unlocking the TXQ.

2) It then tries to queue a NULL skb to npinfo->txq.

Reported-by: Ahmed Tamrawi <atamrawi@iastate.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:50:52 -05:00
Li RongQing
469bdcefdc ipv6: fix the use of pcpu_tstats in ip6_vti.c
when read/write the 64bit data, the correct lock should be hold.
and we can use the generic vti6_get_stats to return stats, and
not define a new one in ip6_vti.c

Fixes: 87b6d218f3 ("tunnel: implement 64 bits statistics")
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:37:21 -05:00
Li RongQing
abb6013cca ipv6: fix the use of pcpu_tstats in ip6_tunnel
when read/write the 64bit data, the correct lock should be hold.

Fixes: 87b6d218f3 ("tunnel: implement 64 bits statistics")

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:37:21 -05:00
Yasushi Asano
fad8da3e08 ipv6 addrconf: fix preferred lifetime state-changing behavior while valid_lft is infinity
Fixed a problem with setting the lifetime of an IPv6
address. When setting preferred_lft to a value not zero or
infinity, while valid_lft is infinity(0xffffffff) preferred
lifetime is set to forever and does not update. Therefore
preferred lifetime never becomes deprecated. valid lifetime
and preferred lifetime should be set independently, even if
valid lifetime is infinity, preferred lifetime must expire
correctly (meaning it must eventually become deprecated)

Signed-off-by: Yasushi Asano <yasushi.asano@jp.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:34:40 -05:00
Daniel Borkmann
4d231b76ee net: llc: fix use after free in llc_ui_recvmsg
While commit 30a584d944 fixes datagram interface in LLC, a use
after free bug has been introduced for SOCK_STREAM sockets that do
not make use of MSG_PEEK.

The flow is as follow ...

  if (!(flags & MSG_PEEK)) {
    ...
    sk_eat_skb(sk, skb, false);
    ...
  }
  ...
  if (used + offset < skb->len)
    continue;

... where sk_eat_skb() calls __kfree_skb(). Therefore, cache
original length and work on skb_len to check partial reads.

Fixes: 30a584d944 ("[LLX]: SOCK_DGRAM interface fixes")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:31:09 -05:00
Wei-Chun Chao
7a7ffbabf9 ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC
VM to VM GSO traffic is broken if it goes through VXLAN or GRE
tunnel and the physical NIC on the host supports hardware VXLAN/GRE
GSO offload (e.g. bnx2x and next-gen mlx4).

Two issues -
(VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with
SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header
integrity check fails in udp4_ufo_fragment if inner protocol is
TCP. Also gso_segs is calculated incorrectly using skb->len that
includes tunnel header. Fix: robust check should only be applied
to the inner packet.

(VXLAN & GRE) Once GSO header integrity check passes, NULL segs
is returned and the original skb is sent to hardware. However the
tunnel header is already pulled. Fix: tunnel header needs to be
restored so that hardware can perform GSO properly on the original
packet.

Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:06:47 -05:00
Cong Wang
c1ddf295f5 net: revert "sched classifier: make cgroup table local"
This reverts commit de6fb288b1.
Otherwise we got:

net/sched/cls_cgroup.c:106:29: error: static declaration of ‘net_cls_subsys’ follows non-static declaration
 static struct cgroup_subsys net_cls_subsys = {
                             ^
In file included from include/linux/cgroup.h:654:0,
                 from net/sched/cls_cgroup.c:18:
include/linux/cgroup_subsys.h:35:29: note: previous declaration of ‘net_cls_subsys’ was here
 SUBSYS(net_cls)
                             ^
make[2]: *** [net/sched/cls_cgroup.o] Error 1

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:02:12 -05:00
Vlad Yasevich
619a60ee04 sctp: Remove outqueue empty state
The SCTP outqueue structure maintains a data chunks
that are pending transmission, the list of chunks that
are pending a retransmission and a length of data in
flight.  It also tries to keep the emtpy state so that
it can performe shutdown sequence or notify user.

The problem is that the empy state is inconsistently
tracked.  It is possible to completely drain the queue
without sending anything when using PR-SCTP.  In this
case, the empty state will not be correctly state as
report by Jamal Hadi Salim <jhs@mojatatu.com>.  This
can cause an association to be perminantly stuck in the
SHUTDOWN_PENDING state.

Additionally, SCTP is incredibly inefficient when setting
the empty state.  Even though all the data is availaible
in the outqueue structure, we ignore it and walk a list
of trasnports.

In the end, we can completely remove the extra empty
state and figure out if the queue is empty by looking
at 3 things:  length of pending data, length of in-flight
data, and exisiting of retransmit data.  All of these
are already in the strucutre.

Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 17:22:48 -05:00
stephen hemminger
de6fb288b1 sched classifier: make cgroup table local
Doesn't need to be global.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
stephen hemminger
9c75f4029c sched action: make local function static
No need to export functions only used in one file.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
Weilong Chen
dd9b45598a ipv4: switch and case should be at the same indent
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
Weilong Chen
442c67f844 ipv4: spaces required around that '='
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
Li RongQing
0c3584d589 ipv6: remove prune parameter for fib6_clean_all
since the prune parameter for fib6_clean_all always is 0, remove it.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:35 -05:00
wangweidong
b055597697 tipc: make the code look more readable
In commit 3b8401fe9d ("tipc: kill unnecessary goto's") didn't make
the code look most readable, so fix it. This patch is cosmetic
and does not change the operation of TIPC in any way.

Suggested-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:35 -05:00
Weilong Chen
2f3ea9a95c xfrm: checkpatch erros with inline keyword position
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-02 07:48:51 +01:00
Weilong Chen
42054569f9 xfrm: fix checkpatch error
Fix that "else should follow close brace '}'".

Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-02 07:48:50 +01:00
Weilong Chen
02d0892f98 xfrm: checkpatch erros with space prohibited
Fix checkpatch error "space prohibited xxx".

Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-02 07:48:50 +01:00