Commit Graph

425 Commits

Author SHA1 Message Date
Thomas Graf
239fb791d4 vxlan: Correctly set flow*i_mark and flow4i_proto in route lookups
VXLAN must provide the skb mark and specifiy IPPROTO_UDP when doing
the FIB lookup for the remote ip. Otherwise an invalid route might
be returned.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 19:37:54 -04:00
Li RongQing
608404290e vxlan: remove the unnecessary codes
The return value of vxlan_fdb_replace always is greater than or equal to 0

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 18:45:49 -04:00
David S. Miller
87ffabb1f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14 15:44:14 -04:00
Jesse Gross
b736a623bd udptunnels: Call handle_offloads after inserting vlan tag.
handle_offloads() calls skb_reset_inner_headers() to store
the layer pointers to the encapsulated packet. However, we
currently push the vlag tag (if there is one) onto the packet
afterwards. This changes the MAC header for the encapsulated
packet but it is not reflected in skb->inner_mac_header, which
breaks GSO and drivers which attempt to use this for encapsulation
offloads.

Fixes: 1eaa8178 ("vxlan: Add tx-vlan offload support.")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:56:32 -04:00
WANG Cong
f13b1689d7 vxlan: do not exit on error in vxlan_stop()
We need to clean up vxlan despite vxlan_igmp_leave() fails.

This fixes the following kernel warning:

 WARNING: CPU: 0 PID: 6 at lib/debugobjects.c:263 debug_print_object+0x7c/0x8d()
 ODEBUG: free active (active state 0) object type: timer_list hint: vxlan_cleanup+0x0/0xd0
 CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.0.0-rc7+ #953
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: netns cleanup_net
  0000000000000009 ffff88011955f948 ffffffff81a25f5a 00000000253f253e
  ffff88011955f998 ffff88011955f988 ffffffff8107608e 0000000000000000
  ffffffff814deba2 ffff8800d4e94000 ffffffff82254c30 ffffffff81fbe455
 Call Trace:
  [<ffffffff81a25f5a>] dump_stack+0x4c/0x65
  [<ffffffff8107608e>] warn_slowpath_common+0x9c/0xb6
  [<ffffffff814deba2>] ? debug_print_object+0x7c/0x8d
  [<ffffffff81076116>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff814deba2>] debug_print_object+0x7c/0x8d
  [<ffffffff81666bf1>] ? vxlan_fdb_destroy+0x5b/0x5b
  [<ffffffff814dee02>] __debug_check_no_obj_freed+0xc3/0x15f
  [<ffffffff814df728>] debug_check_no_obj_freed+0x12/0x16
  [<ffffffff8117ae4e>] slab_free_hook+0x64/0x6c
  [<ffffffff8114deaa>] ? kvfree+0x31/0x33
  [<ffffffff8117dc66>] kfree+0x101/0x1ac
  [<ffffffff8114deaa>] kvfree+0x31/0x33
  [<ffffffff817d4137>] netdev_freemem+0x18/0x1a
  [<ffffffff817e8b52>] netdev_release+0x2e/0x32
  [<ffffffff815b4163>] device_release+0x5a/0x92
  [<ffffffff814bd4dd>] kobject_cleanup+0x49/0x5e
  [<ffffffff814bd3ff>] kobject_put+0x45/0x49
  [<ffffffff817d3fc1>] netdev_run_todo+0x26f/0x283
  [<ffffffff817d4873>] ? rollback_registered_many+0x20f/0x23b
  [<ffffffff817e0c80>] rtnl_unlock+0xe/0x10
  [<ffffffff817d4af0>] default_device_exit_batch+0x12a/0x139
  [<ffffffff810aadfa>] ? wait_woken+0x8f/0x8f
  [<ffffffff817c8e14>] ops_exit_list+0x2b/0x57
  [<ffffffff817c9b21>] cleanup_net+0x154/0x1e7
  [<ffffffff8108b05d>] process_one_work+0x255/0x4ad
  [<ffffffff8108af69>] ? process_one_work+0x161/0x4ad
  [<ffffffff8108b4b1>] worker_thread+0x1cd/0x2ab
  [<ffffffff8108b2e4>] ? process_scheduled_works+0x2f/0x2f
  [<ffffffff81090686>] kthread+0xd4/0xdc
  [<ffffffff8109eca3>] ? local_clock+0x19/0x22
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83
  [<ffffffff81a31c48>] ret_from_fork+0x58/0x90
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83

For the long-term, we should handle NETDEV_{UP,DOWN} event
from the lower device of a tunnel device.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 22:48:08 -04:00
WANG Cong
2790460e14 vxlan: fix a shadow local variable
Commit 79b16aadea
("udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb()")
introduce 'sk' but we already have one inner 'sk'.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 14:56:35 -04:00
David Miller
79b16aadea udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb().
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.

Based upon a patch by Jiri Pirko.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
2015-04-07 15:29:08 -04:00
Simon Horman
c4b495128c vxlan: correct spelling in comments
Fix some spelling / typos:
* droppped -> dropped
* asddress -> address
* compatbility -> compatibility

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-01 22:52:29 -04:00
Jiri Benc
67b61f6c13 netlink: implement nla_get_in_addr and nla_get_in6_addr
Those are counterparts to nla_put_in_addr and nla_put_in6_addr.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc
930345ea63 netlink: implement nla_put_in_addr and nla_put_in6_addr
IP addresses are often stored in netlink attributes. Add generic functions
to do that.

For nla_put_in_addr, it would be nicer to pass struct in_addr but this is
not used universally throughout the kernel, in way too many places __be32 is
used to store IPv4 address.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc
f0ef31264c vxlan: fix indentation
Some lines in vxlan code are indented by 7 spaces instead of a tab.

Fixes: e4c7ed4153 ("vxlan: add ipv6 support")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:55:21 -04:00
Marcelo Ricardo Leitner
24c0e6838c vxlan: simplify if clause in dev_close
Dan Carpenter's static checker warned that in vxlan_stop we are checking
if 'vs' can be NULL while later we simply derreference it.

As after commit 56ef9c909b ("vxlan: Move socket initialization to
within rtnl scope") 'vs' just cannot be NULL in vxlan_stop() anymore, as
the interface won't go up if the socket initialization fails. So we are
good to just remove the check and make it consistent.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 17:01:58 -04:00
David S. Miller
0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Marcelo Ricardo Leitner
149d7549c2 vxlan: fix possible use of uninitialized in vxlan_igmp_{join, leave}
Test robot noticed that we check the return of vxlan_igmp_join and leave
but inside them there was a path that it could be used initialized.

It's not really possible because those if() inside these igmp functions
would always match as we can't have sockets of other type in there, but
this way we keep the compiler happy.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 13:31:24 -04:00
Marcelo Ricardo Leitner
56ef9c909b vxlan: Move socket initialization to within rtnl scope
Currently, if a multicast join operation fail, the vxlan interface will
be UP but not functional, without even a log message informing the user.

Now that we can grab socket lock after already having rntl, we don't
need to defer socket creation and multicast operations. By not deferring
we can do proper error reporting to the user through ip exit code.

This patch thus removes all deferred work that vxlan had and put it back
inline. Now the socket will only be created, bound and join multicast
group when one bring the interface up, and will undo all that as soon as
one put the interface down.

As vxlan_sock_hold() is not used after this patch, it was removed too.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:10 -04:00
Marcelo Ricardo Leitner
54ff9ef36b ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
  reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
  __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
  __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:09 -04:00
Alexey Kodanev
40fb70f3aa vxlan: fix wrong usage of VXLAN_VID_MASK
commit dfd8645ea1 wrongly assumes that VXLAN_VDI_MASK includes
eight lower order reserved bits of VNI field that are using for remote
checksum offload.

Right now, when VNI number greater then 0xffff, vxlan_udp_encap_recv()
will always return with 'bad_flag' error, reducing the usable vni range
from 0..16777215 to 0..65535. Also, it doesn't really check whether RCO
bits processed or not.

Fix it by adding new VNI mask which has all 32 bits of VNI field:
24 bits for id and 8 bits for other usage.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 13:08:07 -04:00
Simon Horman
719a11cdbf vxlan: Don't set s_addr in vxlan_create_sock
In the case of AF_INET s_addr was set to INADDR_ANY (0) which which both
symmetric with the AF_INET6 case, where s_addr is not set, and unnecessary
as udp_conf is zeroed out earlier in the same function.

I suspect this change does not have any run-time effect due to compiler
optimisations. But it does make the code a little easier on the/my eyes.

Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 23:23:16 -04:00
Tom Herbert
0ace2ca89c vxlan: Use checksum partial with remote checksum offload
Change remote checksum handling to set checksum partial as default
behavior. Added an iflink parameter to configure not using
checksum partial (calling csum_partial to update checksum).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:13 -08:00
Tom Herbert
15e2396d4e net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload
This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.

Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload.  Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:12 -08:00
Tom Herbert
26c4f7da3e net: Fix remcsum in GRO path to not change packet
Remote checksum offload processing is currently the same for both
the GRO and non-GRO path. When the remote checksum offload option
is encountered, the checksum field referred to is modified in
the packet. So in the GRO case, the packet is modified in the
GRO path and then the operation is skipped when the packet goes
through the normal path based on skb->remcsum_offload. There is
a problem in that the packet may be modified in the GRO path, but
then forwarded off host still containing the remote checksum option.
A remote host will again perform RCO but now the checksum verification
will fail since GRO RCO already modified the checksum.

To fix this, we ensure that GRO restores a packet to it's original
state before returning. In this model, when GRO processes a remote
checksum option it still changes the checksum per the algorithm
but on return from lower layer processing the checksum is restored
to its original value.

In this patch we add define gro_remcsum structure which is passed
to skb_gro_remcsum_process to save offset and delta for the checksum
being changed. After lower layer processing, skb_gro_remcsum_cleanup
is called to restore the checksum before returning from GRO.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:09 -08:00
David S. Miller
2573beec56 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 14:35:57 -08:00
Rasmus Villemoes
a4870f79c2 vxlan: Wrong type passed to %pIS
src_ip is a pointer to a union vxlan_addr, one member of which is a
struct sockaddr. Passing a pointer to src_ip is wrong; one should pass
the value of src_ip itself. Since %pIS formally expects something of
type struct sockaddr*, let's pass a pointer to the appropriate union
member, though this of course doesn't change the generated code.

Fixes: e4c7ed4153 ("vxlan: add ipv6 support")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 17:14:40 -08:00
David S. Miller
6e03f896b5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/vxlan.c
	drivers/vhost/net.c
	include/linux/if_vlan.h
	net/core/dev.c

The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.

In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.

In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.

In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 14:33:28 -08:00
Thomas Graf
db79a62183 vxlan: Only set has-GBP bit in header if any other bits would be set
This allows for a VXLAN-GBP socket to talk to a Linux VXLAN socket by
not setting any of the bits.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 00:38:02 -08:00
Tom Herbert
dcdc899469 net: add skb functions to process remote checksum offload
This patch adds skb_remcsum_process and skb_gro_remcsum_process to
perform the appropriate adjustments to the skb when receiving
remote checksum offload.

Updated vxlan and gue to use these functions.

Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
not see any change in performance.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 13:54:07 -08:00
Nicolas Dichtel
33564bbb2c vxlan: setup the right link netns in newlink hdlr
Rename the netns to src_net to avoid confusion with the netns where the
interface stands. The user may specify IFLA_NET_NS_[PID|FD] to create
a x-netns netndevice: IFLA_NET_NS_[PID|FD] points to the netns where the
netdevice stands and src_net to the link netns.

Note that before commit f01ec1c017 ("vxlan: add x-netns support"), it was
possible to create a x-netns vxlan netdevice, but the netdevice was not
operational.

Fixes: f01ec1c017 ("vxlan: add x-netns support")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-29 14:20:02 -08:00
Nicolas Dichtel
4967082b46 vxlan: advertise link netns in fdb messages
Previous commit is based on a wrong assumption, fdb messages are always sent
into the netns where the interface stands (see vxlan_fdb_notify()).

These fdb messages doesn't embed the rtnl attribute IFLA_LINK_NETNSID, thus we
need to add it (useful to interpret NDA_IFINDEX or NDA_DST for example).

Note also that vxlan_nlmsg_size() was not updated.

Fixes: 193523bf93 ("vxlan: advertise netns of vxlan dev in fdb msg")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 17:11:07 -08:00
Tom Herbert
af33c1adae vxlan: Eliminate dependency on UDP socket in transmit path
In the vxlan transmit path there is no need to reference the socket
for a tunnel which is needed for the receive side. We do, however,
need the vxlan_dev flags. This patch eliminate references
to the socket in the transmit path, and changes VXLAN_F_UNSHAREABLE
to be VXLAN_F_RCV_FLAGS. This mask is used to store the flags
applicable to receive (GBP, CSUM6_RX, and REMCSUM_RX) in the
vxlan_sock flags.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-24 23:15:40 -08:00
Tom Herbert
d998f8efa4 udp: Do not require sock in udp_tunnel_xmit_skb
The UDP tunnel transmit functions udp_tunnel_xmit_skb and
udp_tunnel6_xmit_skb include a socket argument. The socket being
passed to the functions (from VXLAN) is a UDP created for receive
side. The only thing that the socket is used for in the transmit
functions is to get the setting for checksum (enabled or zero).
This patch removes the argument and and adds a nocheck argument
for checksum setting. This eliminates the unnecessary dependency
on a UDP socket for UDP tunnel transmit.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-24 23:15:40 -08:00
Nicolas Dichtel
193523bf93 vxlan: advertise netns of vxlan dev in fdb msg
Netlink FDB messages are sent in the link netns. The header of these messages
contains the ifindex (ndm_ifindex) of the netdevice, but this ifindex is
unusable in case of x-netns vxlan.
I named the new attribute NDA_NDM_IFINDEX_NETNSID, to avoid confusion with
NDA_IFINDEX.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-23 17:51:15 -08:00
Nicolas Dichtel
1728d4fabd tunnels: advertise link netns via netlink
Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-19 14:32:03 -05:00
Johannes Berg
053c095a82 netlink: make nlmsg_end() and genlmsg_end() void
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

  if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

  return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

  if (my_function(...))
    /* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

-	return nlmsg_end(...);
+	nlmsg_end(...);
+	return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18 01:03:45 -05:00
Thomas Graf
ac5132d1a0 vxlan: Only bind to sockets with compatible flags enabled
A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of flags/extensions enabled. If
incompatible flags are enabled, return a conflict to have the caller
create a distinct socket with distinct port.

The OVS VXLAN port is kept unaware of extensions at this point.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15 01:11:41 -05:00
Thomas Graf
3511494ce2 vxlan: Group Policy extension
Implements supports for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata is mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.

The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.

SELinux allows to manage label to secure local resources. However,
distributed applications require ACLs to implemented across hosts. This
is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow to map security contexts to
universal labels.  However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.

           Host 1:                       Host 2:

      Group A        Group B        Group B     Group A
      +-----+   +-------------+    +-------+   +-----+
      | lxc |   | SELinux CTX |    | httpd |   | VM  |
      +--+--+   +--+----------+    +---+---+   +--+--+
	  \---+---/                     \----+---/
	      |                              |
	  +---+---+                      +---+---+
	  | vxlan |                      | vxlan |
	  +---+---+                      +---+---+
	      +------------------------------+

Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP  frames. The extension is therefore disabled by default
and needs to be specifically enabled:

   ip link add [...] type vxlan [...] gbp

In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.

Examples:
 iptables:
  host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
  host2# iptables -I INPUT -m mark --mark 0x200 -j DROP

 OVS:
  # ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
  # ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'

[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15 01:11:41 -05:00
Tom Herbert
dfd8645ea1 vxlan: Remote checksum offload
Add support for remote checksum offload in VXLAN. This uses a
reserved bit to indicate that RCO is being done, and uses the low order
reserved eight bits of the VNI to hold the start and offset values in a
compressed manner.

Start is encoded in the low order seven bits of VNI. This is start >> 1
so that the checksum start offset is 0-254 using even values only.
Checksum offset (transport checksum field) is indicated in the high
order bit in the low order byte of the VNI. If the bit is set, the
checksum field is for UDP (so offset = start + 6), else checksum
field is for TCP (so offset = start + 16). Only TCP and UDP are
supported in this implementation.

Remote checksum offload for VXLAN is described in:

https://tools.ietf.org/html/draft-herbert-vxlan-rco-00

Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).

With UDP checksums and Remote Checksum Offload
  IPv4
      Client
        11.84% CPU utilization
      Server
        12.96% CPU utilization
      9197 Mbps
  IPv6
      Client
        12.46% CPU utilization
      Server
        14.48% CPU utilization
      8963 Mbps

With UDP checksums, no remote checksum offload
  IPv4
      Client
        15.67% CPU utilization
      Server
        14.83% CPU utilization
      9094 Mbps
  IPv6
      Client
        16.21% CPU utilization
      Server
        14.32% CPU utilization
      9058 Mbps

No UDP checksums
  IPv4
      Client
        15.03% CPU utilization
      Server
        23.09% CPU utilization
      9089 Mbps
  IPv6
      Client
        16.18% CPU utilization
      Server
        26.57% CPU utilization
       8954 Mbps

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-14 15:20:04 -05:00
Tom Herbert
a2b12f3c7a udp: pass udp_offload struct to UDP gro callbacks
This patch introduces udp_offload_callbacks which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except there is an argument to a udp_offload struct passed to
gro_receive and gro_complete functions. This additional argument
can be used to retrieve the per port structure of the encapsulation
for use in gro processing (mostly by doing container_of on the
structure).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-14 15:20:04 -05:00
Jiri Pirko
df8a39defa net: rename vlan_tx_* helpers since "tx" is misleading there
The same macros are used for rx as well. So rename it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13 17:51:08 -05:00
Tom Herbert
3bf3947526 vxlan: Improve support for header flags
This patch cleans up the header flags of VXLAN in anticipation of
defining some new ones:

- Move header related definitions from vxlan.c to vxlan.h
- Change VXLAN_FLAGS to be VXLAN_HF_VNI (only currently defined flag)
- Move check for unknown flags to after we find vxlan_sock, this
  assumes that some flags may be processed based on tunnel
  configuration
- Add a comment about why the stack treating unknown set flags as an
  error instead of ignoring them

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12 16:05:01 -05:00
Jesse Gross
9b174d88c2 net: Add Transparent Ethernet Bridging GRO support.
Currently the only tunnel protocol that supports GRO with encapsulated
Ethernet is VXLAN. This pulls out the Ethernet code into a proper layer
so that it can be used by other tunnel protocols such as GRE and Geneve.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-02 15:46:41 -05:00
Pravin B Shelar
74f47278cb vxlan: Fix double free of skb.
In case of error vxlan_xmit_one() can free already freed skb.
Also fixes memory leak of dst-entry.

Fixes: acbf74a763 ("vxlan: Refactor vxlan driver to make use
of the common UDP tunnel functions").

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:57:31 -05:00
Marcelo Leitner
00c83b01d5 Fix race condition between vxlan_sock_add and vxlan_sock_release
Currently, when trying to reuse a socket, vxlan_sock_add will grab
vn->sock_lock, locate a reusable socket, inc refcount and release
vn->sock_lock.

But vxlan_sock_release() will first decrement refcount, and then grab
that lock. refcnt operations are atomic but as currently we have
deferred works which hold vs->refcnt each, this might happen, leading to
a use after free (specially after vxlan_igmp_leave):

  CPU 1                            CPU 2

deferred work                    vxlan_sock_add
  ...                              ...
                                   spin_lock(&vn->sock_lock)
                                   vs = vxlan_find_sock();
  vxlan_sock_release
    dec vs->refcnt, reaches 0
    spin_lock(&vn->sock_lock)
                                   vxlan_sock_hold(vs), refcnt=1
                                   spin_unlock(&vn->sock_lock)
    hlist_del_rcu(&vs->hlist);
    vxlan_notify_del_rx_port(vs)
    spin_unlock(&vn->sock_lock)

So when we look for a reusable socket, we check if it wasn't freed
already before reusing it.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Fixes: 7c47cedf43 ("vxlan: move IGMP join/leave to work queue")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:57:08 -05:00
Jiri Pirko
f6f6424ba7 net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del
Do the work of parsing NDA_VLAN directly in rtnetlink code, pass simple
u16 vid to drivers from there.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:18 -08:00
David S. Miller
60b7379dc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
Alexander Duyck
3dc2b6a8d3 vxlan: Fix boolean flip in VXLAN_F_UDP_ZERO_CSUM6_[TX|RX]
In "vxlan: Call udp_sock_create" there was a logic error that resulted in
the default for IPv6 VXLAN tunnels going from using checksums to not using
checksums.  Since there is currently no support in iproute2 for setting
these values it means that a kernel after the change cannot talk over a IPv6
VXLAN tunnel to a kernel prior the change.

Fixes: 3ee64f3 ("vxlan: Call udp_sock_create")

Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-25 14:12:12 -05:00
David S. Miller
1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Jiri Pirko
5968250c86 vlan: introduce *vlan_hwaccel_push_inside helpers
Use them to push skb->vlan_tci into the payload and avoid code
duplication.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Jiri Pirko
62749e2cb3 vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Joe Stringer
11bf7828a5 vxlan: Inline vxlan_gso_check().
Suggested-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:38:44 -05:00
Joe Stringer
23e62de33d net: Add vxlan_gso_check() helper
Most NICs that report NETIF_F_GSO_UDP_TUNNEL support VXLAN, and not
other UDP-based encapsulation protocols where the format and size of the
header differs. This patch implements a generic ndo_gso_check() for
VXLAN which will only advertise GSO support when the skb looks like it
contains VXLAN (or no UDP tunnelling at all).

Implementation shamelessly stolen from Tom Herbert:
http://thread.gmane.org/gmane.linux.network/332428/focus=333111

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14 17:12:48 -05:00
David S. Miller
076ce44825 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/chelsio/cxgb4vf/sge.c
	drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c

sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.

ixgbe_phy.c was a set of overlapping whitespace changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14 01:01:12 -05:00
Marcelo Leitner
19ca9fc144 vxlan: Do not reuse sockets for a different address family
Currently, we only match against local port number in order to reuse
socket. But if this new vxlan wants an IPv6 socket and a IPv4 one bound
to that port, vxlan will reuse an IPv4 socket as IPv6 and a panic will
follow. The following steps reproduce it:

   # ip link add vxlan6 type vxlan id 42 group 229.10.10.10 \
       srcport 5000 6000 dev eth0
   # ip link add vxlan7 type vxlan id 43 group ff0e::110 \
       srcport 5000 6000 dev eth0
   # ip link set vxlan6 up
   # ip link set vxlan7 up
   <panic>

[    4.187481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
...
[    4.188076] Call Trace:
[    4.188085]  [<ffffffff81667c4a>] ? ipv6_sock_mc_join+0x3a/0x630
[    4.188098]  [<ffffffffa05a6ad6>] vxlan_igmp_join+0x66/0xd0 [vxlan]
[    4.188113]  [<ffffffff810a3430>] process_one_work+0x220/0x710
[    4.188125]  [<ffffffff810a33c4>] ? process_one_work+0x1b4/0x710
[    4.188138]  [<ffffffff810a3a3b>] worker_thread+0x11b/0x3a0
[    4.188149]  [<ffffffff810a3920>] ? process_one_work+0x710/0x710

So address family must also match in order to reuse a socket.

Reported-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:19:59 -05:00
Jesse Gross
cfdf1e1ba5 udptunnel: Add SKB_GSO_UDP_TUNNEL during gro_complete.
When doing GRO processing for UDP tunnels, we never add
SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
is added (such as SKB_GSO_TCPV4). The result is that if the packet is
later resegmented we will do GSO but not treat it as a tunnel. This
results in UDP fragmentation of the outer header instead of (i.e.) TCP
segmentation of the inner header as was originally on the wire.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 15:09:45 -05:00
Tom Herbert
5c91ae08e4 vxlan: Fix to enable UDP checksums on interface
Add definition to vxlan nla_policy for UDP checksum. This is necessary
to enable UDP checksums on VXLAN.

In some instances, enabling UDP checksums can improve performance on
receive for devices that return legacy checksum-unnecessary for UDP/IP.
Also, UDP checksum provides some protection against VNI corruption.

Testing:

Ran 200 instances of TCP_STREAM and TCP_RR on bnx2x.

TCP_STREAM
  IPv4, without UDP checksums
      14.41% TX CPU utilization
      25.71% RX CPU utilization
      9083.4 Mbps
  IPv4, with UDP checksums
      13.99% TX CPU utilization
      13.40% RX CPU utilization
      9095.65 Mbps

TCP_RR
  IPv4, without UDP checksums
      94.08% TX CPU utilization
      156/248/462 90/95/99% latencies
      1.12743e+06 tps
  IPv4, with UDP checksums
      94.43% TX CPU utilization
      158/250/462 90/95/99% latencies
      1.13345e+06 tps

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-06 21:59:55 -05:00
Li RongQing
7a9f526fc3 vxlan: fix a free after use
pskb_may_pull maybe change skb->data and make eth pointer oboslete,
so eth needs to reload

Fixes: 91269e390d ("vxlan: using pskb_may_pull as early as possible")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-17 16:23:58 -04:00
Li RongQing
91269e390d vxlan: using pskb_may_pull as early as possible
pskb_may_pull should be used to check if skb->data has enough space,
skb->len can not ensure that.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15 23:33:23 -04:00
Li RongQing
ce6502a8f9 vxlan: fix a use after free in vxlan_encap_bypass
when netif_rx() is done, the netif_rx handled skb maybe be freed,
and should not be used.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15 23:30:28 -04:00
Eric Dumazet
0287587884 net: better IFF_XMIT_DST_RELEASE support
Testing xmit_more support with netperf and connected UDP sockets,
I found strange dst refcount false sharing.

Current handling of IFF_XMIT_DST_RELEASE is not optimal.

Dropping dst in validate_xmit_skb() is certainly too late in case
packet was queued by cpu X but dequeued by cpu Y

The logical point to take care of drop/force is in __dev_queue_xmit()
before even taking qdisc lock.

As Julian Anastasov pointed out, need for skb_dst() might come from some
packet schedulers or classifiers.

This patch adds new helper to cleanly express needs of various drivers
or qdiscs/classifiers.

Drivers that need skb_dst() in their ndo_start_xmit() should call
following helper in their setup instead of the prior :

	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
->
	netif_keep_dst(dev);

Instead of using a single bit, we use two bits, one being
eventually rebuilt in bonding/team drivers.

The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
rebuilt in bonding/team. Eventually, we could add something
smarter later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07 13:22:11 -04:00
Tom Herbert
996c9fd167 vxlan: Set inner protocol before transmit
Call skb_set_inner_protocol to set inner Ethernet protocol to
ETH_P_TEB before transmit. This is needed for GSO with UDP tunnels.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:52 -04:00
Andy Zhou
3c4d1daece vxlan: Fix bug introduced by commit acbf74a763
Commit acbf74a763 ("vxlan: Refactor vxlan driver to make use of the common UDP tunnel functions." introduced a bug in vxlan_xmit_one()
function, causing it to transmit Vxlan packets without proper
Vxlan header inserted. The change was not needed in the first
place. Revert it.

Reported-by: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 15:32:10 -04:00
Andy Zhou
acbf74a763 vxlan: Refactor vxlan driver to make use of the common UDP tunnel functions.
Simplify vxlan implementation using common UDP tunnel APIs.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-19 15:57:15 -04:00
Tom Herbert
c60c308cbd vxlan: Enable checksum unnecessary conversions for vxlan/UDP sockets
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 21:36:28 -07:00
Tom Herbert
77cffe23c1 net: Clarification of CHECKSUM_UNNECESSARY
This patch:
 - Clarifies the specific requirements of devices returning
   CHECKSUM_UNNECESSARY (comments in skbuff.h).
 - Adds csum_level field to skbuff. This is used to express how
   many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
   This replaces the overloading of skb->encapsulation, that field is
   is now only used to indicate inner headers are valid.
 - Change __skb_checksum_validate_needed to "consume" each checksum
   as indicated by csum_level as layers of the the packet are parsed.
 - Remove skb_pop_rcv_encapsulation, no longer needed in the new
   csum_level model.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-29 20:41:11 -07:00
Gerhard Stenzel
a45e92a599 vxlan: fix incorrect initializer in union vxlan_addr
The first initializer in the following

        union vxlan_addr ipa = {
            .sin.sin_addr.s_addr = tip,
            .sa.sa_family = AF_INET,
        };

is optimised away by the compiler, due to the second initializer,
therefore initialising .sin.sin_addr.s_addr always to 0.
This results in netlink messages indicating a L3 miss never contain the
missed IP address. This was observed with GCC 4.8 and 4.9. I do not know about previous versions.
The problem affects user space programs relying on an IP address being
sent as part of a netlink message indicating a L3 miss.

Changing
            .sa.sa_family = AF_INET,
to
            .sin.sin_family = AF_INET,
fixes the problem.

Signed-off-by: Gerhard Stenzel <gerhard.stenzel@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 19:54:56 -07:00
David S. Miller
f139c74a8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 13:25:49 -07:00
Jun Zhao
545469f7a5 neighbour : fix ndm_type type error issue
ndm_type means L3 address type, in neighbour proxy and vxlan, it's RTN_UNICAST.
NDA_DST is for netlink TLV type, hence it's not right value in this context.

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-28 17:52:17 -07:00
Tom Herbert
3ee64f39be vxlan: Call udp_sock_create
In vxlan driver call common function udp_sock_create to create the
listener UDP port.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 16:12:15 -07:00
Jamal Hadi Salim
5d5eacb34c bridge: fdb dumping takes a filter device
Dumping a bridge fdb dumps every fdb entry
held. With this change we are going to filter
on selected bridge port.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-10 12:37:33 -07:00
Tom Herbert
535fb8d006 vxlan: Call udp_flow_src_port
In vxlan and OVS vport-vxlan call common function to get source port
for a UDP tunnel. Removed vxlan_src_port since the functionality is
now in udp_flow_src_port.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07 21:14:21 -07:00
Tom Herbert
f79b064c15 vxlan: Checksum fixes
Call skb_pop_rcv_encapsulation and postpull_rcsum for the Ethernet
header to work properly with checksum complete.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-15 01:00:50 -07:00
Cong Wang
2853af6a2e vxlan: use dev->needed_headroom instead of dev->hard_header_len
When we mirror packets from a vxlan tunnel to other device,
the mirror device should see the same packets (that is, without
outer header). Because vxlan tunnel sets dev->hard_header_len,
tcf_mirred() resets mac header back to outer mac, the mirror device
actually sees packets with outer headers

Vxlan tunnel should set dev->needed_headroom instead of
dev->hard_header_len, like what other ip tunnels do. This fixes
the above problem.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: stephen hemminger <stephen@networkplumber.org>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-13 15:27:59 -07:00
Tom Herbert
6bae1d4cc3 net: Add skb_gro_postpull_rcsum to udp and vxlan
Need to gro_postpull_rcsum for GRO to work with checksum complete.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:46:13 -07:00
Tom Herbert
359a0ea987 vxlan: Add support for UDP checksums (v4 sending, v6 zero csums)
Added VXLAN link configuration for sending UDP checksums, and allowing
TX and RX of UDP6 checksums.

Also, call common iptunnel_handle_offloads and added GSO support for
checksums.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-04 22:46:39 -07:00
Wilfried Klaebe
7ad24ea4bf net: get rid of SET_ETHTOOL_OPS
net: get rid of SET_ETHTOOL_OPS

Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.

Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
-       SET_ETHTOOL_OPS(dev, ops);
+       dev->ethtool_ops = ops;

Compile tested only, but I'd seriously wonder if this broke anything.

Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 17:43:20 -04:00
Nicolas Dichtel
f01ec1c017 vxlan: add x-netns support
This patch allows to switch the netns when packet is encapsulated or
decapsulated.
The vxlan socket is openned into the i/o netns, ie into the netns where
encapsulated packets are received. The socket lookup is done into this netns to
find the corresponding vxlan tunnel. After decapsulation, the packet is
injecting into the corresponding interface which may stand to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Configuration example:
ip netns add netns1
ip netns exec netns1 ip link set lo up
ip link add vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
ip link set vxlan10 netns netns1
ip netns exec netns1 ip addr add 192.168.0.249/24 broadcast 192.168.0.255 dev vxlan10
ip netns exec netns1 ip link set vxlan10 up

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24 16:18:26 -04:00
Nicolas Dichtel
9e4b93f905 vxlan: ensure to advertise the right fdb remote
The goal of this patch is to fix rtnelink notification. The main problem was
about notification for fdb entry with more than one remote. Before the patch,
when a remote was added to an existing fdb entry, the kernel advertised the
first remote instead of the added one. Also when a remote was removed from a fdb
entry with several remotes, the deleted remote was not advertised.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-23 15:01:09 -04:00
Eric Dumazet
aad88724c9 ipv4: add a sock pointer to dst->output() path.
In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.

The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.

Fixes: 8f646c922d ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-15 13:47:15 -04:00
Mike Rapoport
5933a7bbb5 net: vxlan: fix crash when interface is created with no group
If the vxlan interface is created without explicit group definition,
there are corner cases which may cause kernel panic.

For instance, in the following scenario:

node A:
$ ip link add dev vxlan42  address 2c:c2:60:00:10:20 type vxlan id 42
$ ip addr add dev vxlan42 10.0.0.1/24
$ ip link set up dev vxlan42
$ arp -i vxlan42 -s 10.0.0.2 2c:c2:60:00:01:02
$ bridge fdb add dev vxlan42 to 2c:c2:60:00:01:02 dst <IPv4 address>
$ ping 10.0.0.2

node B:
$ ip link add dev vxlan42 address 2c:c2:60:00:01:02 type vxlan id 42
$ ip addr add dev vxlan42 10.0.0.2/24
$ ip link set up dev vxlan42
$ arp -i vxlan42 -s 10.0.0.1 2c:c2:60:00:10:20

node B crashes:

 vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
 vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
 BUG: unable to handle kernel NULL pointer dereference at 0000000000000046
 IP: [<ffffffff8143c459>] ip6_route_output+0x58/0x82
 PGD 7bd89067 PUD 7bd4e067 PMD 0
 Oops: 0000 [#1] SMP
 Modules linked in:
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc8-hvx-xen-00019-g97a5221-dirty #154
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: ffff88007c774f50 ti: ffff88007c79c000 task.ti: ffff88007c79c000
 RIP: 0010:[<ffffffff8143c459>]  [<ffffffff8143c459>] ip6_route_output+0x58/0x82
 RSP: 0018:ffff88007fd03668  EFLAGS: 00010282
 RAX: 0000000000000000 RBX: ffffffff8186a000 RCX: 0000000000000040
 RDX: 0000000000000000 RSI: ffff88007b0e4a80 RDI: ffff88007fd03754
 RBP: ffff88007fd03688 R08: ffff88007b0e4a80 R09: 0000000000000000
 R10: 0200000a0100000a R11: 0001002200000000 R12: ffff88007fd03740
 R13: ffff88007b0e4a80 R14: ffff88007b0e4a80 R15: ffff88007bba0c50
 FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000046 CR3: 000000007bb60000 CR4: 00000000000006e0
 Stack:
  0000000000000000 ffff88007fd037a0 ffffffff8186a000 ffff88007fd03740
  ffff88007fd036c8 ffffffff814320bb 0000000000006e49 ffff88007b8b7360
  ffff88007bdbf200 ffff88007bcbc000 ffff88007b8b7000 ffff88007b8b7360
 Call Trace:
  <IRQ>
  [<ffffffff814320bb>] ip6_dst_lookup_tail+0x2d/0xa4
  [<ffffffff814322a5>] ip6_dst_lookup+0x10/0x12
  [<ffffffff81323b4e>] vxlan_xmit_one+0x32a/0x68c
  [<ffffffff814a325a>] ? _raw_spin_unlock_irqrestore+0x12/0x14
  [<ffffffff8104c551>] ? lock_timer_base.isra.23+0x26/0x4b
  [<ffffffff8132451a>] vxlan_xmit+0x66a/0x6a8
  [<ffffffff8141a365>] ? ipt_do_table+0x35f/0x37e
  [<ffffffff81204ba2>] ? selinux_ip_postroute+0x41/0x26e
  [<ffffffff8139d0c1>] dev_hard_start_xmit+0x2ce/0x3ce
  [<ffffffff8139d491>] __dev_queue_xmit+0x2d0/0x392
  [<ffffffff813b380f>] ? eth_header+0x28/0xb5
  [<ffffffff8139d569>] dev_queue_xmit+0xb/0xd
  [<ffffffff813a5aa6>] neigh_resolve_output+0x134/0x152
  [<ffffffff813db741>] ip_finish_output2+0x236/0x299
  [<ffffffff813dc074>] ip_finish_output+0x98/0x9d
  [<ffffffff813dc749>] ip_output+0x62/0x67
  [<ffffffff813da9f2>] dst_output+0xf/0x11
  [<ffffffff813dc11c>] ip_local_out+0x1b/0x1f
  [<ffffffff813dcf1b>] ip_send_skb+0x11/0x37
  [<ffffffff813dcf70>] ip_push_pending_frames+0x2f/0x33
  [<ffffffff813ff732>] icmp_push_reply+0x106/0x115
  [<ffffffff813ff9e4>] icmp_reply+0x142/0x164
  [<ffffffff813ffb3b>] icmp_echo.part.16+0x46/0x48
  [<ffffffff813c1d30>] ? nf_iterate+0x43/0x80
  [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
  [<ffffffff813ffb62>] icmp_echo+0x25/0x27
  [<ffffffff814005f7>] icmp_rcv+0x1d2/0x20a
  [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
  [<ffffffff813d810d>] ip_local_deliver_finish+0xd6/0x14f
  [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
  [<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
  [<ffffffff813d82bf>] ip_local_deliver+0x4a/0x4f
  [<ffffffff813d7f7b>] ip_rcv_finish+0x253/0x26a
  [<ffffffff813d7d28>] ? inet_add_protocol+0x3e/0x3e
  [<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
  [<ffffffff813d856a>] ip_rcv+0x2a6/0x2ec
  [<ffffffff8139a9a0>] __netif_receive_skb_core+0x43e/0x478
  [<ffffffff812a346f>] ? virtqueue_poll+0x16/0x27
  [<ffffffff8139aa2f>] __netif_receive_skb+0x55/0x5a
  [<ffffffff8139aaaa>] process_backlog+0x76/0x12f
  [<ffffffff8139add8>] net_rx_action+0xa2/0x1ab
  [<ffffffff81047847>] __do_softirq+0xca/0x1d1
  [<ffffffff81047ace>] irq_exit+0x3e/0x85
  [<ffffffff8100b98b>] do_IRQ+0xa9/0xc4
  [<ffffffff814a37ad>] common_interrupt+0x6d/0x6d
  <EOI>
  [<ffffffff810378db>] ? native_safe_halt+0x6/0x8
  [<ffffffff810110c7>] default_idle+0x9/0xd
  [<ffffffff81011694>] arch_cpu_idle+0x13/0x1c
  [<ffffffff8107480d>] cpu_startup_entry+0xbc/0x137
  [<ffffffff8102e741>] start_secondary+0x1a0/0x1a5
 Code: 24 14 e8 f1 e5 01 00 31 d2 a8 32 0f 95 c2 49 8b 44 24 2c 49 0b 44 24 24 74 05 83 ca 04 eb 1c 4d 85 ed 74 17 49 8b 85 a8 02 00 00 <66> 8b 40 46 66 c1 e8 07 83 e0 07 c1 e0 03 09 c2 4c 89 e6 48 89
 RIP  [<ffffffff8143c459>] ip6_route_output+0x58/0x82
  RSP <ffff88007fd03668>
 CR2: 0000000000000046
 ---[ end trace 4612329caab37efd ]---

When vxlan interface is created without explicit group definition, the
default_dst protocol family is initialiazed to AF_UNSPEC and the driver
assumes IPv4 configuration. On the other side, the default_dst protocol
family is used to differentiate between IPv4 and IPv6 cases and, since,
AF_UNSPEC != AF_INET, the processing takes the IPv6 path.

Making the IPv4 assumption explicit by settting default_dst protocol
family to AF_INET4 and preventing mixing of IPv4 and IPv6 addresses in
snooped fdb entries fixes the corner case crashes.

Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-03 11:16:47 -04:00
David S. Miller
04f58c8854 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	Documentation/devicetree/bindings/net/micrel-ks8851.txt
	net/core/netpoll.c

The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.

In micrel-ks8851.txt we simply have overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-25 20:29:20 -04:00
David Stevens
4b29dba9c0 vxlan: fix nonfunctional neigh_reduce()
The VXLAN neigh_reduce() code is completely non-functional since
check-in. Specific errors:

1) The original code drops all packets with a multicast destination address,
	even though neighbor solicitations are sent to the solicited-node
	address, a multicast address. The code after this check was never run.
2) The neighbor table lookup used the IPv6 header destination, which is the
	solicited node address, rather than the target address from the
	neighbor solicitation. So neighbor lookups would always fail if it
	got this far. Also for L3MISSes.
3) The code calls ndisc_send_na(), which does a send on the tunnel device.
	The context for neigh_reduce() is the transmit path, vxlan_xmit(),
	where the host or a bridge-attached neighbor is trying to transmit
	a neighbor solicitation. To respond to it, the tunnel endpoint needs
	to do a *receive* of the appropriate neighbor advertisement. Doing a
	send, would only try to send the advertisement, encapsulated, to the
	remote destinations in the fdb -- hosts that definitely did not do the
	corresponding solicitation.
4) The code uses the tunnel endpoint IPv6 forwarding flag to determine the
	isrouter flag in the advertisement. This has nothing to do with whether
	or not the target is a router, and generally won't be set since the
	tunnel endpoint is bridging, not routing, traffic.

	The patch below creates a proxy neighbor advertisement to respond to
neighbor solicitions as intended, providing proper IPv6 support for neighbor
reduction.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-24 15:35:10 -04:00
David Stevens
7346135dcd vxlan: fix potential NULL dereference in arp_reduce()
This patch fixes a NULL pointer dereference in the event of an
skb allocation failure in arp_reduce().

Signed-Off-By: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Cong Wang <cwang@twopensource.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-18 16:09:34 -04:00
Pablo Neira Ayuso
86c3f0f830 vxlan: remove unused port variable in vxlan_udp_encap_recv()
Signed-off-by: Pablo Neira Ayuso <pablo@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 15:46:25 -05:00
WANG Cong
1c213bd24a net: introduce netdev_alloc_pcpu_stats() for drivers
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 15:49:55 -05:00
Daniel Baluta
4fe46b9a4d vxlan: remove extra newline after function definition
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-01 16:58:02 -08:00
Or Gerlitz
920a0fde5a net/vxlan: Go over all candidate streams for GRO matching
The loop in vxlan_gro_receive() over the current set of candidates for
coalescing was wrongly aborted once a match was found. In rare cases,
this can cause a false-positives matching in the next layer GRO checks.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-30 16:25:49 -08:00
Or Gerlitz
d0bc65557a net/vxlan: Share RX skb de-marking and checksum checks with ovs
Make sure the practice set by commit 0afb166 "vxlan: Add capability
of Rx checksum offload for inner packet" is applied when the skb
goes through the portion of the RX code which is shared between
vxlan netdevices and ovs vxlan port instances.

Cc: Joseph Gasparakis <joseph.gasparakis@intel.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-23 13:30:03 -08:00
Daniel Borkmann
783c146335 net: vxlan: convert to act as a pernet subsystem
As per suggestion from Eric W. Biederman, vxlan should be using
{un,}register_pernet_subsys() instead of {un,}register_pernet_device()
to ensure the vxlan_net structure is initialized before and cleaned
up after all network devices in a given network namespace i.e. when
dealing with network notifiers. This is similarly handeled already in
commit 91e2ff3528 ("net: Teach vlans to cleanup as a pernet subsystem")
and, thus, improves upon fd27e0d44a ("net: vxlan: do not use vxlan_net
before checking event type"). Just as in 91e2ff3528, we do not need
to explicitly handle deletion of vxlan devices as network namespace
exit calls dellink on all remaining virtual devices, and
rtnl_link_unregister() calls dellink on all outstanding devices in that
network namespace, so we can entirely drop the pernet exit operation
as well. Moreover, on vxlan module exit, rcu_barrier() is called by
netns since commit 3a765edadb ("netns: Add an explicit rcu_barrier
to unregister_pernet_{device|subsys}"), so this may be omitted. Tested
with various scenarios and works well on my side.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 17:43:38 -08:00
Or Gerlitz
dc01e7d344 net: Add GRO support for vxlan traffic
Add GRO handlers for vxlann, by using the UDP GRO infrastructure.

For single TCP session that goes through vxlan tunneling I got nice
improvement from 6.8Gbs to 11.5Gbs

--> UDP/VXLAN GRO disabled
$ netperf  -H 192.168.52.147 -c -C

$ netperf -t TCP_STREAM -H 192.168.52.147 -c -C
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.52.147 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.00      6799.75   12.54    24.79    0.604   1.195

--> UDP/VXLAN GRO enabled

$ netperf -t TCP_STREAM -H 192.168.52.147 -c -C
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.52.147 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.00      11562.72   24.90    20.34    0.706   0.577

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:05:04 -08:00
Jesse Brandeburg
ead5139aa1 net: add vxlan description
Add a description to the vxlan module, helping save the world
from the minions of destruction and confusion.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:28:07 -08:00
Daniel Borkmann
fd27e0d44a net: vxlan: do not use vxlan_net before checking event type
Jesse Brandeburg reported that commit acaf4e7099 caused a panic
when adding a network namespace while vxlan module was present in
the system:

[<ffffffff814d0865>] vxlan_lowerdev_event+0xf5/0x100
[<ffffffff816e9e5d>] notifier_call_chain+0x4d/0x70
[<ffffffff810912be>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff810912d6>] raw_notifier_call_chain+0x16/0x20
[<ffffffff815d9610>] call_netdevice_notifiers_info+0x40/0x70
[<ffffffff815d9656>] call_netdevice_notifiers+0x16/0x20
[<ffffffff815e1bce>] register_netdevice+0x1be/0x3a0
[<ffffffff815e1dce>] register_netdev+0x1e/0x30
[<ffffffff814cb94a>] loopback_net_init+0x4a/0xb0
[<ffffffffa016ed6e>] ? lockd_init_net+0x6e/0xb0 [lockd]
[<ffffffff815d6bac>] ops_init+0x4c/0x150
[<ffffffff815d6d23>] setup_net+0x73/0x110
[<ffffffff815d725b>] copy_net_ns+0x7b/0x100
[<ffffffff81090e11>] create_new_namespaces+0x101/0x1b0
[<ffffffff81090f45>] copy_namespaces+0x85/0xb0
[<ffffffff810693d5>] copy_process.part.26+0x935/0x1500
[<ffffffff811d5186>] ? mntput+0x26/0x40
[<ffffffff8106a15c>] do_fork+0xbc/0x2e0
[<ffffffff811b7f2e>] ? ____fput+0xe/0x10
[<ffffffff81089c5c>] ? task_work_run+0xac/0xe0
[<ffffffff8106a406>] SyS_clone+0x16/0x20
[<ffffffff816ee689>] stub_clone+0x69/0x90
[<ffffffff816ee329>] ? system_call_fastpath+0x16/0x1b

Apparently loopback device is being registered first and thus we
receive an event notification when vxlan_net is not ready. Hence,
when we call net_generic() and request vxlan_net_id, we seem to
access garbage at that point in time. In setup_net() where we set
up a newly allocated network namespace, we traverse the list of
pernet ops ...

list_for_each_entry(ops, &pernet_list, list) {
	error = ops_init(ops, net);
	if (error < 0)
		goto out_undo;
}

... and loopback_net_init() is invoked first here, so in the middle
of setup_net() we get this notification in vxlan. As currently we
only care about devices that unregister, move access through
net_generic() there. Fix is based on Cong Wang's proposal, but
only changes what is needed here. It sucks a bit as we only work
around the actual cure: right now it seems the only way to check if
a netns actually finished traversing all init ops would be to check
if it's part of net_namespace_list. But that I find quite expensive
each time we go through a notifier callback. Anyway, did a couple
of tests and it seems good for now.

Fixes: acaf4e7099 ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
Reported-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:49:18 -08:00
Daniel Borkmann
8425783c0f net: vxlan: properly cleanup devs on module unload
We should use vxlan_dellink() handler in vxlan_exit_net(), since
i) we're not in fast-path and we should be consistent in dismantle
just as we would remove a device through rtnl ops, and more
importantly, ii) in case future code will kfree() memory in
vxlan_dellink(), we would leak it right here unnoticed. Therefore,
do not only half of the cleanup work, but make it properly.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 23:38:39 -08:00
Daniel Borkmann
acaf4e7099 net: vxlan: when lower dev unregisters remove vxlan dev as well
We can create a vxlan device with an explicit underlying carrier.
In that case, when the carrier link is being deleted from the
system (e.g. due to module unload) we should also clean up all
created vxlan devices on top of it since otherwise we're in an
inconsistent state in vxlan device. In that case, the user needs
to remove all such devices, while in case of other virtual devs
that sit on top of physical ones, it is usually the case that
these devices do unregister automatically as well and do not
leave the burden on the user.

This work is not necessary when vxlan device was not created with
a real underlying device, as connections can resume in that case
when driver is plugged again. But at least for the other cases,
we should go ahead and do the cleanup on removal.

We don't register the notifier during vxlan_newlink() here since
I consider this event rather rare, and therefore we should not
bloat vxlan's core structure unecessary. Also, we can simply make
use of unregister_netdevice_many() to batch that. fdb is flushed
upon ndo_stop().

E.g. `ip -d link show vxlan13` after carrier removal before
this patch:

5: vxlan13: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default
    link/ether 1e:47:da:6d:4d:99 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vxlan id 13 group 239.0.0.10 dev 2 port 32768 61000 ageing 300
                                 ^^^^^
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 23:38:38 -08:00
Ying Xue
7376394930 vxlan: use __dev_get_by_index instead of dev_get_by_index to find interface
The following call chains indicate that vxlan_fdb_parse() is
under rtnl_lock protection. So if we use __dev_get_by_index()
instead of dev_get_by_index() to find interface handler in it,
this would help us avoid to change interface reference counter.

rtnetlink_rcv()
  rtnl_lock()
  netlink_rcv_skb()
    rtnl_fdb_add()
      vxlan_fdb_add()
        vxlan_fdb_parse()
  rtnl_unlock()

rtnetlink_rcv()
  rtnl_lock()
  netlink_rcv_skb()
    rtnl_fdb_del()
      vxlan_fdb_del()
        vxlan_fdb_parse()
  rtnl_unlock()

Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 18:50:46 -08:00
David S. Miller
56a4342dfe Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
	net/ipv6/ip6_tunnel.c
	net/ipv6/ip6_vti.c

ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.

qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 17:37:45 -05:00
Eric Dumazet
8f646c922d vxlan: keep original skb ownership
Sathya Perla posted a patch trying to address following problem :

<quote>
 The vxlan driver sets itself as the socket owner for all the TX flows
 it encapsulates (using vxlan_set_owner()) and assigns it's own skb
 destructor. This causes all tunneled traffic to land up on only one TXQ
 as all encapsulated skbs refer to the vxlan socket and not the original
 socket.  Also, the vxlan skb destructor breaks some functionality for
 tunneled traffic like wmem accounting and as TCP small queues and
 FQ/pacing packet scheduler.
</quote>

I reworked Sathya patch and added some explanations.

vxlan_xmit() can avoid one skb_clone()/dev_kfree_skb() pair
and gain better drop monitor accuracy, by calling kfree_skb() when
appropriate.

The UDP socket used by vxlan to perform encapsulation of xmit packets
do not need to be alive while packets leave vxlan code. Its better
to keep original socket ownership to get proper feedback from qdisc and
NIC layers.

We use skb->sk to

A) control amount of bytes/packets queued on behalf of a socket, but
prior vxlan code did the skb->sk transfert without any limit/control
on vxlan socket sk_sndbuf.

B) security purposes (as selinux) or netfilter uses, and I do not think
anything is prepared to handle vxlan stacked case in this area.

By not changing ownership, vxlan tunnels behave like other tunnels.
As Stephen mentioned, we might do the same change in L2TP.

Reported-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:37:09 -05:00
Li RongQing
8f84985fec net: unify the pcpu_tstats and br_cpu_netstats as one
They are same, so unify them as one, pcpu_sw_netstats.

Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 20:10:24 -05:00
fan.du
7bda701e01 {vxlan, inet6} Mark vxlan_dev flags with VXLAN_F_IPV6 properly
Even if user doesn't supply the physical netdev to attach vxlan dev
to, and at the same time user want to vxlan sit top of IPv6, mark
vxlan_dev flags with VXLAN_F_IPV6 to create IPv6 based socket.
Otherwise kernel crashes safely every time spitting below messages,

Steps to reproduce:
ip link add vxlan0 type vxlan id 42 group ff0e::110
ip link set vxlan0 up

[   62.656266] BUG: unable to handle kernel NULL pointer dereference[   62.656320] ip (3008) used greatest stack depth: 3912 bytes left
 at 0000000000000046
[   62.656423] IP: [<ffffffff816d822d>] ip6_route_output+0xbd/0xe0
[   62.656525] PGD 2c966067 PUD 2c9a2067 PMD 0
[   62.656674] Oops: 0000 [#1] SMP
[   62.656781] Modules linked in: vxlan netconsole deflate zlib_deflate af_key
[   62.657083] CPU: 1 PID: 2128 Comm: whoopsie Not tainted 3.12.0+ #182
[   62.657083] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[   62.657083] task: ffff88002e2335d0 ti: ffff88002c94c000 task.ti: ffff88002c94c000
[   62.657083] RIP: 0010:[<ffffffff816d822d>]  [<ffffffff816d822d>] ip6_route_output+0xbd/0xe0
[   62.657083] RSP: 0000:ffff88002fd038f8  EFLAGS: 00210296
[   62.657083] RAX: 0000000000000000 RBX: ffff88002fd039e0 RCX: 0000000000000000
[   62.657083] RDX: ffff88002fd0eb68 RSI: ffff88002fd0d278 RDI: ffff88002fd0d278
[   62.657083] RBP: ffff88002fd03918 R08: 0000000002000000 R09: 0000000000000000
[   62.657083] R10: 00000000000001ff R11: 0000000000000000 R12: 0000000000000001
[   62.657083] R13: ffff88002d96b480 R14: ffffffff81c8e2c0 R15: 0000000000000001
[   62.657083] FS:  0000000000000000(0000) GS:ffff88002fd00000(0063) knlGS:00000000f693b740
[   62.657083] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[   62.657083] CR2: 0000000000000046 CR3: 000000002c9d2000 CR4: 00000000000006e0
[   62.657083] Stack:
[   62.657083]  ffff88002fd03a40 ffffffff81c8e2c0 ffff88002fd039e0 ffff88002d96b480
[   62.657083]  ffff88002fd03958 ffffffff816cac8b ffff880019277cc0 ffff8800192b5d00
[   62.657083]  ffff88002d5bc000 ffff880019277cc0 0000000000001821 0000000000000001
[   62.657083] Call Trace:
[   62.657083]  <IRQ>
[   62.657083]  [<ffffffff816cac8b>] ip6_dst_lookup_tail+0xdb/0xf0
[   62.657083]  [<ffffffff816caea0>] ip6_dst_lookup+0x10/0x20
[   62.657083]  [<ffffffffa0020c13>] vxlan_xmit_one+0x193/0x9c0 [vxlan]
[   62.657083]  [<ffffffff8137b3b7>] ? account+0xc7/0x1f0
[   62.657083]  [<ffffffffa0021513>] vxlan_xmit+0xd3/0x400 [vxlan]
[   62.657083]  [<ffffffff8161390d>] dev_hard_start_xmit+0x49d/0x5e0
[   62.657083]  [<ffffffff81613d29>] dev_queue_xmit+0x2d9/0x480
[   62.657083]  [<ffffffff817cb854>] ? _raw_write_unlock_bh+0x14/0x20
[   62.657083]  [<ffffffff81630565>] ? eth_header+0x35/0xe0
[   62.657083]  [<ffffffff8161bc5e>] neigh_resolve_output+0x11e/0x1e0
[   62.657083]  [<ffffffff816ce0e0>] ? ip6_fragment+0xad0/0xad0
[   62.657083]  [<ffffffff816cb465>] ip6_finish_output2+0x2f5/0x470
[   62.657083]  [<ffffffff816ce166>] ip6_finish_output+0x86/0xc0
[   62.657083]  [<ffffffff816ce218>] ip6_output+0x78/0xb0
[   62.657083]  [<ffffffff816eadd6>] mld_sendpack+0x256/0x2a0
[   62.657083]  [<ffffffff816ebd8c>] mld_ifc_timer_expire+0x17c/0x290
[   62.657083]  [<ffffffff816ebc10>] ? igmp6_timer_handler+0x80/0x80
[   62.657083]  [<ffffffff816ebc10>] ? igmp6_timer_handler+0x80/0x80
[   62.657083]  [<ffffffff81051065>] call_timer_fn+0x45/0x150
[   62.657083]  [<ffffffff816ebc10>] ? igmp6_timer_handler+0x80/0x80
[   62.657083]  [<ffffffff81052353>] run_timer_softirq+0x1f3/0x2a0
[   62.657083]  [<ffffffff8102dfd8>] ? lapic_next_event+0x18/0x20
[   62.657083]  [<ffffffff8109e36f>] ? clockevents_program_event+0x6f/0x110
[   62.657083]  [<ffffffff8104a2f6>] __do_softirq+0xd6/0x2b0
[   62.657083]  [<ffffffff8104a75e>] irq_exit+0x7e/0xa0
[   62.657083]  [<ffffffff8102ea15>] smp_apic_timer_interrupt+0x45/0x60
[   62.657083]  [<ffffffff817d3eca>] apic_timer_interrupt+0x6a/0x70
[   62.657083]  <EOI>
[   62.657083]  [<ffffffff817d4a35>] ? sysenter_dispatch+0x7/0x1a
[   62.657083] Code: 4d 8b 85 a8 02 00 00 4c 89 e9 ba 03 04 00 00 48 c7 c6 c0 be 8d 81 48 c7 c7 48 35 a3 81 31 c0 e8 db 68 0e 00 49 8b 85 a8 02 00 00 <0f> b6 40 46 c0 e8 05 0f b6 c0 c1 e0 03 41 09 c4 e9 77 ff ff ff
[   62.657083] RIP  [<ffffffff816d822d>] ip6_route_output+0xbd/0xe0
[   62.657083]  RSP <ffff88002fd038f8>
[   62.657083] CR2: 0000000000000046
[   62.657083] ---[ end trace ba8a9583d7cd1934 ]---
[   62.657083] Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Fan Du <fan.du@windriver.com>
Reported-by: Ryan Whelan <rcwhelan@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03 20:36:00 -05:00
Daniel Borkmann
345010b5c4 net: vxlan: use custom ndo_change_mtu handler
When adding a new vxlan device to an "underlying carrier" (here:
dst->remote_ifindex), the MTU size assigned to the vxlan device
is the MTU at setup time of the carrier - needed headroom, when
adding a vxlan device w/o explicit carrier, then it defaults
to 1500.

In case of an explicit carrier that supports jumbo frames, we
currently cannot change vxlan MTU via ip(8) to > 1500 in
post-setup time, as vxlan driver uses eth_change_mtu() as default
method for manually setting MTU.

Hence, use a custom implementation that only falls back to
eth_change_mtu() in case we didn't use a dev parameter on device
setup time, and otherwise allow a max MTU setting of the carrier
incl. adjustment for headroom.

Reported-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-22 18:01:08 -05:00
David S. Miller
143c905494 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/i40e/i40e_main.c
	drivers/net/macvtap.c

Both minor merge hassles, simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:42:06 -05:00
Tom Herbert
3958afa1b2 net: Change skb_get_rxhash to skb_get_hash
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 16:36:21 -05:00
Gao feng
95ab09917a vxlan: leave multicast group when vxlan device down
vxlan_group_used only allows device to leave multicast group
when the remote_ip of this vxlan device is difference from
other vxlan devices' remote_ip. this will cause device not
leave multicast group untile the vn_sock of this vxlan deivce
being released.

The check in vxlan_group_used is not quite precise. since even
the remote_ip is same, but these vxlan devices may use different
lower devices, and they may use different vn_socks.

Only when some vxlan devices use the same vn_sock,same lower
device and same remote_ip, the mc_list of the vn_sock should
not be changed.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 14:21:26 -05:00
Gao feng
79d4a94fab vxlan: remove vxlan_group_used in vxlan_open
In vxlan_open, vxlan_group_used always returns true,
because the state of the vxlan deivces which we want
to open has alreay been running. and it has already
in vxlan_list.

Since ip_mc_join_group takes care of the reference
of struct ip_mc_list. removing vxlan_group_used here
is safe.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 14:21:26 -05:00
Fan Du
fffc15a501 vxlan: release rt when found circular route
Otherwise causing dst memory leakage.
Have Checked all other type tunnel device transmit implementation,
no such things happens anymore.

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:01:54 -05:00
Linus Torvalds
5e30025a31 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
 "The biggest changes:

   - add lockdep support for seqcount/seqlocks structures, this
     unearthed both bugs and required extra annotation.

   - move the various kernel locking primitives to the new
     kernel/locking/ directory"

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  block: Use u64_stats_init() to initialize seqcounts
  locking/lockdep: Mark __lockdep_count_forward_deps() as static
  lockdep/proc: Fix lock-time avg computation
  locking/doc: Update references to kernel/mutex.c
  ipv6: Fix possible ipv6 seqlock deadlock
  cpuset: Fix potential deadlock w/ set_mems_allowed
  seqcount: Add lockdep functionality to seqcount/seqlock structures
  net: Explicitly initialize u64_stats_sync structures for lockdep
  locking: Move the percpu-rwsem code to kernel/locking/
  locking: Move the lglocks code to kernel/locking/
  locking: Move the rwsem code to kernel/locking/
  locking: Move the rtmutex code to kernel/locking/
  locking: Move the semaphore core to kernel/locking/
  locking: Move the spinlock code to kernel/locking/
  locking: Move the lockdep code to kernel/locking/
  locking: Move the mutex code to kernel/locking/
  hung_task debugging: Add tracepoint to report the hang
  x86/locking/kconfig: Update paravirt spinlock Kconfig description
  lockstat: Report avg wait and hold times
  lockdep, x86/alternatives: Drop ancient lockdep fixup message
  ...
2013-11-14 16:30:30 +09:00
John Stultz
827da44c61 net: Explicitly initialize u64_stats_sync structures for lockdep
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.

The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.

This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.

Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.

Feedback would be appreciated!

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 12:40:25 +01:00
Duan Jiong
e50fddc8b0 vxlan: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
trivial patch converting ERR_PTR(PTR_ERR()) into ERR_CAST().
No functional changes.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 20:02:33 -05:00
Joseph Gasparakis
e6cd988c27 vxlan: Have the NIC drivers do less work for offloads
This patch removes the burden from the NIC drivers to check if the
vxlan driver is enabled in the kernel and also makes available
the vxlan headrooms to them.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-29 02:39:13 -07:00
Zhi Yong Wu
39deb2c7db vxlan: silence one build warning
drivers/net/vxlan.c: In function ‘vxlan_sock_add’:
drivers/net/vxlan.c:2298:11: warning: ‘sock’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/net/vxlan.c:2275:17: note: ‘sock’ was declared here
  LD      drivers/net/built-in.o

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-29 00:19:04 -04:00
David S. Miller
4fbef95af4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be.h
	drivers/net/usb/qmi_wwan.c
	drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
	include/net/netfilter/nf_conntrack_synproxy.h
	include/net/secure_seq.h

The conflicts are of two varieties:

1) Conflicts with Joe Perches's 'extern' removal from header file
   function declarations.  Usually it's an argument signature change
   or a function being added/removed.  The resolutions are trivial.

2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds
   a new value, another changes an existing value.  That sort of
   thing.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 17:06:14 -04:00
Eric W. Biederman
0bbf87d852 net ipv4: Convert ipv4.ip_local_port_range to be per netns v3
- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
  of the callers.
- Move the initialization of sysctl_local_ports into
   sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c

v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next

v3:
- Compile fixes of strange callers of inet_get_local_port_range.
  This patch now successfully passes an allmodconfig build.
  Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range

Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 21:59:38 -07:00
Pravin B Shelar
559835ea72 vxlan: Use RCU apis to access sk_user_data.
Use of RCU api makes vxlan code easier to understand.  It also
fixes bug due to missing ACCESS_ONCE() on sk_user_data dereference.
In rare case without ACCESS_ONCE() compiler might omit vs on
sk_user_data dereference.
Compiler can use vs as alias for sk->sk_user_data, resulting in
multiple sk_user_data dereference in rcu read context which
could change.

CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 14:22:59 -04:00
Sridhar Samudrala
2936b6ab45 vxlan: Avoid creating fdb entry with NULL destination
Commit afbd8bae9c
   vxlan: add implicit fdb entry for default destination
creates an implicit fdb entry for default destination. This results
in an invalid fdb entry if default destination is not specified.
For ex:
  ip link add vxlan1 type vxlan id 100
creates the following fdb entry
  00:00:00:00:00:00 dev vxlan1 dst 0.0.0.0 self permanent

This patch fixes this issue by creating an fdb entry only if a
valid default destination is specified.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-17 20:19:18 -04:00
Joseph Gasparakis
35e4237973 vxlan: Fix sparse warnings
This patch fixes sparse warnings when incorrectly handling the port number
and using int instead of unsigned int iterating through &vn->sock_list[].
Keeping the port as __be16 also makes things clearer wrt endianess.
Also, it was pointed out that vxlan_get_rx_port() had unnecessary checks
which got removed.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-15 22:18:13 -04:00
Joseph Gasparakis
53cf527513 vxlan: Notify drivers for listening UDP port changes
This patch adds two more ndo ops: ndo_add_rx_vxlan_port() and
ndo_del_rx_vxlan_port().

Drivers can get notifications through the above functions about changes
of the UDP listening port of VXLAN. Also, when physical ports come up,
now they can call vxlan_get_rx_port() in order to obtain the port number(s)
of the existing VXLAN interface in case they already up before them.

This information about the listening UDP port would be used for VXLAN
related offloads.

A big thank you to John Fastabend (john.r.fastabend@intel.com) for his
input and his suggestions on this patch set.

CC: John Fastabend <john.r.fastabend@intel.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-05 12:44:30 -04:00
Pravin B Shelar
430eda6d6d vxlan: Optimize vxlan rcv
vxlan-udp-recv function lookup vxlan_sock struct on every packet
recv by using udp-port number. we can use sk->sk_user_data to
store vxlan_sock and avoid lookup.
I have open coded rcu-api to store and read vxlan_sock from
sk_user_data to avoid sparse warning as sk_user_data is not
__rcu pointer.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 14:41:55 -04:00
Nicolas Dichtel
963a88b31d tunnels: harmonize cleanup done on skb on xmit path
The goal of this patch is to harmonize cleanup done on a skbuff on xmit path.
Before this patch, behaviors were different depending of the tunnel type.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:25 -04:00
Nicolas Dichtel
117961878c vxlan: remove net arg from vxlan[6]_xmit_skb()
This argument is not used, let's remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:25 -04:00
Nicolas Dichtel
8b7ed2d91d iptunnels: remove net arg from iptunnel_xmit()
This argument is not used, let's remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:25 -04:00
Joe Perches
7367d0b573 drivers/net: Convert uses of compare_ether_addr to ether_addr_equal
Use the new bool function ether_addr_equal to add
some clarity and reduce the likelihood for misuse
of compare_ether_addr for sorting.

Done via cocci script: (and a little typing)

$ cat compare_ether_addr.cocci
@@
expression a,b;
@@
-	!compare_ether_addr(a, b)
+	ether_addr_equal(a, b)

@@
expression a,b;
@@
-	compare_ether_addr(a, b)
+	!ether_addr_equal(a, b)

@@
expression a,b;
@@
-	!ether_addr_equal(a, b) == 0
+	ether_addr_equal(a, b)

@@
expression a,b;
@@
-	!ether_addr_equal(a, b) != 0
+	!ether_addr_equal(a, b)

@@
expression a,b;
@@
-	ether_addr_equal(a, b) == 0
+	!ether_addr_equal(a, b)

@@
expression a,b;
@@
-	ether_addr_equal(a, b) != 0
+	ether_addr_equal(a, b)

@@
expression a,b;
@@
-	!!ether_addr_equal(a, b)
+	ether_addr_equal(a, b)

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 22:28:04 -04:00
Cong Wang
660d98cae0 vxlan: include net/ip6_checksum.h for csum_ipv6_magic()
Fengguang reported a compile warning:

   drivers/net/vxlan.c: In function 'vxlan6_xmit_skb':
   drivers/net/vxlan.c:1352:3: error: implicit declaration of function 'csum_ipv6_magic' [-Werror=implicit-function-declaration]
   cc1: some warnings being treated as errors

this patch fixes it.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-02 21:00:49 -07:00
Cong Wang
8c1bb79fde vxlan: fix flowi6_proto value
It should be IPPROTO_UDP.

Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-02 21:00:47 -07:00
Cong Wang
f564f45c45 vxlan: add ipv6 proxy support
This patch adds the IPv6 version of "arp_reduce", ndisc_send_na()
will be needed.

Cc: David S. Miller <davem@davemloft.net>
Cc: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31 22:30:01 -04:00
Cong Wang
e15a00aafa vxlan: add ipv6 route short circuit support
route short circuit only has IPv4 part, this patch adds
the IPv6 part. nd_tbl will be needed.

Cc: David S. Miller <davem@davemloft.net>
Cc: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31 22:30:00 -04:00
Cong Wang
e4c7ed4153 vxlan: add ipv6 support
This patch adds IPv6 support to vxlan device, as the new version
RFC already mentions it:

   http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-03

Cc: David Stevens <dlstevens@us.ibm.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31 22:30:00 -04:00
Wei Yongjun
dcdc7a5929 vxlan: using kfree_rcu() to simplify the code
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 15:05:03 -07:00
Pravin B Shelar
1eaa81785a vxlan: Add tx-vlan offload support.
Following patch allows transmit side vlan offload for vxlan
devices.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 00:15:44 -07:00
Pravin B Shelar
649c5b8bdd vxlan: Improve vxlan headroom calculation.
Rather than having static headroom calculation, adjust headroom
according to target device.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 00:15:43 -07:00
Pravin B Shelar
49560532d7 vxlan: Factor out vxlan send api.
Following patch allows more code sharing between vxlan and ovs-vxlan.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 00:15:43 -07:00
Pravin B Shelar
012a5729ff vxlan: Extend vxlan handlers for openvswitch.
Following patch adds data field to vxlan socket and export
vxlan handler api.
vh->data is required to store private data per vxlan handler.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 00:15:43 -07:00
Pravin B Shelar
5cfccc5a47 vxlan: Add vxlan recv demux.
Once we have ovs-vxlan functionality, one UDP port can be assigned
to kernel-vxlan or ovs-vxlan port.  Therefore following patch adds
vxlan demux functionality, so that vxlan or ovs module can
register for particular port.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 00:15:43 -07:00
Pravin B Shelar
7ce0475827 vxlan: Restructure vxlan receive.
Use iptunnel_pull_header() for better code sharing.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 00:15:43 -07:00
Pravin B Shelar
9c2e24e16f vxlan: Restructure vxlan socket apis.
Restructure vxlan-socket management APIs so that it can be
shared between vxlan and ovs modules.
This patch does not change any functionality.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
v6-v7:
 - get rid of zero refcnt vs from hashtable.
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 00:15:43 -07:00
David S. Miller
2ff1cf12c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
Cong Wang
ffbe4a539f vxlan: fix a soft lockup in vxlan module removal
This is a regression introduced by:

	commit fe5c3561e6
	Author: stephen hemminger <stephen@networkplumber.org>
	Date:   Sat Jul 13 10:18:18 2013 -0700

	    vxlan: add necessary locking on device removal

The problem is that vxlan_dellink(), which is called with RTNL lock
held, tries to flush the workqueue synchronously, but apparently
igmp_join and igmp_leave work need to hold RTNL lock too, therefore we
have a soft lockup!

As suggested by Stephen, probably the flush_workqueue can just be
removed and let the normal refcounting work. The workqueue has a
reference to device and socket, therefore the cleanups should work
correctly.

Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Tested-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:41:45 -07:00
Cong Wang
614334df2d vxlan: fix a regression of igmp join
This is a regression introduced by:

	commit 3fc2de2fab
	Author: stephen hemminger <stephen@networkplumber.org>
	Date:   Thu Jul 18 08:40:15 2013 -0700

	    vxlan: fix igmp races

Before this commit, the old code was:

       if (vxlan_group_used(vn, vxlan->default_dst.remote_ip))
               ip_mc_join_group(sk, &mreq);
       else
               ip_mc_leave_group(sk, &mreq);

therefore we shoud check vxlan_group_used(), not its opposite,
for igmp_join.

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:41:45 -07:00
stephen hemminger
5ca5461c3e vxlan: fix rcu related warning
Vxlan remote list is protected by RCU and guaranteed to be non-empty.
Split out the rcu and non-rcu access to the list to fix warning

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-04 18:47:14 -07:00
David S. Miller
0e76a3a587 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 21:36:46 -07:00
Thomas Richter
906dc186e0 vxlan fdb replace an existing entry
Add support to replace an existing entry found in the
vxlan fdb database. The entry in question is identified
by its unicast mac address and the destination information
is changed. If the entry is not found, it is added in the
forwarding database. This is similar to changing an entry
in the neighbour table.

Multicast mac addresses can not be changed with the replace
option.

This is useful for virtual machine migration when the
destination of a target virtual machine changes. The replace
feature can be used instead of delete followed by add.

Resubmitted because net-next was closed last week.

Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 16:36:03 -07:00
stephen hemminger
3fc2de2fab vxlan: fix igmp races
There are two race conditions in existing code for doing IGMP
management in workqueue in vxlan. First, the vxlan_group_used
function checks the list of vxlan's without any protection, and
it is possible for open followed by close to occur before the
igmp work queue runs.

To solve these move the check into vxlan_open/stop so it is
protected by RTNL. And split into two work structures so that
there is no racy reference to underlying device state.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-19 17:07:25 -07:00
stephen hemminger
372675a4a9 vxlan: unregister on namespace exit
Fix memory leaks and other badness from VXLAN network namespace
teardown. When network namespace is removed, all the vxlan devices should
be unregistered (not closed).

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-19 17:07:25 -07:00
stephen hemminger
fe5c3561e6 vxlan: add necessary locking on device removal
The socket management is now done in workqueue (outside of RTNL)
and protected by vn->sock_lock. There were two possible bugs, first
the vxlan device was removed from the VNI hash table per socket without
holding lock. And there was a race when device is created and the workqueue
could run after deletion.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-17 12:51:19 -07:00
Pravin B Shelar
f89e57c4f5 vxlan: Fix kernel crash on rmmod.
vxlan exit module unregisters vxlan net and then it unregisters
rtnl ops which triggers vxlan_dellink() from __rtnl_kill_links().
vxlan_dellink() deletes vxlan-dev from vxlan_list which has
list-head in vxlan-net-struct but that is already gone due to
net-unregister. That is how we are getting following crash.

Following commit fixes the crash by fixing module exit path.

BUG: unable to handle kernel paging request at ffff8804102c8000
IP: [<ffffffff812cc5e9>] __list_del_entry+0x29/0xd0
PGD 2972067 PUD 83e019067 PMD 83df97067 PTE 80000004102c8060
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: ---
CPU: 19 PID: 6712 Comm: rmmod Tainted: GF            3.10.0+ #95
Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 1.4.8 10/25/2012
task: ffff88080c47c580 ti: ffff88080ac50000 task.ti: ffff88080ac50000
RIP: 0010:[<ffffffff812cc5e9>]  [<ffffffff812cc5e9>]
__list_del_entry+0x29/0xd0
RSP: 0018:ffff88080ac51e08  EFLAGS: 00010206
RAX: ffff8804102c8000 RBX: ffff88040f0d4b10 RCX: dead000000200200
RDX: ffff8804102c8000 RSI: ffff88080ac51e58 RDI: ffff88040f0d4b10
RBP: ffff88080ac51e08 R08: 0000000000000001 R09: 2222222222222222
R10: 2222222222222222 R11: 2222222222222222 R12: ffff88080ac51e58
R13: ffffffffa07b8840 R14: ffffffff81ae48c0 R15: ffff88080ac51e58
FS:  00007f9ef105c700(0000) GS:ffff88082a800000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8804102c8000 CR3: 00000008227e5000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff88080ac51e28 ffffffff812cc6a1 2222222222222222 ffff88040f0d4000
 ffff88080ac51e48 ffffffffa07b3311 ffff88040f0d4000 ffffffff81ae49c8
 ffff88080ac51e98 ffffffff81492fc2 ffff88080ac51e58 ffff88080ac51e58
Call Trace:
 [<ffffffff812cc6a1>] list_del+0x11/0x40
 [<ffffffffa07b3311>] vxlan_dellink+0x51/0x70 [vxlan]
 [<ffffffff81492fc2>] __rtnl_link_unregister+0xa2/0xb0
 [<ffffffff8149448e>] rtnl_link_unregister+0x1e/0x30
 [<ffffffffa07b7b7c>] vxlan_cleanup_module+0x1c/0x2f [vxlan]
 [<ffffffff810c9b31>] SyS_delete_module+0x1d1/0x2c0
 [<ffffffff812b8a0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff81582f42>] system_call_fastpath+0x16/0x1b
Code: eb 9f 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89
e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b
00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
RIP  [<ffffffff812cc5e9>] __list_del_entry+0x29/0xd0
 RSP <ffff88080ac51e08>
CR2: ffff8804102c8000

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11 11:45:36 -07:00
Stephen Hemminger
ba609e9bf1 vxlan: fix function name spelling
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25 17:06:01 -07:00
Mike Rapoport
58e4c76704 vxlan: fdb: allow specifying multiple destinations for zero MAC
The zero MAC entry in the fdb is used as default destination. With
multiple default destinations it is possible to use vxlan in
environments that disable multicast on the infrastructure level, e.g.
public clouds.

Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25 09:31:40 -07:00
Mike Rapoport
bc7892ba39 vxlan: allow removal of single destination from fdb entry
When the last item is deleted from the remote destinations list, the
fdb entry is destroyed.

Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25 09:31:38 -07:00
Mike Rapoport
f0b074be7b vxlan: introduce vxlan_fdb_parse
which will be reused by vxlan_fdb_delete

Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25 09:31:37 -07:00
Mike Rapoport
a5e7c10a7e vxlan: introduce vxlan_fdb_find_rdst
which will be reused by vxlan_fdb_delete

Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25 09:31:36 -07:00
Mike Rapoport
afbd8bae9c vxlan: add implicit fdb entry for default destination
Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25 09:31:35 -07:00
Pravin B Shelar
60d9d4c6db vxlan: Fix sparse warnings.
Fix following sparse warnings.
drivers/net/vxlan.c:238:44: warning: incorrect type in argument 3 (different base types)
drivers/net/vxlan.c:238:44:    expected restricted __be32 [usertype] value
drivers/net/vxlan.c:238:44:    got unsigned int const [unsigned] [usertype] remote_vni
drivers/net/vxlan.c:1735:18: warning: incorrect type in initializer (different signedness)
drivers/net/vxlan.c:1735:18:    expected int *id
drivers/net/vxlan.c:1735:18:    got unsigned int static [toplevel] *<noident>

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25 09:30:42 -07:00
Stephen Hemminger
234f5b7379 vxlan: cosmetic cleanup's
Fix whitespace and spelling

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
2013-06-24 08:40:33 -07:00
Stephen Hemminger
bb3fd6878a vxlan: Use initializer for dummy structures
For the notification code, a couple of places build fdb entries on
the stack, use structure initialization instead and fix formatting.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:33 -07:00
Stephen Hemminger
9daaa397b3 vxlan: port module param should be ushort
UDP ports are limited to 16 bits.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:33 -07:00
Stephen Hemminger
3e61aa8f0a vxlan: convert remotes list to list_rcu
Based on initial work by Mike Rapoport <mike.rapoport@ravellosystems.com>
Use list macros and RCU for tracking multiple remotes.

Note: this code assumes list always has at least one entry,
because delete is not supported.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:32 -07:00
Stephen Hemminger
4ad169300a vxlan: make vxlan_xmit_one void
The function vxlan_xmit_one always returns NETDEV_TX_OK, so there
is no point in keeping track of return values etc.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
2013-06-24 08:40:32 -07:00
Stephen Hemminger
ebf4063e86 vxlan: move cleanup to uninit
Put destruction of per-cpu statistics removal in
ndo_uninit since it is created by ndo_init.
This also avoids any problems that might be cause by destructor
being called after module removed.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:32 -07:00
Stephen Hemminger
1c51a9159d vxlan: fix race caused by dropping rtnl_unlock
It is possible for two cpu's to race creating vxlan device.
For most cases this is harmless, but the ability to assign "next
avaliable vxlan device" relies on rtnl lock being held across the
whole operation. Therfore two instances of calling:
  ip li add vxlan%d vxlan ...
could collide and create two devices with same name.

To fix this defer creation of socket to a work queue, and
handle possible races there. Introduce a lock to ensure that
changes to vxlan socket hash list is SMP safe.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:32 -07:00
Stephen Hemminger
8385f50a03 vxlan: send notification when MAC migrates
When learned entry migrates to another IP send a notification
that entry has changed.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:32 -07:00
Stephen Hemminger
7c47cedf43 vxlan: move IGMP join/leave to work queue
Do join/leave from work queue to avoid lock inversion problems
between normal socket and RTNL. The code comes out cleaner
as well.

Uses Cong Wang's suggestion to turn refcnt into a real atomic
since now need to handle case where last use of socket is IGMP
worker.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:32 -07:00
Stephen Hemminger
758c57d16a vxlan: fix crash from work pending on module removal
Switch to using a per module work queue so that all the socket
deletion callbacks are done when module is removed.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:32 -07:00
Stephen Hemminger
b715398407 vxlan: fix out of order operation on module removal
If vxlan is removed with active vxlan's it would crash because
rtnl_link_unregister (which calls vxlan_dellink), was invoked
before unregister_pernet_device (which calls vxlan_stop).

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-24 08:40:32 -07:00
Pravin B Shelar
0e6fbc5b6c ip_tunnels: extend iptunnel_xmit()
Refactor various ip tunnels xmit functions and extend iptunnel_xmit()
so that there is more code sharing.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:07:41 -07:00
David S. Miller
d98cae64e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/Kconfig
	drivers/net/xen-netback/netback.c
	net/batman-adv/bat_iv_ogm.c
	net/wireless/nl80211.c

The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.

The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().

Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.

The nl80211 conflict is a little more involved.  In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported.  Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.

However, the dump handlers to not use this logic.  Instead they have
to explicitly do the locking.  There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so.  So I fixed that whilst handling the overlapping changes.

To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 16:49:39 -07:00
stephen hemminger
eb064c3b49 vxlan: fix check for migration of static entry
The check introduced by:
	commit 26a41ae604
	Author: stephen hemminger <stephen@networkplumber.org>
	Date:   Mon Jun 17 12:09:58 2013 -0700

	    vxlan: only migrate dynamic FDB entries

was not correct because it is checking flag about type of FDB
entry, rather than the state (dynamic versus static). The confusion
arises because vxlan is reusing values from bridge, and bridge is
reusing values from neighbour table, and easy to get lost in translation.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 00:50:58 -07:00
stephen hemminger
7aa2723841 vxlan: handle skb_clone failure
If skb_clone fails if out of memory then just skip the fanout.

Problem was introduced in 3.10 with:
  commit 6681712d67
  Author: David Stevens <dlstevens@us.ibm.com>
  Date:   Fri Mar 15 04:35:51 2013 +0000

    vxlan: generalize forwarding tables

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:55:47 -07:00
stephen hemminger
26a41ae604 vxlan: only migrate dynamic FDB entries
Only migrate dynamic forwarding table entries, don't modify
static entries. If packet received from incorrect source IP address
assume it is an imposter and drop it.

This patch applies only to -net, a different patch would be needed for earlier
kernels since the NTF_SELF flag was introduced with 3.10.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:55:46 -07:00
stephen hemminger
3bf74b1aec vxlan: fix race between flush and incoming learning
It is possible for a packet to arrive during vxlan_stop(), and
have a dynamic entry created. Close this by checking if device
is up.

 CPU1                             CPU2
vxlan_stop
  vxlan_flush
     hash_lock acquired
                                  vxlan_encap_recv
                                     vxlan_snoop
                                        waiting for hash_lock
     hash_lock relased
  vxlan_flush done
                                        hash_lock acquired
                                        vxlan_fdb_create

This is a day-one bug in vxlan goes back to 3.7.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:55:46 -07:00
Cong Wang
7332a13b03 vxlan: defer vxlan init as late as possible
When vxlan is compiled as builtin, its init code
runs before IPv6 init, this could cause problems
if we create IPv6 socket in the latter patch.

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 23:53:52 -07:00
Cong Wang
31fec5aa21 vxlan: use unsigned int instead of unsigned
'unsigned int' is slightly better.

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 23:53:52 -07:00
Cong Wang
784e4616a4 vxlan: remove the unused rcu head from struct vxlan_rdst
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 23:53:52 -07:00
David S. Miller
e6ff4c75f9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge net into net-next because some upcoming net-next changes
build on top of bug fixes that went into net.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-24 16:48:28 -07:00
Sridhar Samudrala
014be2c8ea vxlan: Update vxlan fdb 'used' field after each usage
Fix some instances where vxlan fdb 'used' field is not updated after the entry
is used.

v2: rename vxlan_find_mac() as __vxlan_find_mac() and create a new vxlan_find_mac()
that also updates ->used field.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-18 12:53:39 -07:00
stephen hemminger
553675fb5e vxlan: listen on multiple ports
The commit 823aa873bc
  Author: stephen hemminger <stephen@networkplumber.org>
  Date:   Sat Apr 27 11:31:57 2013 +0000

    vxlan: allow choosing destination port per vxlan

introduced per-vxlan UDP port configuration but only did half of the
necessary work.  It added per vxlan destination for sending, but
overlooked the handling of multiple ports for incoming traffic.

This patch changes the listening port management to handle multiple
incoming UDP ports. The earlier per-namespace structure is now a hash
list per namespace.

It is also now possible to define the same virtual network id
but with different UDP port values which can be useful for migration.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-17 14:06:29 -07:00
Dmitry Kravkov
f6ace502b8 vxlan: do not set SKB_GSO_UDP
Since SKB_GSO_* flags are set by appropriate gso_segment callback
in TCP/UDP layer.

CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 15:27:47 -04:00
stephen hemminger
823aa873bc vxlan: allow choosing destination port per vxlan
Allow configuring the default destination port on a per-device basis.
Adds new netlink paramater IFLA_VXLAN_PORT to allow setting destination
port when creating new vxlan.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 11:53:12 -04:00
stephen hemminger
7d836a7679 vxlan: compute source port in network byte order
Rather than computing source port and returning it in host order
then swapping later, go ahead and compute it in network order to
start with. Cleaner and less error prone.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 11:53:12 -04:00
stephen hemminger
5d174dd80c vxlan: source compatiablity with IFLA_VXLAN_GROUP (v2)
Source compatiability for build iproute2 was broken by:
  commit c7995c43fa
  Author: Atzm Watanabe <atzm@stratosphere.co.jp>
    vxlan: Allow setting destination to unicast address.

Since this commit has not made it upstream (still net-next),
and better to avoid gratitious changes to exported API's;
go back to original definition, and add a comment.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 11:53:12 -04:00
stephen hemminger
73cf331706 vxlan: fix byte order issues with NDA_PORT
The NDA_PORT attribute was added, but the author wasn't careful
about width (port is 16 bits), or byte order.  The attribute was
being dumped as 16 bits, but only 32 bit value would be accepted
when setting up a device. Also, the remote port is in network
byte order and was being compared with default port in host byte
order.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 11:53:12 -04:00
stephen hemminger
23c578bf7d vxlan: document UDP default port
The default port for VXLAN is not same as IANA value.
Document this.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 11:53:12 -04:00
stephen hemminger
3b8df3c6b1 vxlan: update mail address and copyright date
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 11:53:12 -04:00
David Stevens
ae88408256 VXLAN: Allow L2 redirection with L3 switching
Allow L2 redirection when VXLAN L3 switching is enabled

This patch restricts L3 switching to destination MAC addresses that are
marked as routers in order to allow virtual IP appliances that do L2
redirection to function with VXLAN L3 switching enabled.

We use L3 switching on VXLAN networks to avoid extra hops when the nominal
router for cross-subnet traffic for a VM is remote and the ultimate
destination may be local, or closer to the local node. Currently, the
destination IP address takes precedence over the MAC address in all cases.
Some network appliances receive packets for a virtualized IP address and
redirect by changing the destination MAC address (only) to be the final
destination for packet processing. VXLAN tunnel endpoints with L3 switching
enabled may then overwrite this destination MAC address based on the packet IP
address, resulting in potential loops and, at least, breaking L2 redirections
that travel through tunnel endpoints.

This patch limits L3 switching to the intended case where the original
destination MAC address is a next-hop router and relies on the destination
MAC address for all other cases, thus allowing L2 redirection and L3 switching
to coexist peacefully.

Signed-Off-By: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22 16:19:51 -04:00
Atzm Watanabe
c7995c43fa vxlan: Allow setting destination to unicast address.
This patch allows setting VXLAN destination to unicast address.
It allows that VXLAN can be used as peer-to-peer tunnel without
multicast.

v4: generalize struct vxlan_dev, "gaddr" is replaced with vxlan_rdst.
    "GROUP" attribute is replaced with "REMOTE".
    they are based by David Stevens's comments.

v3: move a new attribute REMOTE into the last of an enum list
    based by Stephen Hemminger's comments.

v2: use a new attribute REMOTE instead of GROUP based by
    Cong Wang's comments.

Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-16 16:43:35 -04:00
Mike Rapoport
ab09a6d0d3 vxlan: don't bypass encapsulation for multi- and broadcasts
The multicast and broadcast packets may have RTCF_LOCAL set in rt_flags
and therefore will be sent out bypassing encapsulation. This breaks
delivery of packets sent to the vxlan multicast group.
Disabling encapsulation bypass for multicasts and broadcasts fixes the
issue.

Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Tested-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Tested-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-15 14:06:39 -04:00
Mike Rapoport
9d9f163c82 vxlan: use htonl when snooping for loopback address
Currently "bridge fdb show dev vxlan0" lists loopback address as
"1.0.0.127". Using htonl(INADDR_LOOPBACK) rather than passing it
directly to vxlan_snoop fixes the problem.

Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-14 15:41:49 -04:00
Wei Yongjun
6706c82e39 vxlan: fix some sparse warnings
Fixes following warning:
drivers/net/vxlan.c:406:6: warning: symbol 'vxlan_fdb_free' was not declared. Should it be static?
drivers/net/vxlan.c:1111:37: warning: Using plain integer as NULL pointer

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-12 15:01:27 -04:00
Sridhar Samudrala
9dcc71e1fd vxlan: Bypass encapsulation if the destination is local
This patch bypasses vxlan encapsulation if the destination vxlan
endpoint is a local device.

Changes since v1: added missing check for vxlan_find_vni() failure

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07 16:58:13 -04:00
Pravin B Shelar
5abb0029c8 VXLAN: Fix sparse warnings.
Fixes following warning:-
drivers/net/vxlan.c:471:35: warning: symbol 'dev' shadows an earlier one
drivers/net/vxlan.c:433:26: originally declared here
drivers/net/vxlan.c:794:34: warning: symbol 'vxlan' shadows an earlier one
drivers/net/vxlan.c:757:26: originally declared here

CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-27 00:52:06 -04:00
Pravin B Shelar
206aaafcd2 VXLAN: Use IP Tunnels tunnel ENC encap API
Use common ecn_encap functions from ip_tunnel module.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:27:18 -04:00
Pravin B Shelar
e817104525 VXLAN: Fix vxlan stats handling.
Fixes bug in VXLAN code where is iptunnel_xmit() called with NULL
dev->tstats.
This bug was introduced in commit 6aed0c8bf7 (tunnel: use
iptunnel_xmit() again).

Following patch fixes bug by setting dev->tstats. It uses ip_tunnel
module code to share stats function.

CC: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:27:18 -04:00
Pravin B Shelar
c544193214 GRE: Refactor GRE tunneling code.
Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.

ip_tunnel module contains following components:
 - packet xmit and rcv generic code. xmit flow looks like
   (gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
 - hash table of all devices.
 - lookup for tunnel devices.
 - control plane operations like device create, destroy, ioctl, netlink
   operations code.
 - registration for tunneling modules, like gre, ipip etc.
 - define single pcpu_tstats dev->tstats.
 - struct tnl_ptk_info added to pass parsed tunnel packet parameters.

ipip.h header is renamed to ip_tunnel.h

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:27:18 -04:00
David Stevens
6681712d67 vxlan: generalize forwarding tables
This patch generalizes VXLAN forwarding table entries allowing an administrator
to:
	1) specify multiple destinations for a given MAC
	2) specify alternate vni's in the VXLAN header
	3) specify alternate destination UDP ports
	4) use multicast MAC addresses as fdb lookup keys
	5) specify multicast destinations
	6) specify the outgoing interface for forwarded packets

The combination allows configuration of more complex topologies using VXLAN
encapsulation.

Changes since v1: rebase to 3.9.0-rc2

Signed-Off-By: David L Stevens <dlstevens@us.ibm.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-17 12:23:46 -04:00
David S. Miller
e5f2ef7ab4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/netdev.c

Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 05:52:22 -04:00
Cong Wang
6aed0c8bf7 tunnel: use iptunnel_xmit() again
With recent patches from Pravin, most tunnels can't use iptunnel_xmit()
any more, due to ip_select_ident() and skb->ip_summed. But we can just
move these operations out of iptunnel_xmit(), so that tunnels can
use it again.

This by the way fixes a bug in vxlan (missing nf_reset()) for net-next.

Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-10 03:05:44 -04:00
Pravin B Shelar
05c0db08ab VXLAN: Use UDP Tunnel segmention.
Enable TSO for VXLAN devices and use UDP_TUNNEL to offload vxlan
segmentation.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-09 16:09:17 -05:00
Zang MingJie
9cb6cb7ed1 vxlan: fix oops when delete netns containing vxlan
The following script will produce a kernel oops:

    sudo ip netns add v
    sudo ip netns exec v ip ad add 127.0.0.1/8 dev lo
    sudo ip netns exec v ip link set lo up
    sudo ip netns exec v ip ro add 224.0.0.0/4 dev lo
    sudo ip netns exec v ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev lo
    sudo ip netns exec v ip link set vxlan0 up
    sudo ip netns del v

where inspect by gdb:

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 107]
    0xffffffffa0289e33 in ?? ()
    (gdb) bt
    #0  vxlan_leave_group (dev=0xffff88001bafa000) at drivers/net/vxlan.c:533
    #1  vxlan_stop (dev=0xffff88001bafa000) at drivers/net/vxlan.c:1087
    #2  0xffffffff812cc498 in __dev_close_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:1299
    #3  0xffffffff812cd920 in dev_close_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:1335
    #4  0xffffffff812cef31 in rollback_registered_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:4851
    #5  0xffffffff812cf040 in unregister_netdevice_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:5752
    #6  0xffffffff812cf1ba in default_device_exit_batch (net_list=0xffff88001f2e7e18) at net/core/dev.c:6170
    #7  0xffffffff812cab27 in cleanup_net (work=<optimized out>) at net/core/net_namespace.c:302
    #8  0xffffffff810540ef in process_one_work (worker=0xffff88001ba9ed40, work=0xffffffff8167d020) at kernel/workqueue.c:2157
    #9  0xffffffff810549d0 in worker_thread (__worker=__worker@entry=0xffff88001ba9ed40) at kernel/workqueue.c:2276
    #10 0xffffffff8105870c in kthread (_create=0xffff88001f2e5d68) at kernel/kthread.c:168
    #11 <signal handler called>
    #12 0x0000000000000000 in ?? ()
    #13 0x0000000000000000 in ?? ()
    (gdb) fr 0
    #0  vxlan_leave_group (dev=0xffff88001bafa000) at drivers/net/vxlan.c:533
    533		struct sock *sk = vn->sock->sk;
    (gdb) l
    528	static int vxlan_leave_group(struct net_device *dev)
    529	{
    530		struct vxlan_dev *vxlan = netdev_priv(dev);
    531		struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
    532		int err = 0;
    533		struct sock *sk = vn->sock->sk;
    534		struct ip_mreqn mreq = {
    535			.imr_multiaddr.s_addr	= vxlan->gaddr,
    536			.imr_ifindex		= vxlan->link,
    537		};
    (gdb) p vn->sock
    $4 = (struct socket *) 0x0

The kernel calls `vxlan_exit_net` when deleting the netns before shutting down
vxlan interfaces. Later the removal of all vxlan interfaces, where `vn->sock`
is already gone causes the oops. so we should manually shutdown all interfaces
before deleting `vn->sock` as the patch does.

Signed-off-by: Zang MingJie <zealot0630@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-07 16:12:51 -05:00
Zang MingJie
88c4c066c6 reset nf before xmit vxlan encapsulated packet
We should reset nf settings bond to the skb as ipip/ipgre do.

If not, the conntrack/nat info bond to the origin packet may continually
redirect the packet to vxlan interface causing a routing loop.

this is the scenario:

     VETP     VXLAN Gateway
    /----\  /---------------\
    |    |  |               |
    |  vx+--+vx --NAT-> eth0+--> Internet
    |    |  |               |
    \----/  \---------------/

when there are any packet coming from internet to the vetp, there will be lots
of garbage packets coming out the gateway's vxlan interface, but none actually
sent to the physical interface, because they are redirected back to the vxlan
interface in the postrouting chain of NAT rule, and dmesg complains:

    Mar  1 21:52:53 debian kernel: [ 8802.997699] Dead loop on virtual device vxlan0, fix it urgently!
    Mar  1 21:52:54 debian kernel: [ 8804.004907] Dead loop on virtual device vxlan0, fix it urgently!
    Mar  1 21:52:55 debian kernel: [ 8805.012189] Dead loop on virtual device vxlan0, fix it urgently!
    Mar  1 21:52:56 debian kernel: [ 8806.020593] Dead loop on virtual device vxlan0, fix it urgently!

the patch should fix the problem

Signed-off-by: Zang MingJie <zealot0630@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-06 02:47:05 -05:00
Sasha Levin
b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Pravin B Shelar
8dc98eb2e8 VXLAN: Use tunnel_ip_select_ident() for tunnel IP-Identification.
tunnel_ip_select_ident() is more efficient when generating ip-header
id given inner packet is of ipv4 type.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-25 15:47:41 -05:00
Vlad Yasevich
1690be63a2 bridge: Add vlan support to static neighbors
When a user adds bridge neighbors, allow him to specify VLAN id.
If the VLAN id is not specified, the neighbor will be added
for VLANs currently in the ports filter list.  If no VLANs are
configured on the port, we use vlan 0 and only add 1 entry.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-13 19:42:16 -05:00
Yan Burman
1b13c97fae net/vxlan: Add ethtool drvinfo
Implement ethtool get_drvinfo.

Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-30 22:47:22 -05:00
stephen hemminger
6602d00789 vxlan: allow live mac address change
The VXLAN pseudo-device doesn't care if the mac address changes
when device is up.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-03 01:58:13 -08:00
Yan Burman
af9b078e35 net/vxlan: Use the underlying device index when joining/leaving multicast groups
The socket calls from vxlan to join/leave multicast group aren't
using the index of the underlying device, as a result the stack uses
the first interface that is up. This results in vxlan being non functional
over a device which isn't the 1st to be up.
Fix this by providing the iflink field to the vxlan instance
to the multicast calls.

Signed-off-by: Yan Burman <yanb@mellanox.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-26 15:09:55 -08:00
Joseph Gasparakis
0afb1666fe vxlan: Add capability of Rx checksum offload for inner packet
This patch adds capability in vxlan to identify received
checksummed inner packets and signal them to the upper layers of
the stack. The driver needs to set the skb->encapsulation bit
and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-09 00:20:28 -05:00
Joseph Gasparakis
d6727fe385 vxlan: capture inner headers during encapsulation
Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
The advantage to these two changes is that it also allows the VXLAN to
make use of GSO.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-09 00:20:28 -05:00
David Stevens
e4f67addf1 add DOVE extensions for VXLAN
This patch provides extensions to VXLAN for supporting Distributed
Overlay Virtual Ethernet (DOVE) networks. The patch includes:

	+ a dove flag per VXLAN device to enable DOVE extensions
	+ ARP reduction, whereby a bridge-connected VXLAN tunnel endpoint
		answers ARP requests from the local bridge on behalf of
		remote DOVE clients
	+ route short-circuiting (aka L3 switching). Known destination IP
		addresses use the corresponding destination MAC address for
		switching rather than going to a (possibly remote) router first.
	+ netlink notification messages for forwarding table and L3 switching
		misses

Changes since v2
	- combined bools into "u32 flags"
	- replaced loop with !is_zero_ether_addr()

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-20 13:41:28 -05:00
Rami Rosen
e8e55d9514 vxlan: remove unused variable.
This patch removes addrexceeded member from vxlan_dev struct as it is unused.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-17 22:02:19 -05:00
David S. Miller
67f4efdce7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor line offset auto-merges.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-17 22:00:43 -05:00
Amerigo Wang
aa0010f880 net: convert __IPTUNNEL_XMIT() to an inline function
__IPTUNNEL_XMIT() is an ugly macro, convert it to a static
inline function, so make it more readable.

IPTUNNEL_XMIT() is unused, just remove it.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 18:49:50 -05:00
Alexander Duyck
1ba56fb45a vxlan: Update hard_header_len based on lowerdev when instantiating VXLAN
In the event of a VXLAN device being linked to a device that has a
hard_header_len greater than that of standard ethernet we could end up with
the hard_header_len not being large enough for outgoing frames.  In order to
prevent this we should update the length when a lowerdev is provided.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-13 18:19:50 -05:00
Rami Rosen
eb5ce43997 vxlan: fix a typo.
Use eXtensible and not eXtensiable in the comment on top.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-13 17:14:06 -05:00
Alexander Duyck
52b702ffa5 vxlan: Fix error that was resulting in VXLAN MTU size being 10 bytes too large
This change fixes an issue I found where VXLAN frames were fragmented when
they were up to the VXLAN MTU size.  I root caused the issue to the fact that
the headroom was 4 + 20 + 8 + 8.  This math doesn't appear to be correct
because we are not inserting a VLAN header, but instead a 2nd Ethernet header.
As such the math for the overhead should be 20 + 8 + 8 + 14 to account for the
extra headers that are inserted for VXLAN.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-13 14:36:50 -05:00
David S. Miller
d4185bbf62 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10 18:32:51 -05:00
Vincent Bernat
afb97186f5 vxlan: allow a user to set TTL value
"ip link add ... type vxlan ... ttl X" allows a user to set the TTL
used by a VXLAN for encapsulation. The provided value was ignored by
vxlan module and the default value of 1 was used when encapsulating
multicast packets.

Signed-off-by: Vincent Bernat <bernat@luffy.cx>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-03 15:24:01 -04:00
stephen hemminger
3c172868cb vxlan: don't expire permanent entries
VXLAN confused flag versus bitmap on state.
Based on part of a earlier patch by David Stevens.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-31 13:42:12 -04:00
stephen hemminger
34e02aa1fb vxlan: fix oops when give unknown ifindex
If vxlan is created and the ifindex is passed; there are two cases which
are incorrectly handled by the existing code. The ifindex could be zero
(i.e. no device) or there could be no device with that ifindex.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:22 -04:00
stephen hemminger
d97c00a321 vxlan: fix receive checksum handling
Vxlan was trying to use postpull_rcsum to allow receive checksum
offload to work on drivers using CHECKSUM_COMPLETE method. But this
doesn't work correctly. Just force full receive checksum on received
packet.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:22 -04:00
stephen hemminger
2840bf2286 vxlan: add additional headroom
Tell upper layer protocols to allocate skb with additional headroom.
This avoids allocation and copy in local packet sends.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:22 -04:00
stephen hemminger
05f47d69c4 vxlan: allow configuring port range
VXLAN bases source UDP port based on flow to help the
receiver to be able to load balance based on outer header flow.

This patch restricts the port range to the normal UDP local
ports, and allows overriding via configuration.

It also uses jhash of Ethernet header when looking at flows
with out know L3 header.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:21 -04:00
stephen hemminger
1cad87156b vxlan: associate with tunnel socket on transmit
When tunnelling a skb, associate it with the tunnel socket.
This allows parameters set on tunnel socket (like multicast loop
flag), to be picked up by ip_output.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:21 -04:00
stephen hemminger
ca78f18129 vxlan: use ip_route_output
Select source address for VXLAN packet based on route destination
and don't lie to route code. VXLAN is not GRE.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:21 -04:00
stephen hemminger
321fb99139 vxlan: fix byte order in hash function
Shift was wrong direction causing packets to hash based on
other parts of the ethernet header, not the address.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:21 -04:00
stephen hemminger
ef59febe3b vxlan: minor output refactoring
Move code to find destination to a small function.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:41:21 -04:00
Stephen Hemminger
7c41c42c5d vxlan: fix more sparse warnings
Fix a couple harmless sparse warnings reported by Fengguang Wu.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08 17:57:57 -04:00
Wei Yongjun
d717f14ee0 vxlan: remove unused including <linux/version.h>
Remove including <linux/version.h> that don't need it.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-07 14:50:38 -04:00
stephen hemminger
bfe1b9b16e vxlan: put UDP socket in correct namespace
Move vxlan UDP socket to correct network namespace

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-02 14:38:04 -04:00
stephen hemminger
d342894c5d vxlan: virtual extensible lan
This is an implementation of Virtual eXtensible Local Area Network
as described in draft RFC:
  http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02

The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
that learns MAC to IP address mapping.

This implementation has not been tested only against the Linux
userspace implementation using TAP, not against other vendor's
equipment.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 18:39:45 -04:00