tx_table is part of the private data of kernel net_device. It is only
zero-ed out when allocating net_device.
We may recreate netvsc_device w/o recreating net_device, so the private
netdev data, including tx_table, are not zeroed. It may contain channel
numbers for the older netvsc_device.
This patch adds initialization of tx_table each time we recreate
netvsc_device.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename this variable because it is the Receive indirection
table.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch supports the options to switch TCP hash level between
L3 and L4 by ethtool command. TCP over IPv4 and v6 can be set
differently. The default hash level is L4. We currently only
allow switching TX hash level from within the guests.
For example, for TCP over IPv4 on eth0:
To include TCP port numbers in hashing:
ethtool -N eth0 rx-flow-hash tcp4 sdfn
To exclude TCP port numbers in hashing:
ethtool -N eth0 rx-flow-hash tcp4 sd
To show TCP hash level:
ethtool -n eth0 rx-flow-hash tcp4
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This simplifies the logic and make it easier to add more
options.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report the numbers of events for stop_queue and wake_queue in
ethtool stats.
Example:
ethtool -S eth0
NIC statistics:
...
stop_queue: 7
wake_queue: 7
...
Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For older hosts without multi-channel (vRSS) support, and some error
cases, we still need to set the real number of queues to one.
This patch adds this missing setting.
Fixes: 8195b1396e ("hv_netvsc: fix deadlock on hotplug")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't populate const array ver_list on the stack, instead make it
static. Makes the object code smaller by over 400 bytes:
Before:
text data bss dec hex filename
18444 3168 320 21932 55ac drivers/net/hyperv/netvsc.o
After:
text data bss dec hex filename
17950 3224 320 21494 53f6 drivers/net/hyperv/netvsc.o
(gcc 6.3.0, x86-64)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If MTU is changed the host would reject the send buffer change.
This problem is result of recent change to allow changing send
buffer size.
Every time we change the MTU, we store the previous net_device section
count before destroying the buffer, but we don’t store the previous
section size. When we reinitialize the buffer, its size is calculated
by multiplying the previous count and previous size. Since we
continuously increase the MTU, the host returns us a decreasing count
value while the section size is reinitialized to 1728 bytes every
time.
This eventually leads to a condition where the calculated buf_size is
so small that the host rejects it.
Fixes: 8b5327975a ("netvsc: allow controlling send/recv buffer size")
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default receive buffer size was reduced by recent change
to a value which was appropriate for 10G and Windows Server 2016.
But the value is too small for full performance with 40G on Azure.
Increase the default back to maximum supported by host.
Fixes: 8b5327975a ("netvsc: allow controlling send/recv buffer size")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only need to wakeup the initiator after all sub-channels
are opened.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a virtual device is added dynamically (via host console), then
the vmbus sends an offer message for the primary channel. The processing
of this message for networking causes the network device to then
initialize the sub channels.
The problem is that setting up the sub channels needs to wait until
the subsequent subchannel offers have been processed. These offers
come in on the same ring buffer and work queue as where the primary
offer is being processed; leading to a deadlock.
This did not happen in older kernels, because the sub channel waiting
logic was broken (it wasn't really waiting).
The solution is to do the sub channel setup in its own work queue
context that is scheduled by the primary channel setup; and then
happens later.
Fixes: 732e49850c ("netvsc: fix race on sub channel creation")
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The limit of setting receive indirection table value should be
the current number of channels, not the VRSS_CHANNEL_MAX.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of the following code, net->num_tx_queues equals to
VRSS_CHANNEL_MAX, and max_chn is less than or equals to VRSS_CHANNEL_MAX.
netvsc_drv.c:
alloc_etherdev_mq(sizeof(struct net_device_context),
VRSS_CHANNEL_MAX);
rndis_filter.c:
net_device->max_chn = min_t(u32, VRSS_CHANNEL_MAX, num_possible_rss_qs);
So this patch removes the unnecessary limit check before comparing
with "max_chn".
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The minus one and assignment to a local variable is not necessary.
This patch simplifies it.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the parameter, num_queue in
rndis_filter_set_rss_param(), which is no longer in use.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If VF is attached then can still allow netvsc driver module to
be removed. Just have to make sure and do the cleanup.
Also, avoid extra rtnl round trip when calling unregister.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use one routine for datapath up/down. Don't need to reopen
the rndis layer.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a deadlock possible when canceling the link status
delayed work queue. The removal process is run with RTNL held,
and the link status callback is acquring RTNL.
Resolve the issue by using trylock and rescheduling.
If cancel is in process, that block it from happening.
Fixes: 122a5f6410 ("staging: hv: use delayed_work for netvsc_send_garp()")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We now remove rndis filter before unregister_netdev(), which calls
device close. It involves closing rndis filter already removed.
This patch fixes this error.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch add the functions to switch UDP hash level between
L3 and L4 by ethtool command. UDP over IPv4 and v6 can be set
differently. The default hash level is L4. We currently only
allow switching TX hash level from within the guests.
On Azure, fragmented UDP packets have high loss rate with L4
hashing. Using L3 hashing is recommended in this case.
For example, for UDP over IPv4 on eth0:
To include UDP port numbers in hasing:
ethtool -N eth0 rx-flow-hash udp4 sdfn
To exclude UDP port numbers in hasing:
ethtool -N eth0 rx-flow-hash udp4 sd
To show UDP hash level:
ethtool -n eth0 rx-flow-hash udp4
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only usage of vmbus_sendpacket_ctl was by vmbus_sendpacket.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function vmbus_sendpacket_pagebuffer_ctl was never used directly.
Just have vmbus_send_pagebuffer
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ethtool statistics for case where send chimmeny buffer is
exhausted and driver has to fall back to doing scatter/gather
send. Also, add statistic for case where ring buffer is full and
receive completions are delayed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Control the size of the buffer areas via ethtool ring settings.
They aren't really traditional hardware rings, but host API breaks
receive and send buffer into chunks. The final size of the chunks are
controlled by the host.
The default value of send and receive buffer area for host DMA
is much larger than it needs to be. Experimentation shows that
4M receive and 1M send is sufficient.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function init_page_array is always called with a valid pointer
to RNDIS header. No check for NULL is needed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Assignment to a typed pointer is sufficient in C.
No cast is needed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The send and receive buffers are both per-device (not per-channel).
The associated NUMA node is a property of the CPU which is per-channel
therefore it makes no sense to force the receive/send buffer to be
allocated on a particular node (since it is a shared resource).
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If setting new values fails, and the attempt to restore original
settings fails. Then log an error and leave device down.
This should never happen, but if it does don't go down in flames.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If VF is slaved to synthetic device, then any change to netvsc
MAC address should be propagated to the slave device.
If slave device doesn't support MAC address change then it
should also be an error to attempt to change synthetic NIC MAC
address.
It also fixes the error unwind in the original code.
If give a bad address, the old code would change the device
MAC address anyway.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When hv_pkt_iter_next() returns NULL, it has already called
hv_pkt_iter_close(). Calling it twice can lead to extra host signal.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VF device is discovered, delay bring it automatically up in
order to allow userspace to some simple changes (like renaming).
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.
The TCP conflict was a case of local variables in a function
being removed from both net and net-next.
In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Go back to switching datapath directly in the notifier callback.
Otherwise datapath might not get switched on unregister.
No need for calling the NOTIFY_PEERS notifier since that is only for
a gratitious ARP/ND packet; but that is not required with Hyper-V
because both VF and synthetic NIC have the same MAC address.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 0c195567a8 ("netvsc: transparent VF management")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With new transparent VF support, it is possible to get a deadlock
when some of the deferred work is running and the unregister_vf
is trying to cancel the work element. The solution is to use
trylock and reschedule (similar to bonding and team device).
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 0c195567a8 ("netvsc: transparent VF management")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing sub channel code did not wait for all the sub-channels
to completely initialize. This could lead to race causing crash
in napi_netif_del() from bad list. The existing code would send
an init message, then wait only for the initial response that
the init message was received. It thought it was waiting for
sub channels but really the init response did the wakeup.
The new code keeps track of the number of open channels and
waits until that many are open.
Other issues here were:
* host might return less sub-channels than was requested.
* the new init status is not valid until after init was completed.
Fixes: b3e6b82a00 ("hv_netvsc: Wait for sub-channels to be processed during probe")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements transparent fail over from synthetic NIC to
SR-IOV virtual function NIC in Hyper-V environment. It is a better
alternative to using bonding as is done now. Instead, the receive and
transmit fail over is done internally inside the driver.
Using bonding driver has lots of issues because it depends on the
script being run early enough in the boot process and with sufficient
information to make the association. This patch moves all that
functionality into the kernel.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Repeated dereference of nvmsg.msg.v1_msg.send_rndis_pkt can be
shortened by using a temporary. Do so.
No change in object code.
Miscellanea:
o Use * const for rpkt and nvchan
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a
lockdep splat indicating this seqcount is not correctly initialized, fix
that. In commit 6c80f3fc23 ("netvsc: report per-channel stats in
ethtool statistics") netdev_alloc_pcpu_stats() was removed in favor of
open-coding the 64-bits statistics, except that u64_stats_init() was
missed.
Fixes: 6c80f3fc23 ("netvsc: report per-channel stats in ethtool statistics")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).
Signed-off-by: David S. Miller <davem@davemloft.net>
Latency improvement related to NAPI conversion.
If all packets are processed from receive ring then need
to signal host.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If setting receive buffer fails, the error unwind would cause
kernel panic because it was not correctly doing RCU and NAPI
unwind. RCU'd pointer needs to be reset to NULL, and NAPI needs
to be disabled not deleted.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Optimize how receive completion ring are managed.
* Allocate only as many slots as needed for all buffers from host
* Allocate before setting up sub channel for better error detection
* Don't need to keep copy of initial receive section message
* Precompute the watermark for when receive flushing is needed
* Replace division with conditional test
* Replace atomic per-device variable with per-channel check.
* Handle corner case where receive completion send
fails if ring buffer to host is full.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The internal API was passing struct hv_page_buffer **
when only simple struct hv_page_buffer * was necessary
for passing an array.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using %p to print pointer to packet meta-data doesn't give any
good info, and exposes kernel memory offsets.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>