Commit Graph

10 Commits

Author SHA1 Message Date
Yishai Hadas
55ad359225 net/mlx4_core: Enable device recovery flow with SRIOV
In SRIOV, both the PF and the VF may attempt device recovery whenever they
assume that the device is not functioning.  When the PF driver resets the
device, the VF should detect this and attempt to reinitialize itself.

The VF must be able to reset itself under all circumstances, even
if the PF is not responsive.

The VF shall reset itself in the following cases:

1. Commands are not processed within reasonable time over the communication channel.
This is done considering device state and the correct return code based on
the command as was done in the native mode, done in the next patch.

2. The VF driver receives an internal error event reported by the PF on the
communication channel. This occurs when the PF driver resets the device or
when VF is out of sync with the PF.

Add 'VF reset' capability, which allows the VF to reinitialize itself even when the
PF is not responsive.

As PF and VF may run their reset flow simulantanisly, there are several cases
that are handled:
- Prevent freeing VF resources upon FLR, when PF is in its unloading stage.
- Prevent PF getting VF commands before it has finished initializing its resources.
- Upon VF startup, check that comm-channel is online before sending
  commands to the PF and getting timed-out.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 14:43:14 -08:00
Yishai Hadas
c69453e294 net/mlx4_core: Manage interface state for Reset flow cases
We need to manage interface state to sync between reset flow and some other
relative cases such as remove_one. This has to be done to prevent certain
races. For example in case software stack is down as a result of unload call,
the remove_one should skip the unload phase.

Implement the remove_one case, handling AER and other cases comes next.

The interface can be up/down, upon remove_one, the state will include an extra
bit indicating that the device is cleaned-up, forcing other tasks to finish
before the final cleanup.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 14:43:14 -08:00
Yishai Hadas
f5aef5aa35 net/mlx4_core: Activate reset flow upon fatal command cases
We activate reset flow upon command fatal errors, when the device enters an
erroneous state, and must be reset.

The cases below are assumed to be fatal: FW command timed-out, an error from FW
on closing commands, pci is offline when posting/pending a command.

In those cases we place the device into an error state: chip is reset, pending
commands are awakened and completed immediately. Subsequent commands will
return immediately.

The return code in the above cases will depend on the command. Commands which
free and close resources will return success (because the chip was reset, so
callers may safely free their kernel resources). Other commands will return -EIO.

Since the device's state was marked as error, the catas poller will
detect this and restart the device's software stack (as is done when a FW
internal error is directly detected). The device state is protected by a
persistent mutex lives on its mlx4_dev, as such no need any more for the
hcr_mutex which is removed.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 14:43:14 -08:00
Yishai Hadas
f6bc11e426 net/mlx4_core: Enhance the catas flow to support device reset
This includes:

- resetting the chip when a fatal error is detected (the current code
  does not do this).

- exposing the ability to enter error state from outside the catas code
  by calling its functionality. (E.g. FW Command timeout, AER error).

- managing a persistent device state. This is needed to sync between
  reset flow cases.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 14:43:14 -08:00
Yishai Hadas
ad9a0bf08f net/mlx4_core: Refactor the catas flow to work per device
Using a WQ per device instead of a single global WQ, this allows
independent reset handling per device even when SRIOV is used.

This comes as a pre-patch for supporting chip reset
for both native and SRIOV.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 14:43:14 -08:00
Yishai Hadas
872bf2fb69 net/mlx4_core: Maintain a persistent memory for mlx4 device
Maintain a persistent memory that should survive reset flow/PCI error.
This comes as a preparation for coming series to support above flows.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 14:43:13 -08:00
Kleber Sacilotto de Souza
57dbf29a54 mlx4: Add support for EEH error recovery
Currently the mlx4 drivers don't have the necessary callbacks to
implement EEH errors detection and recovery, so the PCI layer uses the
probe and remove callbacks to try to recover the device after an error on
the bus. However, these callbacks have race conditions with the internal
catastrophic error recovery functions, which will also detect the error
and this can cause the system to crash if both EEH and catas functions
try to reset the device.

This patch adds the necessary error recovery callbacks and makes sure
that the internal catastrophic error functions will not try to reset the
device in such scenarios. It also adds some calls to
pci_channel_offline() to suppress reads/writes on the bus when the slot
cannot accept I/O operations so we prevent unnecessary accesses to the
bus and speed up the device removal.

Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Acked-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-25 15:24:13 -07:00
Jack Morgenstein
d81c7186aa mlx4_core: adjust catas operation for SRIOV mode
When running in SRIOV mode, driver should not automatically start/stop
the mlx4_core upon sensing an HCA internal error -- doing this disables/enables
sriov, which will cause the hypervisor to hang if there are running VMs with
attached VFs.

In addition, on VMs the catas process should not run at all, since the HCA
error buffer is not available to VMs in the BARs.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-13 13:56:08 -05:00
Paul Gortmaker
9d9779e723 drivers/net: Add module.h to drivers who were implicitly using it
The device.h header was including module.h, making it present for
most of these drivers.  But we want to clean that up.  Call out the
include of module.h in the modular network drivers.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:07 -04:00
Jeff Kirsher
5a2cc190eb mlx4: Move the Mellanox driver
Moves the Mellanox driver into drivers/net/ethernet/mellanox/ and
make the necessary Kconfig and Makefile changes.

CC: Roland Dreier <roland@kernel.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2011-08-11 02:41:35 -07:00