Default behavior is to use the information from the upper IO stacks to
select the hardware queue to use for IO submission, which typically has
good cpu affinity.
However, on some variants of the upstream kernel, the driver has found the
queuing information to be suboptimal for FCP, or has seen IO completion
locked on particular cpus.
For command submission situations, the lpfc_fcp_io_sched module parameter
can be set to specify a hardware queue selection policy that overrides the
os stack information.
For IO completion situations, rather than queuing cq processing based on
the cpu servicing the interrupting event, schedule the cq processing on the
cpu associated with the hardware queue's cq.
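For illustration only (none of these names exist in the driver), the
completion-side idea reduces to handing the cq work to the cq's own cpu
instead of whichever cpu took the interrupt:

  /* Toy model of hardware-queue-affine cq scheduling; run_on() stands
   * in for a per-cpu deferred-work mechanism and is assumed here. */
  typedef void (*cq_handler_t)(void *cq);

  struct model_cq {
          int owner_cpu;          /* cpu tied to this hwq's cq */
          cq_handler_t handle;
  };

  extern void run_on(int cpu, cq_handler_t fn, void *arg);

  static void cq_interrupt(struct model_cq *cq, int irq_cpu)
  {
          if (irq_cpu == cq->owner_cpu)
                  cq->handle(cq);         /* already on the right cpu */
          else
                  run_on(cq->owner_cpu, cq->handle, cq);
  }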
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The XRI get/put lists were partitioned per hardware queue. However, the
adapter rarely had sufficient resources to give a large number of XRIs to
each queue. As such, it became common for a cpu to encounter a lack of XRI
resources and request the upper io stack to retry after returning a BUSY
condition. This occurred even though other cpus were idle and not using
their resources.
Create as efficient a scheme as possible to move resources to the cpus that
need them. Each cpu maintains a small private pool which it allocates from
for io. There is a watermark that the cpu attempts to keep in the private
pool. The private pool, when empty, pulls from a global pool associated
with the cpu. When the cpu's global pool is empty, it will pull from other
cpus' global pools. As there are many cpu global pools (one per cpu or
hardware queue count), and as each cpu selects which cpu to pull from at
different rates and at different times, this creates a randomizing effect
that minimizes the number of cpus that contend with each other when they
steal XRIs from another cpu's global pool.
On io completion, a cpu will push the XRI back onto its private pool. A
watermark level is maintained for the private pool such that when it is
exceeded, XRIs are moved to the cpu's global pool so that other cpus may
allocate them.
On NVME, as heartbeat commands are critical to get placed on the wire, a
single expedite pool is maintained. When a heartbeat is to be sent, it will
allocate an XRI from the expedite pool rather than the normal cpu
private/global pools. On any io completion, if a reduction in the expedite
pool is seen, it will be replenished before the XRI is placed on the cpu
private pool.
Statistics are added to aid understanding the XRI levels on each cpu and
their behaviors.
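As a rough, self-contained model of the allocation flow described above
(names invented, locking omitted):

  #define NCPU      8
  #define WATERMARK 4     /* private-pool target level */

  struct xri_pool { int count; };

  static struct xri_pool priv_pool[NCPU]; /* per-cpu private pools */
  static struct xri_pool glob_pool[NCPU]; /* per-cpu global pools */

  static int alloc_xri(int cpu)
  {
          if (priv_pool[cpu].count == 0) {
                  /* refill from own global pool, else steal */
                  for (int i = 0; i < NCPU; i++) {
                          int victim = (cpu + i) % NCPU;

                          if (glob_pool[victim].count > 0) {
                                  glob_pool[victim].count--;
                                  priv_pool[cpu].count++;
                                  break;
                          }
                  }
          }
          if (priv_pool[cpu].count == 0)
                  return -1;              /* out of XRIs: report BUSY */
          priv_pool[cpu].count--;
          return 0;
  }

  static void free_xri(int cpu)
  {
          priv_pool[cpu].count++;
          if (priv_pool[cpu].count > WATERMARK) {
                  priv_pool[cpu].count--; /* over watermark: spill */
                  glob_pool[cpu].count++; /* to this cpu's global */
          }
  }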
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
SLI4 nvme functions are passing the SLI3 ring number when posting wqe to
hardware. This should be indicating the hardware queue to use, not the ring
number.
Replace ring number with the hardware queue that should be used.
Note: SCSI avoided this issue as it utilized an older lpfc_issue_iocb
routine that properly adapts.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Many io statistics were being sampled and saved using adapter-based data
structures. This was creating a lot of contention and cache thrashing in
the I/O path.
Move the statistics to the hardware queue data structures. Given the
per-queue data structures, use of atomic types is lessened.
Add new sysfs and debugfs stat routines to collate the per hardware queue
values and report at an adapter level.
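The gist, as a hedged sketch (field names are made up): per-queue counters
are plain integers since each hardware queue is effectively cpu-local, and
the adapter-level report just sums them:

  #include <stdint.h>

  struct hwq_stat {
          uint64_t io_cmpls;      /* per-queue, no atomics needed */
          uint64_t io_errors;
  };

  static uint64_t total_io_cmpls(const struct hwq_stat *stats, int nqueues)
  {
          uint64_t sum = 0;

          for (int i = 0; i < nqueues; i++)
                  sum += stats[i].io_cmpls;  /* collate at read time */
          return sum;
  }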
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Similar to the io execution path that reports cpu context information, the
debugfs routines for cpu information need to be aligned with the new
hardware queue implementation.
Convert the debugfs and nvme cpucheck statistics to report information per
Hardware Queue.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Once the IO buffer allocations were made shared, there was a single XRI
buffer list shared by all hardware queues. A single list isn't great for
performance when shared across the per-cpu hardware queues.
Create a separate XRI IO buffer get/put list for each Hardware Queue. As
SGLs and associated IO buffers get allocated/posted to the firmware, round
robin their assignment across all available Hardware Queues so that there
is an equitable assignment.
Modify SCSI and NVME IO submit code paths to use the Hardware Queue logic
for XRI allocation.
Add a debugfs interface to display hardware queue statistics.
Add a new empty_io_bufs counter to track if a cpu runs out of XRIs.
Replace common_ variables/names with io_ to make meanings clearer.
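A minimal sketch of the round-robin distribution (structures invented for
illustration):

  struct io_buf { struct io_buf *next; };
  struct hwq    { struct io_buf *put_list; };

  static void distribute_bufs(struct hwq *hwqs, int nqueues,
                              struct io_buf *bufs, int nbufs)
  {
          int q = 0;

          for (int i = 0; i < nbufs; i++) {
                  bufs[i].next = hwqs[q].put_list; /* push onto queue q */
                  hwqs[q].put_list = &bufs[i];
                  q = (q + 1) % nqueues;           /* next queue */
          }
  }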
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently, both nvme and fcp each have their own concept of an io_channel,
which is a combination wq/cq and associated msix. Different cpus would
share an io_channel.
The driver is now moving to per-cpu wq/cq pairs and msix vectors. The
driver will still use separate wq/cq pairs per protocol on each cpu, but
the protocols will share the msix vector.
Given the elimination of the nvme and fcp io channels, the module
parameters will be removed. A new parameter, lpfc_hdw_queue, is added which
allows the wq/cq pair allocation per cpu to be overridden and set to a
lesser value. If lpfc_hdw_queue is zero, the number of pairs allocated will
be based on the number of cpus. If non-zero, the parameter specifies the
number of queues to allocate. At this time, the maximum non-zero value is
64.
To manage this new paradigm, a new hardware queue structure is created to
track queue activity and relationships.
As MSIX vector allocation must be known before setting up the
relationships, msix allocation now occurs before queue data structures are
allocated. If the number of vectors allocated is less than the desired
hardware queues, the hardware queue counts will be reduced to the number of
vectors.
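The sizing rule, as a hedged sketch (the helper is illustrative, not the
driver's code):

  static int size_hw_queues(int online_cpus, int lpfc_hdw_queue,
                            int vectors_granted)
  {
          /* 0 means size from the cpu count; non-zero overrides,
           * with 64 as the current parameter ceiling. */
          int want = lpfc_hdw_queue ? lpfc_hdw_queue : online_cpus;

          if (want > vectors_granted)     /* fewer vectors granted */
                  want = vectors_granted; /* than queues desired */
          return want;
  }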
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently, both NVME and SCSI get their IO buffers from separate
pools. XRIs are associated 1:1 with IO buffers, so XRIs are also split
between protocols.
Eliminate the independent pools and use a single pool. Each buffer
structure now has a common section and a protocol section. Per protocol
routines for SGL initialization are removed and replaced by common
routines. Initialization of the buffers is only done on the common area.
All other fields, which are protocol specific, are initialized when the
buffer is allocated for use in the per-protocol allocation routine.
In the past, the SCSI side allocated IO buffers as part of slave_alloc
calls until the maximum XRIs for SCSI was reached. As all XRIs are now
common and may be used for either protocol, allocation for everything is
done as part of adapter initialization and the scsi side has no action in
slave alloc.
As XRIs are no longer split, the lpfc_xri_split module parameter is
removed.
Adapters based on SLI3 will continue to use the older scsi_buf_list_get/put
routines. All SLI4 adapters utilize the new IO buffer scheme.
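Roughly, the buffer layout described above looks like this (field names
are placeholders, not lpfc's):

  #include <stdint.h>

  struct io_buf {
          /* common section: initialized once at pool creation */
          void     *sgl;
          uint64_t  sgl_dma;
          uint16_t  xri;
          /* protocol section: initialized when the buffer is
           * allocated for use by the per-protocol routine */
          union {
                  struct { int scsi_only; } scsi;  /* illustrative */
                  struct { int nvme_only; } nvme;  /* illustrative */
          } prot;
  };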
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
lpfc_nvme_prep_io_cmd() checks for null pnode, but caller
lpfc_nvme_fcp_io_submit() has already ensured it's non-null.
Remove the pnode null check.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
An hba-wide lock is taken in the nvme io completion routine. The lock
covers null'ing of the nrport pointer in the cmd structure.
The nrport member isn't necessary. After extracting the pointer from the
command, the pointer was dereferenced to get the fc discovery node
pointer. But the fc discovery node pointer is already in the command
structure, so the dereference was unnecessary.
Eliminate the nrport structure member and its use, which also eliminates
the port-wide lock.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Driver is setting bits in word 10 of the SLI4 ABORT WQE (the wqid). The
field was a carry over from a prior SLI revision. The field does not exist
in SLI4, and the action may result in an overlap with future definition of
the WQE.
Remove the setting of WQID in the ABORT WQE.
Also clean up WQE field settings: since the WQE is initialized to zero,
don't bother setting individual fields to zero.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual drivers: UFS, esp_scsi, NCR5380,
qla2xxx, lpfc, libsas, hisi_sas.
In addition there's a set of mostly small updates to the target
subsystem a set of conversions to the generic DMA API, which do have
some potential for issues in the older drivers but we'll handle those
as case by case fixes.
A new myrs driver for the DAC960/mylex raid controllers to replace the
block based DAC960 which is also being removed by Jens in this merge
window.
Plus the usual slew of trivial changes"
[ "myrs" stands for "MYlex Raid Scsi". Obviously. Silly of me to even
wonder. There's also a "myrb" driver, where the 'b' stands for
'block'. Truly, somebody has got mad naming skillz. - Linus ]
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (237 commits)
scsi: myrs: Fix the processor absent message in processor_show()
scsi: myrs: Fix a logical vs bitwise bug
scsi: hisi_sas: Fix NULL pointer dereference
scsi: myrs: fix build failure on 32 bit
scsi: fnic: replace gross legacy tag hack with blk-mq hack
scsi: mesh: switch to generic DMA API
scsi: ips: switch to generic DMA API
scsi: smartpqi: fully convert to the generic DMA API
scsi: vmw_pscsi: switch to generic DMA API
scsi: snic: switch to generic DMA API
scsi: qla4xxx: fully convert to the generic DMA API
scsi: qla2xxx: fully convert to the generic DMA API
scsi: qla1280: switch to generic DMA API
scsi: qedi: fully convert to the generic DMA API
scsi: qedf: fully convert to the generic DMA API
scsi: pm8001: switch to generic DMA API
scsi: nsp32: switch to generic DMA API
scsi: mvsas: fully convert to the generic DMA API
scsi: mvumi: switch to generic DMA API
scsi: mpt3sas: switch to generic DMA API
...
The driver currently uses the ndlp to get the local rport which is then used
to get the nvme transport remoteport pointer. There can be cases where a stale
remoteport pointer is obtained as synchronization isn't done through the
different dereferences.
Correct by using locks to synchronize the dereferences.
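The pattern, hedged (lock and structures are stand-ins for the driver's
own): hold one lock across the whole chain of dereferences so no step can
observe a stale pointer:

  struct remoteport;
  struct rport { struct remoteport *remoteport; };
  struct ndlp  { struct rport *rport; };

  extern void lock(void);
  extern void unlock(void);

  static struct remoteport *get_remoteport(struct ndlp *ndlp)
  {
          struct remoteport *remoteport = NULL;

          lock();                 /* covers both dereferences */
          if (ndlp->rport)
                  remoteport = ndlp->rport->remoteport;
          unlock();
          return remoteport;
  }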
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/scsi/lpfc/lpfc_nvme.c: In function 'lpfc_new_nvme_buf':
drivers/scsi/lpfc/lpfc_nvme.c:2238:24: warning:
variable 'sgl_size' set but not used [-Wunused-but-set-variable]
int bcnt, num_posted, sgl_size;
^
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Message 6408 is displayed for each entry in an array, but the cpu and queue
numbers were incorrect for the entry. Message 6001 includes an extraneous
character.
Resolve both issues.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver allocates a sg list per io structure based on a fixed maximum
size. When it registers with the protocol transports and indicates the max sg
list size it supports, the driver manipulates the fixed value to report a
lesser amount so that it has reserved space for sg elements that are used for
DIF.
The driver initialization path sets the cfg_sg_seg_cnt field to the
manipulated value for scsi. NVME initialization ran afterward and capped its
maximum by the manipulated value for SCSI. This erroneously made NVME report
the SCSI-reduced-for-DIF value, which reduced the max io size for nvme and
wasted sg elements.
Rework the driver so that cfg_sg_seg_cnt becomes the overall maximum size and
allow the max size to be tunable. A separate (new) scsi sg count is then
setup with the scsi-modified reduced value. NVME then initializes based off
the overall maximum.
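A plausible rendering of the new relationship (variable and macro names
are guesses, not the driver's):

  #define DIF_RESERVED_SGES 2     /* illustrative DIF reservation */

  static void derive_sg_counts(int cfg_sg_seg_cnt,
                               int *scsi_sg_cnt, int *nvme_sg_cnt)
  {
          *scsi_sg_cnt = cfg_sg_seg_cnt - DIF_RESERVED_SGES;
          *nvme_sg_cnt = cfg_sg_seg_cnt; /* from overall maximum */
  }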
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Performance is affected when target queue depth is tracked. An atomic
counter is incremented on the submission path which competes with it being
decremented on the completion path. In addition, multiple CPUs can
simultaneously be manipulating this counter for the same ndlp.
Reduce the overhead by only performing the target increment/decrement when
the target queue depth is less than the overall adapter depth and thus is
actually meaningful.
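Sketched with invented names (a real implementation must also remember,
per io, whether it counted, so the decrement matches the increment):

  #include <stdatomic.h>
  #include <stdbool.h>

  struct target { atomic_int cmd_cnt; int cmd_qdepth; };

  static bool should_track(const struct target *t, int adapter_depth)
  {
          return t->cmd_qdepth < adapter_depth; /* only then meaningful */
  }

  static void on_submit(struct target *t, int adapter_depth)
  {
          if (should_track(t, adapter_depth))
                  atomic_fetch_add(&t->cmd_cnt, 1);
  }

  static void on_complete(struct target *t, int adapter_depth)
  {
          if (should_track(t, adapter_depth))
                  atomic_fetch_sub(&t->cmd_cnt, 1);
  }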
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During remote port loss fault testing, the driver crashed with the
following trace:
general protection fault: 0000 [#1] SMP
RIP: ... lpfc_nvme_register_port+0x250/0x480 [lpfc]
Call Trace:
lpfc_nlp_state_cleanup+0x1b3/0x7a0 [lpfc]
lpfc_nlp_set_state+0xa6/0x1d0 [lpfc]
lpfc_cmpl_prli_prli_issue+0x213/0x440
lpfc_disc_state_machine+0x7e/0x1e0 [lpfc]
lpfc_cmpl_els_prli+0x18a/0x200 [lpfc]
lpfc_sli_sp_handle_rspiocb+0x3b5/0x6f0 [lpfc]
lpfc_sli_handle_slow_ring_event_s4+0x161/0x240 [lpfc]
lpfc_work_done+0x948/0x14c0 [lpfc]
lpfc_do_work+0x16f/0x180 [lpfc]
kthread+0xc9/0xe0
ret_from_fork+0x55/0x80
After registering a new remoteport, the driver is pulling an ndlp pointer
from the lpfc rport associated with the private area of a newly registered
remoteport. The private area is uninitialized, so it's garbage.
Correct by pulling the lpfc rport pointer from the entering ndlp, then
taking the ndlp value from that rport. Note the entering ndlp may be
replaced by rport->ndlp due to an address change swap.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The PBDE optimizations aren't supported in all firmware revs.
Make optimizations configurable in case there's a side effect on old
firmware.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
System crashes when the lpfc module is unloaded after making the port
offline.
The nvme queue pointers were freed during port offline, but were later
accessed in pci remove path.
Validate the pointers in pci remove path before accessing them.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
modprobe -r lpfc produces the following:
Call Trace:
__blk_mq_run_hw_queue+0xa2/0xb0
__blk_mq_delay_run_hw_queue+0x9d/0xb0
? blk_mq_hctx_has_pending+0x32/0x80
blk_mq_run_hw_queue+0x50/0xd0
blk_mq_sched_insert_request+0x110/0x1b0
blk_execute_rq_nowait+0x76/0x180
nvme_keep_alive_work+0x8a/0xd0 [nvme_core]
process_one_work+0x17f/0x440
worker_thread+0x126/0x3c0
? manage_workers.isra.24+0x2a0/0x2a0
kthread+0xd1/0xe0
? insert_kthread_work+0x40/0x40
ret_from_fork_nospec_begin+0x21/0x21
? insert_kthread_work+0x40/0x40
However, rmmod lpfc would run correctly.
When an nvme remoteport is unregistered with the host nvme transport, it
needs to set the remoteport->dev_loss_tmo value to 0 to indicate an
immediate termination of device loss and prevent any further keep alives to
that rport. The driver was never setting dev_loss_tmo, causing the nvme
transport to continue to send the keep alive.
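The shape of the fix, as I read it (nvme_fc_set_remoteport_devloss() is
the nvme-fc transport helper for this; verify the exact call sites in the
tree):

  /* in the driver's rport unregister path; error handling omitted */
  nvme_fc_set_remoteport_devloss(remoteport, 0); /* immediate dev loss,
                                                  * stop keep alives */
  nvme_fc_unregister_remoteport(remoteport);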
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Under large configurations, the driver would start to log message 6065 -
NVME out of buffers (exchanges).
The driver is using the ndlp cmd_qdepth value when determining the max
outstanding ios for an adapter. This value, by default, is set to 65536,
which exceeds the maximum exchange counts supported on an adapter. The ndlp
cmd_qdepth has no relevance here; the outstanding io count should be
capped at the max exchange count, with IO requests beyond that level
getting bounced back with an EBUSY status so that they are retried by the
block layer.
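Illustrative check only (names invented); the point is the cap and the
EBUSY bounce that lets blk-mq retry:

  #include <errno.h>

  static int check_io_budget(int outstanding_ios, int max_exchanges)
  {
          if (outstanding_ios >= max_exchanges)
                  return -EBUSY;  /* retried by the block layer */
          return 0;
  }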
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fix small formatting and wording nits in the Broadcom copyright header.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fix up log messages and add an fcp error stat counter in the IO submit
code path to make diagnosing problems easier.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
I/O submission paths in the lpfc nvme path are rejecting the io with an
error code that reflects back to the caller as a hard io failure. Many
of these conditions are transient and would likely resolve if retried.
Correct by returning -EBUSY, which the FC transport triggers off of to
return busy status codes to the blk-mq layer.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Code paths referencing local port structures didn't accommodate cases
where the localport may not be registered yet.
Add NULL pointer checks to logic.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
On tests adding and removing a remote port, calls to nvme_info would
eventually show fewer target ports discovered than were present in the
san. Additionally, the following error messages were seen:
6031 RemotePort Registration failed err: -116, DID x471301
There is a race condition that exists between the driver and the nvme
transport on remote port unregister vs the confirmed deletion. It's
possible that the driver may rediscover the remote port and reregister it
before a prior unregister's delete callback was made (as the transport
rebound to the prior remoteport structure). However, the driver was coded
to expect the callback before seeing the remote port again and making a
new registration. The logic resulted in the driver having an invalid
remoteport pointer set.
Correct by tracking when waiting for the delete callback. In cases where
the ndlp remoteport pointer is updated, it is only cleared when the wait
has not been superseded by a prior registration.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During target-side port faults, the driver would not recover all target
port logins. This resulted in a loss of nvme device discovery.
The driver is coded to wait for all GID_FT requests to complete before
restarting discovery. A fault is seen where the outstanding GID_FT
counts are not properly decremented, thus discovery would never
start. Another fault was found in the clearing of the gidft_inp counter
that would be skipped in this condition. And a third fault was found in
lpfc_nvme_register_port, which would remove a reference on the ndlp,
allowing a node swap on a port address change to prematurely remove the
reference and release the ndlp.
The following changes are made:
- Correct the decrementing of the outstanding GID_FT counters.
- In RSCN handling, no longer zero the counter before calling to issue
another GID_FT.
- No longer remove the reference on the ndlp when the ndlp->nrport value
is not yet null.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After making remoteport unregister requests, the ndlp nrport pointer was
stale.
Track when waiting for the unregister completion callback and adjust the
ndlp pointer assignment. Add a few safety checks for NULL
pointer values.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When debugging various issues, per IO channel IO statistics were useful
to understand what was happening. However, many of the stats were on a
port basis rather than an io channel basis.
Move statistics to an io channel basis.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There are several unions that are local to the source and do not need to
be in global scope, so make them static. Also add in a missing void
parameter to functions lpfc_nvme_cmd_template and
lpfc_nvmet_cmd_template to clean up non-ANSI warnings.
Cleans up sparse warnings:
drivers/scsi/lpfc/lpfc_nvme.c:68:19: warning: symbol
'lpfc_iread_cmd_template' was not declared. Should it be static?
drivers/scsi/lpfc/lpfc_nvme.c:69:19: warning: symbol
'lpfc_iwrite_cmd_template' was not declared. Should it be static?
drivers/scsi/lpfc/lpfc_nvme.c:70:19: warning: symbol
'lpfc_icmnd_cmd_template' was not declared. Should it be static?
drivers/scsi/lpfc/lpfc_nvme.c:74:24: warning: non-ANSI function
declaration of function 'lpfc_nvme_cmd_template'
drivers/scsi/lpfc/lpfc_nvmet.c:77:19: warning: symbol
'lpfc_tsend_cmd_template' was not declared. Should it be static?
drivers/scsi/lpfc/lpfc_nvmet.c:78:19: warning: symbol
'lpfc_treceive_cmd_template' was not declared. Should it be static?
drivers/scsi/lpfc/lpfc_nvmet.c:79:19: warning: symbol
'lpfc_trsp_cmd_template' was not declared. Should it be static?
drivers/scsi/lpfc/lpfc_nvmet.c:83:25: warning: non-ANSI function
declaration of function 'lpfc_nvmet_cmd_template'
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
To reduce latency when initializing WQE content, create templates for the
most common wqes. This reduces the number of operations taken to set the
content. It's not a lot of speed up, but every bit helps.
This patch updates the NVME initiator path.
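A toy version of the template trick (structure and field offsets are
stand-ins, not the SLI4 layout): build the invariant words once at init,
then per-io preparation is a copy plus the few variable fields:

  #include <stdint.h>
  #include <string.h>

  struct wqe { uint32_t words[16]; };     /* 64-byte wqe stand-in */

  static struct wqe iread_template;       /* filled once at init */

  static void init_templates(void)
  {
          memset(&iread_template, 0, sizeof(iread_template));
          iread_template.words[7] = 0x1234; /* invariant command bits */
  }

  static void prep_iread(struct wqe *wqe, uint32_t xri, uint32_t len)
  {
          *wqe = iread_template;  /* one copy sets most fields */
          wqe->words[6] = xri;    /* per-io fields only */
          wqe->words[12] = len;
  }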
[mkp: fixed typo]
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver is very sloppy about the WQE structure passed between routines.
The base struct type is a 64-byte WQE, but many routines typecast and
access it as a 128-byte WQE. There were a couple of cases in the past
(corrected already) where the typecasts were done incorrectly and the
64-byte buffer was accessed as a 128-byte buffer.
Clean this up by properly declaring WQEs as 128-byte WQEs and removing the
typecasts. 64-byte WQEs are considered a subset of the 128-byte WQEs.
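The cleanup's shape, roughly (types are illustrative, not lpfc's): declare
the full 128-byte WQE and treat the 64-byte form as its leading subset, so
no size-changing typecasts are needed:

  #include <stdint.h>

  struct wqe64  { uint32_t words[16]; };  /* 64 bytes */
  struct wqe128 {
          struct wqe64 base;              /* 64-byte subset first */
          uint32_t     ext[16];           /* remaining 64 bytes */
  };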
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The hardware offload for NVME commands was created when the
FC-NVME standard was setting SGL Descriptor Type to SGL Data
Block Descriptor (0h) and SGL Descriptor Sub Type to Address (0h).
A late change in NVMe-over-Fabrics obsoleted these values, creating
a transport SGL descriptor type with new values to go into these
fields.
For initial hardware support, in order to be compliant to the spec,
use host-supplied cmd IU buffers instead of the adapter generated
values. Later hardware will correct this.
Add a module parameter to override this offload disablement if looking
for lowest latency. This is reasonable as nothing in FC-NVME uses
the SQE SGL values.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Newer hardware more strictly enforces buffer lengths, causing a mis-set
value to be identified. Older hardware won't catch it; the difference is
benign there.
Set the right embedded buffer length for nvme ios.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The current driver isn't taking advantage of a performance hint whereby
the initial data buffer descriptor can be placed in the WQE as well as
the SGL.
Add the logic to detect support for the feature and to use it when
supported.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Updated Copyright in files updated 11.4.0.7
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In a test that is doing large numbers of cable swaps on the target, the
nvme controllers wouldn't reconnect.
During the cable swaps, the targets n_port_id would change. This
information was passed to the nvme-fc transport, in the new remoteport
registration. However, the nvme-fc transport didn't update the n_port_id
value in the remoteport struct when it reused an existing structure.
Later, when a new association was attempted on the remoteport, the
driver's NVME LS routine would use the stale n_port_id from the
remoteport struct to address the LS. As the device is no longer at that
address, the LS would go into never never land.
Separately, the nvme-fc transport will be corrected to update the
n_port_id value on a re-registration.
However, for now, there's no reason to use the transport's values. The
private pointer points to the driver's node structure and the node
structure is up to date. Therefore, revise the LS routine to use the
driver's data structures for the LS. Augment the debug message for better
debugging in the future.
Also remove a duplicate if check that seems to have slipped in.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A stress test repeatedly resetting the adapter while performing io would
eventually report I/O failures and missing nvme namespaces.
The driver was setting the nvmefc_fcp_req->private pointer to NULL
during the IO completion routine before upcalling done(). If the
transport was also running an abort for that IO, the driver would fail
the abort with message 6140. Failing the abort is not allowed by the
nvme-fc transport, as it mandates that the io must be returned back to
the transport. As that does not happen, the transport controller delete
has an outstanding reference and can't complete teardown.
The NULL-ing of the private pointer should be done only when the io is
considered complete. It's complete when the adapter returns the exchange
with the "exchange busy" flag clear.
Move the NULL'ing of the structure to the done case. This leaves the io
context set while the exchange is busy, until the subsequent XRI_ABORTED
completion that returns the exchange is received.
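Sketch of the corrected ordering (flag and field names invented):

  #include <stdbool.h>
  #include <stddef.h>

  struct fcp_req { void *private; };

  static void io_complete(struct fcp_req *req, bool exchange_busy,
                          void (*done)(struct fcp_req *))
  {
          if (!exchange_busy)
                  req->private = NULL;    /* truly complete: clear */
          /* else: keep the context until the XRI_ABORTED completion
           * returns the exchange; an abort may still reference it. */
          done(req);
  }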
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If log verbose is not turned on, it's hard to tell when certain error
paths get hit. Add stats counters and corresponding logic to
debugfs/sysfs to aid in understanding what paths were traversed.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When unregistering a remote port the lpfc driver would eventually wait
for the remoteport_unreg done callback. But the driver never completed
the io aborts that would allow the connections to terminate thus the
unreg done callback was never issued. Turns out the coding style of the
driver allowed for the wait to occur on the same cpu that the deferred
isr is called on. The blocking for the wait blocked the isr, and as the
isr didn't run, the io aborts wouldn't finish.
Turns out there was never a good reason to block waiting for the unreg
done in the first place. The driver can continue execution and the ref
counting within the driver will do the right thing.
Resolve by removing the wait and patching up a few cases where the ref
counting didn't look right - mainly cases where the remote port comes
back before the aborts had completed and the unreg done had been
called. Additionally, a few places which used pointer values to guide
driver actions weren't protected by lock, so correct those.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
NVME targets appear to randomly disconnect from the initiator when
running heavy IO.
The error is due to the host's aggregate (across all controllers) io load
being beyond the maximum exchange count for nvme on the adapter. The
driver was properly returning a resource busy status, but the io load
was so great that heartbeat commands would be bounced and not have a
successful retry within the fuzz amount for the nvme heartbeat (yes, a
very high io load!). Thus the target was terminating the controller due
to a keep alive failure.
Resolve by reserving a few exchanges (by counters) which can be used
when the adapter is out of normal exchanges and the command is a NVME
heartbeat command. As counters are used, while the reserved command is
outstanding, as soon as any other exchange completes, the counters are
adjusted and the reserved count is replenished. The heartbeat completes
execution in a normal fashion.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The logic for sg_seg_cnt is a bit convoluted. This patch tries to clean
up a couple of areas, especially around the +2 and +1 logic.
This patch:
- Cleans up the lpfc_sg_seg_cnt attribute to specify a real minimum
rather than making the minimum be whatever the default is.
- Removes the hardcoding of +2 (for the number of elements we use in a
sgl for cmd iu and rsp iu) and +1 (an additional entry to compensate
for nvme's reduction of io size based on a possible partial page)
logic in sg list initialization. In the case where the +1 logic is
referenced in host and target io checks, use the values set in the
transport template as that value was properly set.
There can certainly be more done in this area and it will be addressed
in combined host/target driver effort.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During driver unload, the driver may crash due to NULL pointers. The
NULL pointers were due to the driver not protecting itself sufficiently
during some of the teardown paths. Additionally, the driver was not
waiting for and cleaning up nvme io resources. As such, the driver wasn't
making the callbacks to the transport, stalling the transport's
association teardown.
This patch waits for io cleanup before tearing down and adds checks
for possible NULL pointers.
Cc: <stable@vger.kernel.org> # 4.12+
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When the driver is unloading, the nvme transport could be in the process
of submitting new requests, will send abort requests to terminate
associations, or may make LS-related requests. The driver's abort and
request entry points are currently ignorant of the unloading state and
start the requests even though the infrastructure to complete them
continues to be torn down.
Change the entry points for new requests to check whether unloading and
if so, reject the requests. Abort routines check unloading, and if so,
noop the request. An abort is noop'd as the teardown paths are already
aborting/terminating the io outstanding at the time the teardown
initiated.
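The guard, sketched with invented names (lpfc's actual unloading flag and
entry-point signatures differ):

  #include <errno.h>
  #include <stdbool.h>

  static int fcp_io_submit(bool unloading /* , ... */)
  {
          if (unloading)
                  return -ENODEV; /* reject: teardown in progress */
          /* ... normal submission ... */
          return 0;
  }

  static void fcp_abort(bool unloading /* , ... */)
  {
          if (unloading)
                  return;         /* noop: teardown aborts the io */
          /* ... issue the abort ... */
  }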
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver's interaction with the host nvme transport has been incorrect
for a while. The driver did not wait for the unregister callbacks
(waited only 5 jiffies). Thus the driver may remove objects that may be
referenced by subsequent abort commands from the transport, and the
actual unregister callback was effectively a noop. This was especially
problematic if the driver was unloaded.
The driver now waits for the unregister callbacks, as it should, before
continuing with teardown.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver currently registers any remote port that has NVME support.
It should only be registering target ports.
Register only target ports.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas,
megaraid_sas, pm80xx, mpt3sas, be2iscsi, hpsa. and a host of minor
updates.
There's no major behaviour change or additions to the core in all of
this, so the potential for regressions should be small (biggest
potential being in the scsi error handler changes)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (203 commits)
scsi: lpfc: Fix hard lock up NMI in els timeout handling.
scsi: mpt3sas: remove a stray KERN_INFO
scsi: mpt3sas: cleanup _scsih_pcie_enumeration_event()
scsi: aacraid: use timespec64 instead of timeval
scsi: scsi_transport_fc: add 64GBIT and 128GBIT port speed definitions
scsi: qla2xxx: Suppress a kernel complaint in qla_init_base_qpair()
scsi: mpt3sas: fix dma_addr_t casts
scsi: be2iscsi: Use kasprintf
scsi: storvsc: Avoid excessive host scan on controller change
scsi: lpfc: fix kzalloc-simple.cocci warnings
scsi: mpt3sas: Update mpt3sas driver version.
scsi: mpt3sas: Fix sparse warnings
scsi: mpt3sas: Fix nvme drives checking for tlr.
scsi: mpt3sas: NVMe drive support for BTDHMAPPING ioctl command and log info
scsi: mpt3sas: Add-Task-management-debug-info-for-NVMe-drives.
scsi: mpt3sas: scan and add nvme device after controller reset
scsi: mpt3sas: Set NVMe device queue depth as 128
scsi: mpt3sas: Handle NVMe PCIe device related events generated from firmware.
scsi: mpt3sas: API's to remove nvme drive from sml
scsi: mpt3sas: API 's to support NVMe drive addition to SML
...
The ! has higher precedence than the & operation. I've added
parentheses so this works as intended.
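The bug class in miniature (FLAG is a placeholder):

  #include <stdbool.h>

  #define FLAG 0x4

  static bool flag_clear(unsigned int flags)
  {
          /* '!' binds tighter than '&': "!flags & FLAG" means
           * "(!flags) & FLAG", which is almost never intended. */
          return !(flags & FLAG); /* parenthesized, as in the fix */
  }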
Fixes: 952c303b32 ("scsi: lpfc: Ensure io aborts interlocked with the target.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The internal cfg flag is actually smaller, by 1 (for a partial page
sge), than the sg list maintained by the driver. Thus the check on sg
segments errored out when it shouldn't have.
Ensure the check is +1.
Note: carrying a value that is smaller than the real one is bogus.
Correcting it now would be a significant rework. Add this item to the
list to be refactored in the merge with efct.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>