We can request task management IOCTL command(MPI2_FUNCTION_SCSI_TASK_MGMT)
to /dev/mpt3ctl. If the given task_type is either abort task or query
task, it may need a field named "Initiator Port Transfer Tag to Manage" in
the IU.
Current code does not support to check target IPTT tag from the tm_request.
This patch introduces to check TaskMID given from the userspace as a target
tag. We have a rule of relationship between
(struct request *req->tag) and smid in mpt3sas_base.c:
3318 u16
3319 mpt3sas_base_get_smid_scsiio(struct MPT3SAS_ADAPTER *ioc, u8 cb_idx,
3320 struct scsi_cmnd *scmd)
3321 {
3322 struct scsiio_tracker *request = scsi_cmd_priv(scmd);
3323 unsigned int tag = scmd->request->tag;
3324 u16 smid;
3325
3326 smid = tag + 1;
So if we want to abort a request tagged #X, then we can pass (X + 1) to
this IOCTL handler. Otherwise, user space just can pass 0 TaskMID to abort
the first outstanding smid which is legacy behaviour.
Cc: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: MPT-FusionLinux.pdl@broadcom.com
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
Acked-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There is a copy and paste bug here. It uses EVENT_TRIGGERS size instead of
SCSI_TRIGGERS size but fortunately both size are 84 bytes so it doesn't
affect runtime.
These days the preferred style is to just say sizeof(object) instead of
sizeof(type) so I have updated the function to the latest style as well.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When using a virt_boundary_mask, as done for NVMe devices attached to
mpt3sas controllers, we require an unlimited max_segment_size as the virt
boundary merging code assumes that. But we also need to propagate that to
the DMA mapping layer to make dma-debug happy. The SCSI layer takes care
of that when using the per-host virt_boundary setting, but given that
mpt3sas only wants to set the virt_boundary for actual NVMe devices, we
can't rely on that. The DMA layer maximum segment is global to the HBA
however, so we have to set it explicitly. This patch assumes that mpt3sas
does not have a segment size limitation, which seems true based on the SGL
format, but will need to be verified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Enable msix load balance only when combined reply queue mode is disabled on
the SAS3 and above generation HBA devices.
Earlier msix load balance used to enable if the number of online cpus is
greater than the number of MSI-X vectors enabled on the HBA. Combined reply
queue mode will be disabled only on those HBA which works in shared
resources mode. I.e. on SAS3 HBAs it will be <= 8 and on SAS35 HBA devices
it will be <= 16.
- Before this patch if system has 256 logical CPUs and HBA exposes 128
MSI-X vectors, driver will enable msix load balance.
- After this patch if system has 256 logical CPUs and HBA exposes 128
MSI-X vectors, driver will disable msix load balance.
- After this patch if system has 256 logical CPUs and HBA exposes 16 MSI-X
vectors (due to combined reply queue mode being off in HW), driver will
enable msix load balance.
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Even though 'smp_affinity_enable' module parameter is enabled, if the
number of online CPUs is bigger than the number of msix vectors enabled on
that HBA, then smp affinity settings should be disabled only for this HBA.
But currently the smp affinity setting is disabled globally and hence smp
affinity will be disabled for subsequent HBAs even though number of msix
vectors enabled for this HBA matches the number of online CPU.
To fix this, define a per HBA variable smp_affinity_enable. Initially this
variable is initialized with smp_affinity_enable module parameter value. If
this HBA has less number of msix vectors configured when compared to number
of online cpus, then only this HBA's variable smp_affinity_enable is set to
zero.
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When enabling high iops queues, the driver should use the HBA's configured
PCIe link speed instead of looking for the maximum link speed.
I.e. enable high iops queues only if Aero/Sea HBA's configured PCIe link
speed is set to 16GT/s.
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently default perf_mode is set to 'balanced' on Intel architecture
machines and on other machines default perf_mode is set to 'latency' mode.
This CPU architecture check is removed and the default perf_mode mode is
set to 'balanced' mode on all machines.
User can choose the required performance mode using perf_mode module
parameter.
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Use existing macros. No functional change.
[mkp: typo]
Signed-off-by: Tomas Henzl <thenzl@redhat.com>
Acked-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Support is easier with all driver parameters visible in sysfs. Also I've
replaced a constant with an octal permission.
Signed-off-by: Tomas Henzl <thenzl@redhat.com>
Acked-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where
we are expecting to fall through.
This patch fixes the following warning:
drivers/scsi/mpt3sas/mpt3sas_base.c: In function _base_update_ioc_page1_inlinewith_perf_mode :
drivers/scsi/mpt3sas/mpt3sas_base.c:4510:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (ioc->high_iops_queues) {
^
drivers/scsi/mpt3sas/mpt3sas_base.c:4530:2: note: here
case MPT_PERF_MODE_LATENCY:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough.
Fixes: 30cb97023f38 ("scsi: mpt3sas: Introduce perf_mode module parameter")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Update driver version from 28.100.00.00 to 29.100.00.00
This is equivalent to Phase 10 OOB driver.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
1. Introduce module parameter perf_mode for only Aero/Sea generation HBAs.
2. Update IOC page1 fields according to performance mode.
Below are the performance modes that can be enabled with module parameter
perf_mode:
0: Balanced - Few high iops reply queues will be enabled. Interrupt
coalescing will be enabled only for these high iops reply descriptor
queues.
1: Iops - Interrupt coalescing will be enabled on all reply queues.
Coalescing timeout is set to 0x20.This is default value for Aero.
2: Latency - Interrupt coalescing will be enabled on all reply queues.
Coalescing timeout is set to 0xA. This is a legacy behavior similar to
Ventura & Invader HBA series.
Default perf mode set by driver will be balanced mode if the following
conditions are met:
- CPU vendor = Intel;
- Aero controller working in 16GT/s pcie speed
Performance mode will be set to latency mode for all other cases.
4k Random Read IO performance numbers on 24 SAS SSD drives for above three
permormance modes. Performance data is from Intel Skylake and HGST SS300
(drive model SDLL1DLR400GCCA1).
IOPs:
-----------------------------------------------------------------------
|perf_mode | qd = 1 | qd = 64 | note |
|-------------|--------|---------|-------------------------------------
|balanced | 259K | 3061k | Provides max performance numbers |
| | | | both on lower QD workload & |
| | | | also on higher QD workload |
|-------------|--------|---------|-------------------------------------
|iops | 220K | 3100k | Provides max performance numbers |
| | | | only on higher QD workload. |
|-------------|--------|---------|-------------------------------------
|latency | 246k | 2226k | Provides good performance numbers |
| | | | only on lower QD worklaod. |
-----------------------------------------------------------------------
Avarage Latency:
-----------------------------------------------------
|perf_mode | qd = 1 | qd = 64 |
|-------------|--------------|----------------------|
|balanced | 92.05 usec | 501.12 usec |
|-------------|--------------|----------------------|
|iops | 108.40 usec | 498.10 usec |
|-------------|--------------|----------------------|
|latency | 97.10 usec | 689.26 usec |
-----------------------------------------------------
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Enable interrupt coalescing only on high iops queues.
In ioc config page 1, offset 0x14 (ProductSpecific field) is used to
determine interrupt coalescing enabled/disabled on per reply descriptor
post queue group(8) basis. If 31st bit is zero, then interrupt coalescing
is enabled for all reply descriptor post queues. If 31st bit is set to one,
then user can enable/disable interrupt coalescing on per reply descriptor
post queue group(8) basis. So to enable interrupt coalescing only on first
reply descriptor post queue group (i.e. on high iops queues), set bit 0 and
31.
This configuration should reset during driver unload or shutdown to the
default settings. For this, the driver takes copy of default ioc page 1 and
copies back the default or unmodified ioc page1 during unload and
shutdown. This means that on next driver load (e.g. if older version driver
is loaded by user), current modified changes on ioc page1 won't take
effect.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
High iops queues are mapped to non-managed irqs. Set affinity of
non-managed irqs to local numa node. Low latency queues are mapped to
managed irqs.
Driver reserves some reply queues for max iops (through
pci_alloc_irq_vectors_affinity and .pre_vectors interface). The rest of
queues are for low latency.
Based on io workload in io submission path, driver will decide which group
of reply queues (either high iops queues or low latency queues) to be
used. High iops queues will be mapped to local numa node of controller and
low latency queues will be mapped to cpus across numa nodes. In general,
high iops and low latency queues should fit into 128 reply queues
which is the max number of reply queues supported by Aero/Sea.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In the IO submission path _base_get_msix_index is called twice. Initially
while getting the smid and subsequently while posting the request
descriptor (RD).
Refactor code to query msix index only while posting the request
descriptor. Save determined msix index in msix_io field.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver will use round-robin method for io submission in batches within
the high iops queues when the number of in-flight ios on the target device
is larger than 8. Otherwise the driver will use low latency reply queues.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Code refactoring.
In function _base_get_msix_index, add scmd as second argument. This change
is made in preparation for the next patch where we introduce a new function
to get the MSI-X index for high iops queues.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Aero controllers support balanced performance mode through the ability to
configure queues with different properties.
Reply queues with interrupt coalescing enabled are called "high iops reply
queues" and reply queues with interrupt coalescing disabled are called "low
latency reply queues".
The driver configures a combination of high iops and low latency reply
queues if:
- HBA is an AERO controller;
- MSI-X vectors supported by the HBA is 128;
- Total CPU count in the system more than high iops queue count;
- Driver is loaded with default max_msix_vectors module parameter; and
- System booted in non-kdump mode.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If the Aero HBA supports Atomic Request Descriptors, it sets the Atomic
Request Descriptor Capable bit in the IOCCapabilities field of the IOCFacts
Reply message. Driver uses an Atomic Request Descriptor as an alternative
method for posting an entry onto a request queue.
The posting of an Atomic Request Descriptor is an atomic operation,
providing a safe mechanism for multiple processors on the host to post
requests without synchronization. This Atomic Request Descriptor format is
identical to first 32 bits of Default Request Descriptor and uses only 32
bits.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This code refactoring introduces function pointers.
Host uses Request Descriptors of different types for posting an entry onto
a request queue. Based on controller type and capabilities, host can also
use atomic descriptors other than normal descriptors. Using function
pointer will avoid if-else statements
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In _ctl_ioctl_main(), 'ioctl_header' is fetched the first time from
userspace. 'ioctl_header.ioc_number' is then checked. The legal result is
saved to 'ioc'. Then, in condition MPT3COMMAND, the whole struct is fetched
again from the userspace. Then _ctl_do_mpt_command() is called, 'ioc' and
'karg' as inputs.
However, a malicious user can change the 'ioc_number' between the two
fetches, which will cause a potential security issues. Moreover, a
malicious user can provide a valid 'ioc_number' to pass the check in first
fetch, and then modify it in the second fetch.
To fix this, we need to recheck the 'ioc_number' in the second fetch.
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Acked-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This is mostly update of the usual drivers: qla2xxx, qedf, smartpqi,
hpsa, lpfc, ufs, mpt3sas, ibmvfc and hisi_sas. Plus number of minor
changes, spelling fixes and other trivia.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXNIK0yYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishbeFAP4wsOm6
BNqDvLbF7gHAH+oSVAeENd9tepXG2xKbUHmLRgEA1wPaUxon8L8v/SSsqwewqMXP
y6CnU6aO2iaViTNBsPs=
=RX74
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly update of the usual drivers: qla2xxx, qedf, smartpqi,
hpsa, lpfc, ufs, mpt3sas, ibmvfc and hisi_sas. Plus number of minor
changes, spelling fixes and other trivia"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (298 commits)
scsi: qla2xxx: Avoid that lockdep complains about unsafe locking in tcm_qla2xxx_close_session()
scsi: qla2xxx: Avoid that qlt_send_resp_ctio() corrupts memory
scsi: qla2xxx: Fix hardirq-unsafe locking
scsi: qla2xxx: Complain loudly about reference count underflow
scsi: qla2xxx: Use __le64 instead of uint32_t[2] for sending DMA addresses to firmware
scsi: qla2xxx: Introduce the dsd32 and dsd64 data structures
scsi: qla2xxx: Check the size of firmware data structures at compile time
scsi: qla2xxx: Pass little-endian values to the firmware
scsi: qla2xxx: Fix race conditions in the code for aborting SCSI commands
scsi: qla2xxx: Use an on-stack completion in qla24xx_control_vp()
scsi: qla2xxx: Make qla24xx_async_abort_cmd() static
scsi: qla2xxx: Remove unnecessary locking from the target code
scsi: qla2xxx: Remove qla_tgt_cmd.released
scsi: qla2xxx: Complain if a command is released that is owned by the firmware
scsi: qla2xxx: target: Fix offline port handling and host reset handling
scsi: qla2xxx: Fix abort handling in tcm_qla2xxx_write_pending()
scsi: qla2xxx: Fix error handling in qlt_alloc_qfull_cmd()
scsi: qla2xxx: Simplify qlt_send_term_imm_notif()
scsi: qla2xxx: Fix use-after-free issues in qla2xxx_qpair_sp_free_dma()
scsi: qla2xxx: Fix a qla24xx_enable_msix() error path
...
We have a few submissions for 5.2 that depend on fixes merged post
5.1-rc1. Merge the fixes branch into queue.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
mmiowb() is now implied by spin_unlock() on architectures that require
it, so there is no reason to call it from driver code. This patch was
generated using coccinelle:
@mmiowb@
@@
- mmiowb();
and invoked as:
$ for d in drivers include/linux/qed sound; do \
spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done
NOTE: mmiowb() has only ever guaranteed ordering in conjunction with
spin_unlock(). However, pairing each mmiowb() removal in this patch with
the corresponding call to spin_unlock() is not at all trivial, so there
is a small chance that this change may regress any drivers incorrectly
relying on mmiowb() to order MMIO writes between CPUs using lock-free
synchronisation. If you've ended up bisecting to this commit, you can
reintroduce the mmiowb() calls using wmb() instead, which should restore
the old behaviour on all architectures other than some esoteric ia64
systems.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
There are a couple of statements that are incorrectly indented, fix these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During expander reset handling, the driver invokes kernel function
scsi_host_find_tag() to obtain outstanding requests associated with the
scsi host managed by the driver. Driver loops from tag value zero to hba
queue depth to obtain the outstanding scmds. But when blk-mq is enabled,
the block layer may return stale entry for one or more requests. This may
lead to kernel panic if the returned value is inaccessible or the memory
pointed by the returned value is reused.
Reference of upstream discussion:
https://patchwork.kernel.org/patch/10734933/
Instead of calling scsi_host_find_tag() API for each and every smid (smid
is tag +1) from one to shost->can_queue, now driver will call this API (to
obtain the outstanding scmd) only for those smid's which are outstanding at
the driver level.
Driver will determine whether this smid is outstanding at driver level by
looking into it's corresponding MPI request frame, if its MPI request frame
is empty, then it means that this smid is free and does not need to call
scsi_host_find_tag() for it. By doing this, driver will invoke
scsi_host_find_tag() for only those tags which are outstanding at the
driver level.
Driver will check whether particular MPI request frame is empty or not by
looking into the "DevHandle" field. If this field is zero then it means
that this MPI request is empty. For active MPI request DevHandle must be
non-zero.
Also driver will memset the MPI request frame once the corresponding scmd
is processed (i.e. just before calling
scmd->done function).
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Updated driver version to 28.100.00.00, which is equivalent to OOB Phase 9.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
* Reduce the threshold value to 1/4 of the queue depth.
* With this FW can find enough entries to post the Reply Descriptors in the
reply descriptor post queue.
* With module param, user can play with threshold value, the same
irqpoll_weight is used as the budget in processing of reply descriptor
post queues in _base_process_reply_queue.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Driver uses "reply descriptor post queues" in round robin fashion so that
IO's are distributed to all the available reply descriptor post queues
equally. With this each reply descriptor post queue load is balanced.
This is enabled only if CPUs count to MSI-X vector count ratio is X:1
(where X > 1) This improves performance and also fixes soft lockups.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Issue Description:
We have seen cpu lock up issue from fields if system has greater (more than
96) logical cpu count. SAS3.0 controller (Invader series) supports at max
96 msix vector and SAS3.5 product (Ventura) supports at max 128 msix
vectors.
This may be a generic issue (if PCI device supports completion on multiple
reply queues). Let me explain it w.r.t to mpt3sas supported h/w just to
simplify the problem and possible changes to handle such issues. IT HBA
(mpt3sas) supports multiple reply queues in completion path. Driver creates
MSI-x vectors for controller as "min of (FW supported Reply queue, Logical
CPUs)". If submitter is not interrupted via completion on same CPU, there
is a loop in the IO path. This behavior can cause hard/soft CPU lockups, IO
timeout, system sluggish etc.
Example - one CPU (e.g. CPU A) is busy submitting the IOs and another CPU
(e.g. CPU B) is busy with processing the corresponding IO's reply
descriptors from reply descriptor queue upon receiving the interrupts from
HBA. If the CPU A is continuously pumping the IOs then always CPU B (which
is executing the ISR) will see the valid reply descriptors in the reply
descriptor queue and it will be continuously processing those reply
descriptor in a loop without quitting the ISR handler.
Mpt3sas driver will exit ISR handler if it finds unused reply descriptor in
the reply descriptor queue. Since CPU A will be continuously sending the
IOs, CPU B may always see a valid reply descriptor (posted by HBA Firmware
after processing the IO) in the reply descriptor queue. In worst case,
driver will not quit from this loop in the ISR handler. Eventually, CPU
lockup will be detected by watchdog.
Above mentioned behavior is not common if "rq_affinity" set to 2 or
affinity_hint is honored by irqbalance as "exact". If rq_affinity is set
to 2, submitter will be always interrupted via completion on same CPU. If
irqbalance is using "exact" policy, interrupt will be delivered to
submitter CPU.
If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio is not
1:1, we still have exposure of issue explained above and for that we don't
have any solution.
Exposure of soft/hard lockup if CPU count is more than MSI-x supported by
device.
If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if CPU
counts to MSI-x vector count ratio is something like X:1, where X > 1) then
'exact' irqbalance policy OR rq_affinity = 2 won't help to avoid CPU
hard/soft lockups. There won't be any one to one mapping between CPU to
MSI-x vector instead one MSI-x interrupt (or reply descriptor queue) is
shared with group/set of CPUs and there is a possibility of having a loop
in the IO path within that CPU group and may observe lockups.
For example: Consider a system having two NUMA nodes and each node having
four logical CPUs and also consider that number of MSI-x vectors enabled on
the HBA is two, then CPUs count to MSI-x vector count ratio as 4:1. e.g.
MSIx vector 0 is affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0 and
MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node 1.
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 --> MSI-x 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7 -->MSI-x 1
node 1 size: 65536 MB
node 1 free: 63176 MB
Assume that user started an application which uses all the CPUs of NUMA
node 0 for issuing the IOs. Only one CPU from affinity list (it can be any
cpu since this behavior depends upon irqbalance) CPU0 will receive the
interrupts from MSIx vector 0 for all the IOs. Eventually, CPU 0 IO
submission percentage will be decreasing and ISR processing percentage will
be increasing as it is more busy with processing the interrupts. Gradually
IO submission percentage on CPU 0 will be zero and it's ISR processing
percentage will be 100 percentage as IO loop has already formed within the
NUMA node 0, i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy with
submitting the heavy IOs and only CPU 0 is busy in the ISR path as it
always find the valid reply descriptor in the reply descriptor
queue. Eventually, we will observe the hard lockup here.
Chances of occurring of hard/soft lockups are directly proportional to
value of X. If value of X is high, then chances of observing CPU lockups is
high.
Solution: Use IRQ poll interface defined in " irq_poll.c". mpt3sas driver
will execute ISR routine in Softirq context and it will always quit the
loop based on budget provided in IRQ poll interface.
In these scenarios (i.e. where CPUs count to MSI-X vectors count ratio is
X:1 (where X > 1)), IRQ poll interface will avoid CPU hard lockups due to
voluntary exit from the reply queue processing based on budget. Note -
Only one MSI-x vector is busy doing processing.
Irqstat output:
IRQs / 1 second(s)
IRQ# TOTAL NODE0 NODE1 NODE2 NODE3 NAME
44 122871 122871 0 0 0 IR-PCI-MSI-edge mpt3sas0-msix0
45 0 0 0 0 0 IR-PCI-MSI-edge mpt3sas0-msix1
We use this approach only if cpu count is more than FW supported MSI-x
vector
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Separate out processing of reply descriptor post queue from _base_interrupt
to _base_process_reply_queue.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fixed typo in request_desript_type.
request_desript_type --> request_descript_type.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fix the following warnings by adding the proper missing breaks:
drivers/scsi/mpt3sas/mpt3sas_base.c: In function _base_display_OEMs_branding :
drivers/scsi/mpt3sas/mpt3sas_base.c:3548:4: warning: this statement may fall through [-Wimplicit-fallthrough=]
switch (ioc->pdev->subsystem_device) {
^~~~~~
drivers/scsi/mpt3sas/mpt3sas_base.c:3566:3: note: here
case MPI2_MFGPAGE_DEVID_SAS2308_2:
^~~~
drivers/scsi/mpt3sas/mpt3sas_base.c:3567:4: warning: this statement may fall through [-Wimplicit-fallthrough=]
switch (ioc->pdev->subsystem_device) {
^~~~~~
drivers/scsi/mpt3sas/mpt3sas_base.c:3601:3: note: here
case MPI25_MFGPAGE_DEVID_SAS3008:
^~~~
drivers/scsi/mpt3sas/mpt3sas_base.c:3735:4: warning: this statement may fall through [-Wimplicit-fallthrough=]
switch (ioc->pdev->subsystem_device) {
^~~~~~
drivers/scsi/mpt3sas/mpt3sas_base.c:3745:3: note: here
case MPI2_MFGPAGE_DEVID_SAS2308_2:
^~~~
drivers/scsi/mpt3sas/mpt3sas_base.c:3746:4: warning: this statement may fall through [-Wimplicit-fallthrough=]
switch (ioc->pdev->subsystem_device) {
^~~~~~
drivers/scsi/mpt3sas/mpt3sas_base.c:3768:3: note: here
default:
^~~~~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Updated driver version to 27.102.00.00 from 27.101.00.00.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Add Atlas PCIe Switch Management Port device PNPID,
Vendor Id: 0x1000
device Id: 0x00B2
This device is based on MPI 2.6 spec and it exposes one SES device to
accept management commands for the PCIe switch.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Added device ID for NVMe Switch Adapter (Ambrosia).
VID: 0x1000
DID: 0x02B1
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
MPI Endpoint is a PCIe switch based on MPI2.
Renaming device ID macro from MPI2_MFGPAGE_DEVID_SAS2308_MPI_EP to
MPI2_MFGPAGE_DEVID_SWITCH_MPI_EP.
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where
we are expecting to fall through.
Addresses-Coverity-ID: 1475400 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
With commit 09c2f95ad4 ("scsi: mpt3sas: Swap I/O memory read value back
to cpu endianness"), 64bit writes in _base_writeq() were rewritten to use
__raw_writeq() instad of writeq().
This introduced a bug apparent on powerpc64 systems such as the Raptor
Talos II that causes the HBA to drop from the PCIe bus under heavy load and
being reinitialized after a couple of seconds.
It can easily be triggered on affacted systems by using something like
fio --name=random-write --iodepth=4 --rw=randwrite --bs=4k --direct=0 \
--size=128M --numjobs=64 --end_fsync=1
fio --name=random-write --iodepth=4 --rw=randwrite --bs=64k --direct=0 \
--size=128M --numjobs=64 --end_fsync=1
a couple of times. In my case I tested it on both a ZFS raidz2 and a btrfs
raid6 using LSI 9300-8i and 9400-8i controllers.
The fix consists in resembling the write ordering of writeq() by adding a
mandatory write memory barrier before device access and a compiler barrier
afterwards. The additional MMIO barrier is superfluous.
Signed-off-by: Stephan Günther <moepi@moepi.net>
Reported-by: Matt Corallo <linux@bluematt.me>
Acked-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Most SCSI drivers want to enable "clustering", that is merging of
segments so that they might span more than a single page. Remove the
ENABLE_CLUSTERING define, and require drivers to explicitly set
DISABLE_CLUSTERING to disable this feature.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Update driver version from 27.100.00.00 to 27.101.00.00.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Use ioc->base_readl to restrict the readl retries to only Aero controllers.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Sometimes Aero controllers appears to be returning bad data (0) for
doorbell register read and if retries are performed immediately after the
bad read, they return good data.
Workaround is added to retry read from doorbell registers for maximum three
times if driver get the zero. Added functions base_readl_aero for Aero IOC
and base_readl for gen35 and other controllers.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Adding flag "is_aero_ioc" to differentiate aero based controllers from
other gen35 controllers.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There is a spelling mistake in some description text, fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Display following warning message only upon detection of configurable
secure type controllers.
"HBA is in Configurable Secure mode"
[mkp: typos]
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Modify driver version to 27.100.00.00 (which is equivalent to PH8 OOB
driver)
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently driver is modifying both current & NVRAM/persistent data in
Manufacturing page11. Driver should change only current copy of
Manufacturing page11. It should not modify the persistent data.
So removed the section of code where driver is modifying the persistent
data of Manufacturing page11.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If EEDPTagMode field in manufacturing page11 is set then unset it. This is
needed to fix a hardware bug only in SAS3/SAS2 cards. So, skipping
EEDPTagMode changes in Manufacturing page11 for SAS 3.5 controllers.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This is to fix SYNC CACHE and START STOP command failures with
DID_NO_CONNECT during driver unload.
In driver's IO submission patch (i.e. in driver's .queuecommand()) driver
won't allow any SCSI commands to the IOC when ioc->remove_host flag is set
and hence SYNC CACHE commands which are issued to the target drives (where
write cache is enabled) during driver unload time is failed with
DID_NO_CONNECT status.
Now modified the driver to allow SYNC CACHE and START STOP commands to IOC,
even when remove_host flag is set.
Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>