Commit Graph

2765 Commits

Author SHA1 Message Date
Leon Romanovsky
deaeb67c9d RDMA/core: Track device memory MRs
[ Upstream commit b47a98efa97889c5b16d17e77eed3dc4500674eb ]

Device memory (DM) are registered as MR during initialization flow, these
MRs were not tracked by resource tracker and had res->valid set as a
false. Update the code to manage them too.

Before this change:
$ ibv_rc_pingpong -j &
$ rdma res show mr <-- shows nothing

After this change:
$ ibv_rc_pingpong -j &
$ rdma res show mr
dev ibp0s9 mrn 0 mrlen 4096 pdn 3 pid 734 comm ibv_rc_pingpong

Fixes: be934cca9e ("IB/uverbs: Add device memory registration ioctl support")
Link: https://lore.kernel.org/r/20201117070148.1974114-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:24 +01:00
Jason Gunthorpe
f85d05c0a5 RDMA/cma: Fix deadlock on &lock in rdma_cma_listen_on_all() error unwind
[ Upstream commit dd37d2f59eb839d51b988f6668ce5f0d533b23fd ]

rdma_detroy_id() cannot be called under &lock - we must instead keep the
error'd ID around until &lock can be released, then destroy it.

This is complicated by the usual way listen IDs are destroyed through
cma_process_remove() which can run at any time and will asynchronously
destroy the same ID.

Remove the ID from visiblity of cma_process_remove() before going down the
destroy path outside the locking.

Fixes: c80a0c52d85c ("RDMA/cma: Add missing error handling of listen_id")
Link: https://lore.kernel.org/r/20201118133756.GK244516@ziepe.ca
Reported-by: syzbot+1bc48bf7f78253f664a9@syzkaller.appspotmail.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:22 +01:00
Leon Romanovsky
70ba8b1697 RDMA/cma: Add missing error handling of listen_id
[ Upstream commit c80a0c52d85c49a910d0dc0e342e8d8898677dc0 ]

Don't silently continue if rdma_listen() fails but destroy previously
created CM_ID and return an error to the caller.

Fixes: d02d1f5359 ("RDMA/cma: Fix deadlock destroying listen requests")
Link: https://lore.kernel.org/r/20201104144008.3808124-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:12 +01:00
Jing Xiangfeng
e7c49c634a RDMA/core: Fix error return in _ib_modify_qp()
[ Upstream commit 5333499c6014224756e97fa1a1047dfa592d76d3 ]

Fix to return error code PTR_ERR() from the error handling case instead of
0.

Fixes: 51aab12631 ("RDMA/core: Get xmit slave for LAG")
Link: https://lore.kernel.org/r/20201016075845.129562-1-jingxiangfeng@huawei.com
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:52:58 +01:00
Leon Romanovsky
340b940ea0 RDMA/cm: Fix an attempt to use non-valid pointer when cleaning timewait
If cm_create_timewait_info() fails, the timewait_info pointer will contain
an error value and will be used in cm_remove_remote() later.

  general protection fault, probably for non-canonical address 0xdffffc0000000024: 0000 [#1] SMP KASAN PTI
  KASAN: null-ptr-deref in range [0×0000000000000120-0×0000000000000127]
  CPU: 2 PID: 12446 Comm: syz-executor.3 Not tainted 5.10.0-rc5-5d4c0742a60e #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:cm_remove_remote.isra.0+0x24/0×170 drivers/infiniband/core/cm.c:978
  Code: 84 00 00 00 00 00 41 54 55 53 48 89 fb 48 8d ab 2d 01 00 00 e8 7d bf 4b fe 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 04 02 48 89 ea 83 e2 07 38 d0 7f 08 84 c0 0f 85 fc 00 00 00
  RSP: 0018:ffff888013127918 EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: fffffffffffffff4 RCX: ffffc9000a18b000
  RDX: 0000000000000024 RSI: ffffffff82edc573 RDI: fffffffffffffff4
  RBP: 0000000000000121 R08: 0000000000000001 R09: ffffed1002624f1d
  R10: 0000000000000003 R11: ffffed1002624f1c R12: ffff888107760c70
  R13: ffff888107760c40 R14: fffffffffffffff4 R15: ffff888107760c9c
  FS:  00007fe1ffcc1700(0000) GS:ffff88811a600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000001b2ff21000 CR3: 000000010f504001 CR4: 0000000000370ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   cm_destroy_id+0x189/0×15b0 drivers/infiniband/core/cm.c:1155
   cma_connect_ib drivers/infiniband/core/cma.c:4029 [inline]
   rdma_connect_locked+0x1100/0×17c0 drivers/infiniband/core/cma.c:4107
   rdma_connect+0x2a/0×40 drivers/infiniband/core/cma.c:4140
   ucma_connect+0x277/0×340 drivers/infiniband/core/ucma.c:1069
   ucma_write+0x236/0×2f0 drivers/infiniband/core/ucma.c:1724
   vfs_write+0x220/0×830 fs/read_write.c:603
   ksys_write+0x1df/0×240 fs/read_write.c:658
   do_syscall_64+0x33/0×40 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: a977049dac ("[PATCH] IB: Add the kernel CM implementation")
Link: https://lore.kernel.org/r/20201204064205.145795-1-leon@kernel.org
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reported-by: Amit Matityahu <mitm@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-09 15:51:35 -04:00
Gal Pressman
e432c04c17 RDMA/core: Fix empty gid table for non IB/RoCE devices
The query_gid_table ioctl skips non IB/RoCE ports, which as a result
returns an empty gid table for devices such as EFA which have a GID table,
but are not IB/RoCE.

Fixes: c4b4d548fa ("RDMA/core: Introduce new GID table query API")
Link: https://lore.kernel.org/r/20201206153238.34878-1-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-07 16:32:04 -04:00
Jason Gunthorpe
eb73060b97 RDMA/cm: Make the local_id_table xarray non-irq
The xarray is never mutated from an IRQ handler, only from work queues
under a spinlock_irq. Thus there is no reason for it be an IRQ type
xarray.

This was copied over from the original IDR code, but the recent rework put
the xarray inside another spinlock_irq which will unbalance the unlocking.

Fixes: c206f8bad1 ("RDMA/cm: Make it clearer how concurrency works in cm_req_handler()")
Link: https://lore.kernel.org/r/0-v1-808b6da3bd3f+1857-cm_xarray_no_irq_jgg@nvidia.com
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-12 12:31:27 -04:00
Jason Gunthorpe
071ba4cc55 RDMA: Add rdma_connect_locked()
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the
handler triggers a completion and another thread does rdma_connect() or
the handler directly calls rdma_connect().

In all cases rdma_connect() needs to hold the handler_mutex, but when
handler's are invoked this is already held by the core code. This causes
ULPs using the 2nd method to deadlock.

Provide a rdma_connect_locked() and have all ULPs call it from their
handlers.

Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com
Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Fixes: 2a7cec5381 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28 09:14:49 -03:00
Gal Pressman
7d66a71488 RDMA/uverbs: Fix false error in query gid IOCTL
Some drivers (such as EFA) have a GID table, but aren't IB/RoCE devices.
Remove the unnecessary rdma_ib_or_roce() check.

This fixes rdma-core failures for EFA when it uses the new ioctl interface
for querying the GID table.

Fixes: 9f85cbe50a ("RDMA/uverbs: Expose the new GID query API to user space")
Link: https://lore.kernel.org/r/20201026082621.32463-1-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-26 19:29:26 -03:00
Linus Torvalds
a1e16bc7d5 RDMA 5.10 pull request
The typical set of driver updates across the subsystem:
 
  - Driver minor changes and bug fixes for mlx5, efa, rxe, vmw_pvrdma, hns,
    usnic, qib, qedr, cxgb4, hns, bnxt_re
 
  - Various rtrs fixes and updates
 
  - Bug fix for mlx4 CM emulation for virtualization scenarios where MRA
    wasn't working right
 
  - Use tracepoints instead of pr_debug in the CM code
 
  - Scrub the locking in ucma and cma to close more syzkaller bugs
 
  - Use tasklet_setup in the subsystem
 
  - Revert the idea that 'destroy' operations are not allowed to fail at
    the driver level. This proved unworkable from a HW perspective.
 
  - Revise how the umem API works so drivers make fewer mistakes using it
 
  - XRC support for qedr
 
  - Convert uverbs objects RWQ and MW to new the allocation scheme
 
  - Large queue entry sizes for hns
 
  - Use hmm_range_fault() for mlx5 On Demand Paging
 
  - uverbs APIs to inspect the GID table instead of sysfs
 
  - Move some of the RDMA code for building large page SGLs into
    lib/scatterlist
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl+J37MACgkQOG33FX4g
 mxrKfRAAnIecwdE8df0yvVU5k0Eg6qVjMy9MMHq4va9m7g6GpUcNNI0nIlOASxH2
 l+9vnUQS3ebgsPeECaDYzEr0hh/u53+xw2g4WV5ts/hE8KkQ6erruXb9kasCe8yi
 5QWJ9K36T3c03Cd3EeH6JVtytAxuH42ombfo9BkFLPVyfG/R2tsAzvm5pVi73lxk
 46wtU1Bqi4tsLhyCbifn1huNFGbHp08OIBPAIKPUKCA+iBRPaWS+Dpi+93h3g3Bp
 oJwDhL9CBCGcHM+rKWLzek3Dy87FnQn7R1wmTpUFwkK+4AH3U/XazivhX035w1vL
 YJyhakVU0kosHlX9hJTNKDHJGkt0YEV2mS8dxAuqilFBtdnrVszb5/MirvlzC310
 /b5xCPSEusv9UVZV0G4zbySVNA9knZ4YaRiR3VDVMLKl/pJgTOwEiHIIx+vs3ejk
 p8GRWa1SjXw5LfZEQcq39J689ljt6xjCTonyuBSv7vSQq5v8pjBxvHxiAe2FIa2a
 ZyZeSCYoSh0SwJQukO2VO7aprhHP3TcCJ/987+X03LQ8tV2VWPktHqm62YCaDcOl
 fgiQuQdPivRjDDkJgMfDWDGKfZeHoWLKl5XsJhWByt0lablVrsvc+8ylUl1UI7gI
 16hWB/Qtlhfwg10VdApn+aOFpIS+s5P4XIp8ik57MZO+VeJzpmE=
 =LKpl
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "A usual cycle for RDMA with a typical mix of driver and core subsystem
  updates:

   - Driver minor changes and bug fixes for mlx5, efa, rxe, vmw_pvrdma,
     hns, usnic, qib, qedr, cxgb4, hns, bnxt_re

   - Various rtrs fixes and updates

   - Bug fix for mlx4 CM emulation for virtualization scenarios where
     MRA wasn't working right

   - Use tracepoints instead of pr_debug in the CM code

   - Scrub the locking in ucma and cma to close more syzkaller bugs

   - Use tasklet_setup in the subsystem

   - Revert the idea that 'destroy' operations are not allowed to fail
     at the driver level. This proved unworkable from a HW perspective.

   - Revise how the umem API works so drivers make fewer mistakes using
     it

   - XRC support for qedr

   - Convert uverbs objects RWQ and MW to new the allocation scheme

   - Large queue entry sizes for hns

   - Use hmm_range_fault() for mlx5 On Demand Paging

   - uverbs APIs to inspect the GID table instead of sysfs

   - Move some of the RDMA code for building large page SGLs into
     lib/scatterlist"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (191 commits)
  RDMA/ucma: Fix use after free in destroy id flow
  RDMA/rxe: Handle skb_clone() failure in rxe_recv.c
  RDMA/rxe: Move the definitions for rxe_av.network_type to uAPI
  RDMA: Explicitly pass in the dma_device to ib_register_device
  lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED values
  IB/mlx4: Convert rej_tmout radix-tree to XArray
  RDMA/rxe: Fix bug rejecting all multicast packets
  RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()
  RDMA/rxe: Remove duplicate entries in struct rxe_mr
  IB/hfi,rdmavt,qib,opa_vnic: Update MAINTAINERS
  IB/rdmavt: Fix sizeof mismatch
  MAINTAINERS: CISCO VIC LOW LATENCY NIC DRIVER
  RDMA/bnxt_re: Fix sizeof mismatch for allocation of pbl_tbl.
  RDMA/bnxt_re: Use rdma_umem_for_each_dma_block()
  RDMA/umem: Move to allocate SG table from pages
  lib/scatterlist: Add support in dynamic allocation of SG table from pages
  tools/testing/scatterlist: Show errors in human readable form
  tools/testing/scatterlist: Rejuvenate bit-rotten test
  RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
  RDMA/uverbs: Expose the new GID query API to user space
  ...
2020-10-17 11:18:18 -07:00
Jann Horn
4d45e75a99 mm: remove the now-unnecessary mmget_still_valid() hack
The preceding patches have ensured that core dumping properly takes the
mmap_lock.  Thanks to that, we can now remove mmget_still_valid() and all
its users.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:22 -07:00
Maor Gottlieb
c7a198c700 RDMA/ucma: Fix use after free in destroy id flow
ucma_free_ctx() should call to __destroy_id() on all the connection requests
that have not been delivered to user space. Currently it calls on the
context itself and cause to use after free.

Fixes the trace:

   BUG: Unable to handle kernel data access on write at 0x5deadbeef0000108
   Faulting instruction address: 0xc0080000002428f4
   Oops: Kernel access of bad area, sig: 11 [#1]
   Call Trace:
   [c000000207f2b680] [c00800000024280c] .__destroy_id+0x28c/0x610 [rdma_ucm] (unreliable)
   [c000000207f2b750] [c0080000002429c4] .__destroy_id+0x444/0x610 [rdma_ucm]
   [c000000207f2b820] [c008000000242c24] .ucma_close+0x94/0xf0 [rdma_ucm]
   [c000000207f2b8c0] [c00000000046fbdc] .__fput+0xac/0x330
   [c000000207f2b960] [c00000000015d48c] .task_work_run+0xbc/0x110
   [c000000207f2b9f0] [c00000000012fb00] .do_exit+0x430/0xc50
   [c000000207f2bae0] [c0000000001303ec] .do_group_exit+0x5c/0xd0
   [c000000207f2bb70] [c000000000144a34] .get_signal+0x194/0xe30
   [c000000207f2bc60] [c00000000001f6b4] .do_notify_resume+0x124/0x470
   [c000000207f2bd60] [c000000000032484] .interrupt_exit_user_prepare+0x1b4/0x240
   [c000000207f2be20] [c000000000010034] interrupt_return+0x14/0x1c0

Rename listen_ctx to conn_req_ctx as the poor name was the cause of this
bug.

Fixes: a1d33b70db ("RDMA/ucma: Rework how new connections are passed through event delivery")
Link: https://lore.kernel.org/r/20201012045600.418271-4-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16 14:07:08 -03:00
Jason Gunthorpe
e0477b34d9 RDMA: Explicitly pass in the dma_device to ib_register_device
The code in setup_dma_device has become rather convoluted, move all of
this to the drivers. Drives now pass in a DMA capable struct device which
will be used to setup DMA, or drivers must fully configure the ibdev for
DMA and pass in NULL.

Other than setting the masks in rvt all drivers were doing this already
anyhow.

mthca, mlx4 and mlx5 were already setting up maximum DMA segment size for
DMA based on their hardweare limits in:
__mthca_init_one()
  dma_set_max_seg_size (1G)

__mlx4_init_one()
  dma_set_max_seg_size (1G)

mlx5_pci_init()
  set_dma_caps()
    dma_set_max_seg_size (2G)

Other non software drivers (except usnic) were extended to UINT_MAX [1, 2]
instead of 2G as was before.

[1] https://lore.kernel.org/linux-rdma/20200924114940.GE9475@nvidia.com/
[2] https://lore.kernel.org/linux-rdma/20200924114940.GE9475@nvidia.com/

Link: https://lore.kernel.org/r/20201008082752.275846-1-leon@kernel.org
Link: https://lore.kernel.org/r/6b2ed339933d066622d5715903870676d8cc523a.1602590106.git.mchehab+huawei@kernel.org
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16 13:53:46 -03:00
Jason Gunthorpe
16e7483e6f Merge branch 'dynamic_sg' into rdma.git for-next
From Maor Gottlieb says:

====================
This series extends __sg_alloc_table_from_pages to allow chaining of new
pages to an already initialized SG table.

This allows for drivers to utilize the optimization of merging contiguous
pages without a need to pre allocate all the pages and hold them in a very
large temporary buffer prior to the call to SG table initialization.

The last patch changes the Infiniband core to use the new API. It removes
duplicate functionality from the code and benefits from the optimization
of allocating dynamic SG table from pages.

In huge pages system of 2MB page size, without this change, the SG table
would contain x512 SG entries.
====================

* branch 'dynamic_sg':
  RDMA/umem: Move to allocate SG table from pages
  lib/scatterlist: Add support in dynamic allocation of SG table from pages
  tools/testing/scatterlist: Show errors in human readable form
  tools/testing/scatterlist: Rejuvenate bit-rotten test
2020-10-16 12:40:58 -03:00
Maor Gottlieb
0c16d9635e RDMA/umem: Move to allocate SG table from pages
Remove the implementation of ib_umem_add_sg_table and instead
call to __sg_alloc_table_from_pages which already has the logic to
merge contiguous pages.

Besides that it removes duplicated functionality, it reduces the
memory consumption of the SG table significantly. Prior to this
patch, the SG table was allocated in advance regardless consideration
of contiguous pages.

In huge pages system of 2MB page size, without this change, the SG table
would contain x512 SG entries.
E.g. for 100GB memory registration:

	 Number of entries	Size
Before 	      26214400          600.0MB
After            51200		  1.2MB

Link: https://lore.kernel.org/r/20201004154340.1080481-5-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-05 20:45:45 -03:00
Linus Torvalds
165563c050 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Make sure SKB control block is in the proper state during IPSEC
    ESP-in-TCP encapsulation. From Sabrina Dubroca.

 2) Various kinds of attributes were not being cloned properly when we
    build new xfrm_state objects from existing ones. Fix from Antony
    Antony.

 3) Make sure to keep BTF sections, from Tony Ambardar.

 4) TX DMA channels need proper locking in lantiq driver, from Hauke
    Mehrtens.

 5) Honour route MTU during forwarding, always. From Maciej
    Żenczykowski.

 6) Fix races in kTLS which can result in crashes, from Rohit
    Maheshwari.

 7) Skip TCP DSACKs with rediculous sequence ranges, from Priyaranjan
    Jha.

 8) Use correct address family in xfrm state lookups, from Herbert Xu.

 9) A bridge FDB flush should not clear out user managed fdb entries
    with the ext_learn flag set, from Nikolay Aleksandrov.

10) Fix nested locking of netdev address lists, from Taehee Yoo.

11) Fix handling of 32-bit DATA_FIN values in mptcp, from Mat Martineau.

12) Fix r8169 data corruptions on RTL8402 chips, from Heiner Kallweit.

13) Don't free command entries in mlx5 while comp handler could still be
    running, from Eran Ben Elisha.

14) Error flow of request_irq() in mlx5 is busted, due to an off by one
    we try to free and IRQ never allocated. From Maor Gottlieb.

15) Fix leak when dumping netlink policies, from Johannes Berg.

16) Sendpage cannot be performed when a page is a slab page, or the page
    count is < 1. Some subsystems such as nvme were doing so. Create a
    "sendpage_ok()" helper and use it as needed, from Coly Li.

17) Don't leak request socket when using syncookes with mptcp, from
    Paolo Abeni.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
  net/core: check length before updating Ethertype in skb_mpls_{push,pop}
  net: mvneta: fix double free of txq->buf
  net_sched: check error pointer in tcf_dump_walker()
  net: team: fix memory leak in __team_options_register
  net: typhoon: Fix a typo Typoon --> Typhoon
  net: hinic: fix DEVLINK build errors
  net: stmmac: Modify configuration method of EEE timers
  tcp: fix syn cookied MPTCP request socket leak
  libceph: use sendpage_ok() in ceph_tcp_sendpage()
  scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map()
  drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage()
  tcp: use sendpage_ok() to detect misused .sendpage
  nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage()
  net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send
  net: introduce helper sendpage_ok() in include/linux/net.h
  net: usb: pegasus: Proper error handing when setting pegasus' MAC address
  net: core: document two new elements of struct net_device
  netlink: fix policy dump leak
  net/mlx5e: Fix race condition on nhe->n pointer in neigh update
  net/mlx5e: Fix VLAN create flow
  ...
2020-10-05 11:27:14 -07:00
Avihai Horon
9f85cbe50a RDMA/uverbs: Expose the new GID query API to user space
Expose the query GID table and entry API to user space by adding two new
methods and method handlers to the device object.

This API provides a faster way to query a GID table using single call and
will be used in libibverbs to improve current approach that requires
multiple calls to open, close and read multiple sysfs files for a single
GID table entry.

Link: https://lore.kernel.org/r/20200923165015.2491894-5-leon@kernel.org
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 21:20:11 -03:00
Avihai Horon
c4b4d548fa RDMA/core: Introduce new GID table query API
Introduce rdma_query_gid_table which enables querying all the GID tables
of a given device and copying the attributes of all valid GID entries to a
provided buffer.

This API provides a faster way to query a GID table using single call and
will be used in libibverbs to improve current approach that requires
multiple calls to open, close and read multiple sysfs files for a single
GID table entry.

Link: https://lore.kernel.org/r/20200923165015.2491894-4-leon@kernel.org
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 21:20:11 -03:00
Avihai Horon
1c15b4f2a4 RDMA/core: Modify enum ib_gid_type and enum rdma_network_type
Separate IB_GID_TYPE_IB and IB_GID_TYPE_ROCE to two different values, so
enum ib_gid_type will match the gid types of the new query GID table API
which will be introduced in the following patches.

This change in enum ib_gid_type requires to change also enum
rdma_network_type by separating RDMA_NETWORK_IB and RDMA_NETWORK_ROCE_V1
values.

Link: https://lore.kernel.org/r/20200923165015.2491894-3-leon@kernel.org
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 21:20:11 -03:00
Avihai Horon
3ff4de8f60 RDMA/core: Change rdma_get_gid_attr returned error code
Change the error code returned from rdma_get_gid_attr when the GID entry
is invalid but the GID index is in the gid table size range to -ENODATA
instead of -EINVAL.

This change is done in order to provide a more accurate error reporting to
be used by the new GID query API in user space. Nevertheless, -EINVAL is
still returned from sysfs in the aforementioned case to maintain
compatibility with user space that expects -EINVAL.

Link: https://lore.kernel.org/r/20200923165015.2491894-2-leon@kernel.org
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 21:20:11 -03:00
Rikard Falkeborn
42d5179c89 RDMA/core: Constify struct attribute_group
The only usage of the pma_table field in the ib_port struct is to pass its
address to sysfs_create_group() and sysfs_remove_group(). Make it const to
make it possible to constify a couple of static struct
attribute_group. This allows the compiler to put them in read-only memory.

Link: https://lore.kernel.org/r/20200930224004.24279-2-rikard.falkeborn@gmail.com
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 20:44:51 -03:00
Yishai Hadas
8bfafde086 IB/core: Enable ODP sync without faulting
Enable ODP sync without faulting, this improves performance by reducing
the number of page faults in the system.

The gain from this option is that the device page table can be aligned
with the presented pages in the CPU page table without causing page
faults.

As of that, the overhead on data path from hardware point of view to
trigger a fault which end-up by calling the driver to bring the pages
will be dropped.

Link: https://lore.kernel.org/r/20200930163828.1336747-3-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 16:44:05 -03:00
Yishai Hadas
36f30e486d IB/core: Improve ODP to use hmm_range_fault()
Move to use hmm_range_fault() instead of get_user_pags_remote() to improve
performance in a few aspects:

This includes:
- Dropping the need to allocate and free memory to hold its output

- No need any more to use put_page() to unpin the pages

- The logic to detect contiguous pages is done based on the returned
  order, no need to run per page and evaluate.

In addition, moving to use hmm_range_fault() enables to reduce page faults
in the system with it's snapshot mode, this will be introduced in next
patches from this series.

As part of this, cleanup some flows and use the required data structures
to work with hmm_range_fault().

Link: https://lore.kernel.org/r/20200930163828.1336747-2-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 16:39:54 -03:00
Jason Gunthorpe
2ee9bf346f RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()
This three thread race can result in the work being run once the callback
becomes NULL:

       CPU1                 CPU2                   CPU3
 netevent_callback()
                     process_one_req()       rdma_addr_cancel()
                      [..]
     spin_lock_bh()
  	set_timeout()
     spin_unlock_bh()

						spin_lock_bh()
						list_del_init(&req->list);
						spin_unlock_bh()

		     req->callback = NULL
		     spin_lock_bh()
		       if (!list_empty(&req->list))
                         // Skipped!
		         // cancel_delayed_work(&req->work);
		     spin_unlock_bh()

		    process_one_req() // again
		     req->callback() // BOOM
						cancel_delayed_work_sync()

The solution is to always cancel the work once it is completed so any
in between set_timeout() does not result in it running again.

Cc: stable@vger.kernel.org
Fixes: 44e75052bc ("RDMA/rdma_cm: Make rdma_addr_cancel into a fence")
Link: https://lore.kernel.org/r/20200930072007.1009692-1-leon@kernel.org
Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-30 15:29:05 -03:00
Jason Gunthorpe
a6f0b08dba RDMA/core: Remove ucontext->closing
Nothing reads this any more, and the reason for its existence has passed
due to the deferred fput() scheme.

Fixes: 8ea1f989aa ("drivers/IB,usnic: reduce scope of mmap_sem")
Link: https://lore.kernel.org/r/0-v1-df64ff042436+42-uctx_closing_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-30 15:27:19 -03:00
Leon Romanovsky
5807bb3205 RDMA/core: Align write and ioctl checks of QP types
The ioctl flow checks that the user provides only a supported list of QP
types, while write flow didn't do it and relied on the driver to check
it. Align those flows to fail as early as possible.

Link: https://lore.kernel.org/r/20200926102450.2966017-8-leon@kernel.org
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-29 13:11:06 -03:00
Taehee Yoo
eff7423365 net: core: introduce struct netdev_nested_priv for nested interface infrastructure
Functions related to nested interface infrastructure such as
netdev_walk_all_{ upper | lower }_dev() pass both private functions
and "data" pointer to handle their own things.
At this point, the data pointer type is void *.
In order to make it easier to expand common variables and functions,
this new netdev_nested_priv structure is added.

In the following patch, a new member variable will be added into this
struct to fix the lockdep issue.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28 15:00:15 -07:00
Leon Romanovsky
b09c4d7012 RDMA/restrack: Improve readability in task name management
Use rdma_restrack_set_name() and rdma_restrack_parent_name() instead of
tricky uses of rdma_restrack_attach_task()/rdma_restrack_uadd().

This uniformly makes all restracks add'd using rdma_restrack_add().

Link: https://lore.kernel.org/r/20200922091106.2152715-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:35 -03:00
Leon Romanovsky
c34a23c28c RDMA/restrack: Simplify restrack tracking in kernel flows
Have a single rdma_restrack_add() that adds an entry, there is no reason
to split the user/kernel here, the rdma_restrack_set_task() is responsible
for this difference.

This patch prepares the code to the future requirement of making restrack
is mandatory for managing ib objects.

Link: https://lore.kernel.org/r/20200922091106.2152715-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:35 -03:00
Leon Romanovsky
13ef5539de RDMA/restrack: Count references to the verbs objects
Refactor the restrack code to make sure the kref inside the restrack entry
properly kref's the object in which it is embedded. This slight change is
needed for future conversions of MR and QP which are refcounted before the
release and kfree.

The ideal flow from ib_core perspective as follows:
* Allocate ib_* structure with rdma_zalloc_*.
* Set everything that is known to ib_core to that newly created object.
* Initialize kref with restrack help
* Call to driver specific allocation functions.
* Insert into restrack DB
....
* Return and release restrack with restrack_put.

Largely this means a rdma_restrack_new() should be called near allocating
the containing structure.

Link: https://lore.kernel.org/r/20200922091106.2152715-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:35 -03:00
Leon Romanovsky
60aaeffa36 RDMA/cma: Delete from restrack DB after successful destroy
Update the code to have similar destroy pattern like other IB objects.

This change create asymmetry to the rdma_id_private create flow to make
sure that memory is managed by restrack.

Link: https://lore.kernel.org/r/20200922091106.2152715-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:34 -03:00
Jason Gunthorpe
f5449e7480 RDMA/ucma: Rework ucma_migrate_id() to avoid races with destroy
ucma_destroy_id() assumes that all things accessing the ctx will do so via
the xarray. This assumption violated only in the case the FD is being
closed, then the ctx is reached via the ctx_list. Normally this is OK
since ucma_destroy_id() cannot run concurrenty with release(), however
with ucma_migrate_id() is involved this can violated as the close of the
2nd FD can run concurrently with destroy on the first:

                CPU0                      CPU1
        ucma_destroy_id(fda)
                                  ucma_migrate_id(fda -> fdb)
                                       ucma_get_ctx()
        xa_lock()
         _ucma_find_context()
         xa_erase()
        xa_unlock()
                                       xa_lock()
                                        ctx->file = new_file
                                        list_move()
                                       xa_unlock()
                                      ucma_put_ctx()

                                   ucma_close(fdb)
                                      _destroy_id()
                                      kfree(ctx)

        _destroy_id()
          wait_for_completion()
          // boom, ctx was freed

The ctx->file must be modified under the handler and xa_lock, and prior to
modification the ID must be rechecked that it is still reachable from
cur_file, ie there is no parallel destroy or migrate.

To make this work remove the double locking and streamline the control
flow. The double locking was obsoleted by the handler lock now directly
preventing new uevents from being created, and the ctx_list cannot be read
while holding fgets on both files. Removing the double locking also
removes the need to check for the same file.

Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Link: https://lore.kernel.org/r/0-v1-05c5a4090305+3a872-ucma_syz_migrate_jgg@nvidia.com
Reported-and-tested-by: syzbot+cc6fc752b3819e082d0c@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18 20:54:01 -03:00
Jason Gunthorpe
5dee5872f8 Merge branch 'mlx5_active_speed' into rdma.git for-next
Leon Romanovsky says:

====================
IBTA declares speed as 16 bits, but kernel stores it in u8. This series
fixes in-kernel declaration while keeping external interface intact.
====================

Based on the mlx5-next branch at
     git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.

* branch 'mlx5_active_speed':
  RDMA: Fix link active_speed size
  RDMA/mlx5: Delete duplicated mlx5_ptys_width enum
  net/mlx5: Refactor query port speed functions
2020-09-18 10:31:45 -03:00
Aharon Landau
376ceb31ff RDMA: Fix link active_speed size
According to the IB spec active_speed size should be u16 and not u8 as
before. Changing it to allow further extensions in offered speeds.

Link: https://lore.kernel.org/r/20200917090223.1018224-4-leon@kernel.org
Signed-off-by: Aharon Landau <aharonl@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18 10:31:24 -03:00
Leon Romanovsky
c0a6b5ecc5 RDMA: Convert RWQ table logic to ib_core allocation scheme
Move struct ib_rwq_ind_table allocation to ib_core.

Link: https://lore.kernel.org/r/20200902081623.746359-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 14:04:33 -03:00
Leon Romanovsky
d18bb3e152 RDMA: Clean MW allocation and free flows
Move allocation and destruction of memory windows under ib_core
responsibility and clean drivers to ensure that no updates to MW
ib_core structures are done in driver layer.

Link: https://lore.kernel.org/r/20200902081623.746359-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 14:04:32 -03:00
Jason Gunthorpe
b5de0c60cc RDMA/cma: Fix use after free race in roce multicast join
The roce path triggers a work queue that continues to touch the id_priv
but doesn't hold any reference on it. Futher, unlike in the IB case, the
work queue is not fenced during rdma_destroy_id().

This can trigger a use after free if a destroy is triggered in the
incredibly narrow window after the queue_work and the work starting and
obtaining the handler_mutex.

The only purpose of this work queue is to run the ULP event callback from
the standard context, so switch the design to use the existing
cma_work_handler() scheme. This simplifies quite a lot of the flow:

- Use the cma_work_handler() callback to launch the work for roce. This
  requires generating the event synchronously inside the
  rdma_join_multicast(), which in turn means the dummy struct
  ib_sa_multicast can become a simple stack variable.

- cm_work_handler() used the id_priv kref, so we can entirely eliminate
  the kref inside struct cma_multicast. Since the cma_multicast never
  leaks into an unprotected work queue the kfree can be done at the same
  time as for IB.

- Eliminating the general multicast.ib requires using cma_set_mgid() in a
  few places to recompute the mgid.

Fixes: 3c86aa70bf ("RDMA/cm: Add RDMA CM support for IBoE devices")
Link: https://lore.kernel.org/r/20200902081122.745412-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:25 -03:00
Jason Gunthorpe
3788d2997b RDMA/cma: Consolidate the destruction of a cma_multicast in one place
Two places were open coding this sequence, and also pull in
cma_leave_roce_mc_group() which was called only once.

Link: https://lore.kernel.org/r/20200902081122.745412-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
1bb5091def RDMA/cma: Remove dead code for kernel rdmacm multicast
There is no kernel user of RDMA CM multicast so this code managing the
multicast subscription of the kernel-only internal QP is dead. Remove it.

This makes the bug fixes in the next patches much simpler.

Link: https://lore.kernel.org/r/20200902081122.745412-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
7e85bcda8b RDMA/cma: Combine cma_ndev_work with cma_work
These are the same thing, except that cma_ndev_work doesn't have a state
transition. Signal no state transition by setting old_state and new_state
== 0.

In all cases the handler function should not be called once
rdma_destroy_id() has progressed passed setting the state.

Link: https://lore.kernel.org/r/20200902081122.745412-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
5cfbf9291e RDMA/cma: Remove cma_comp()
The only place that still uses it is rdma_join_multicast() which is only
doing a sanity check that the caller hasn't done something wrong and
doesn't need the spinlock.

At least in the case of rdma_join_multicast() the information it needs
will remain until the ID is destroyed once it enters these
states. Similarly there is no reason to check for these specific states in
the handler callback, instead use the usual check for a destroyed id under
the handler_mutex.

Link: https://lore.kernel.org/r/20200902081122.745412-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
d490ee52f0 RDMA/cma: Fix locking for the RDMA_CM_LISTEN state
There is a strange unlocked read of the ID state when checking for
reuseaddr. This is because an ID cannot be reusable once it becomes a
listening ID. Instead of using the state to exclude reuse, just clear it
as part of rdma_listen()'s flow to convert reusable into not reusable.

Once a ID goes to listen there is no way back out, and the only use of
reusable is on the bind_list check.

Finally, update the checks under handler_mutex to use READ_ONCE and audit
that once RDMA_CM_LISTEN is observed in a req callback it is stable under
the handler_mutex.

Link: https://lore.kernel.org/r/20200902081122.745412-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
732d41c545 RDMA/cma: Make the locking for automatic state transition more clear
Re-organize things so the state variable is not read unlocked. The first
attempt to go directly from ADDR_BOUND immediately tells us if the ID is
already bound, if we can't do that then the attempt inside
rdma_bind_addr() to go from IDLE to ADDR_BOUND confirms the ID needs
binding.

Link: https://lore.kernel.org/r/20200902081122.745412-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:23 -03:00
Jason Gunthorpe
2a7cec5381 RDMA/cma: Fix locking for the RDMA_CM_CONNECT state
It is currently a bit confusing, but the design is if the handler_mutex
is held, and the state is in RDMA_CM_CONNECT, then the state cannot leave
RDMA_CM_CONNECT without also serializing with the handler_mutex.

Make this clearer by adding a direct assertion, fixing the usage in
rdma_connect and generally using READ_ONCE to read the state value.

Link: https://lore.kernel.org/r/20200902081122.745412-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:23 -03:00
Jason Gunthorpe
4aa1615268 RDMA/core: Fix ordering of CQ pool destruction
rxe will hold a refcount on the IB device as long as CQ objects exist,
this causes destruction of a rxe device to hang if the CQ pool has any
cached CQs since they are being destroyed after the refcount must go to
zero.

Treat the CQ pool like a client and create/destroy it before/after all
other clients. No users of CQ pool can exist past a client remove call.

Link: https://lore.kernel.org/r/e8a240aa-9e9b-3dca-062f-9130b787f29b@acm.org
Fixes: c7ff819aef ("RDMA/core: Introduce shared CQ pool API")
Tested-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 15:20:18 -03:00
Linus Torvalds
b1df2a0783 RDMA second 5.9-rc pull request
A number of driver bug fixes and a few recent regressions:
 
 - Several bug fixes for bnxt_re. Crashing, incorrect data reported,
   and corruption on new HW
 
 - Memory leak and crash in rxe
 
 - Fix sysfs corruption in rxe if the netdev name is too long
 
 - Fix a crash on error unwind in the new cq_pool code
 
 - Fix kobject panics in rtrs by working device lifetime properly
 
 - Fix a data corruption bug in iser target related to misaligned buffers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl9atAYACgkQOG33FX4g
 mxqSdw//Qi29dnxzVGpsaO4/krd/VmI6NT6eNpgK7Nqx80DaCYer0JhtwZOUxHqK
 KbHIV9XB/f6BSI67c9ydYj4PNX6FpFnoUWQLvqZwip5VM7R6ifIVjm0ap1jCAUSS
 axDLFZOySIOYNhcZ5I+MtY/kxykKBjMteuMXdpBe4FwZ+XSmsC5KkfRH/+FUhjVG
 peL6aRVDv9TByH8w+iZE1wSmVrOphOE1C/jN5TyotQTmKe7IHoXJtkalosYHXFWw
 KZaaz52e4IYKVFl4HIcl6+FfPExhxsyfDtRHluvn+vzY/wFy1RZw6F0BZt7mioy5
 J8R6w82xEe/SNugTGuvIzqXOymmy9H4CrG9pHy4NRMMzC28LGI7qHJgVhr/jZy8+
 GPxR26cywDhPsd4XA2K3mvs7DVSoBUPYlIUnHdYjfBZl/ghColg9+XGyNv6pdrke
 Q7Kog5blcpOAahBX+ElBLvIZXk5oEk5W+3H/M0OeuVMQ/DrMtALrCnwpp4wDKVvO
 9QuYfGgQ+25xbV9kwzckLGo5eedN3cRD/v4hcqvQUZo+9zLYZ/HZRMjpOdrscQ+I
 QL4FgpcLpOASKZ+bYjjpFxK3rNVTDT9CYJw4/hxEaOhxRhtAO1Q9mJdvJTK6dj09
 oR9LPyefQkyKCAt+heWHKKkEYDiwT8U1SlR8STotg24VHIj6Rb4=
 =2DDd
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "A number of driver bug fixes and a few recent regressions:

   - Several bug fixes for bnxt_re. Crashing, incorrect data reported,
     and corruption on new HW

   - Memory leak and crash in rxe

   - Fix sysfs corruption in rxe if the netdev name is too long

   - Fix a crash on error unwind in the new cq_pool code

   - Fix kobject panics in rtrs by working device lifetime properly

   - Fix a data corruption bug in iser target related to misaligned
     buffers"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/isert: Fix unaligned immediate-data handling
  RDMA/rtrs-srv: Set .release function for rtrs srv device during device init
  RDMA/bnxt_re: Remove set but not used variable 'qplib_ctx'
  RDMA/core: Fix reported speed and width
  RDMA/core: Fix unsafe linked list traversal after failing to allocate CQ
  RDMA/bnxt_re: Remove the qp from list only if the qp destroy succeeds
  RDMA/bnxt_re: Fix driver crash on unaligned PSN entry address
  RDMA/bnxt_re: Restrict the max_gids to 256
  RDMA/bnxt_re: Static NQ depth allocation
  RDMA/bnxt_re: Fix the qp table indexing
  RDMA/bnxt_re: Do not report transparent vlan from QP1
  RDMA/mlx4: Read pkey table length instead of hardcoded value
  RDMA/rxe: Fix panic when calling kmem_cache_create()
  RDMA/rxe: Fix memleak in rxe_mem_init_user
  RDMA/rxe: Fix the parent sysfs read when the interface has 15 chars
  RDMA/rtrs-srv: Replace device_register with device_initialize and device_add
2020-09-11 10:02:36 -07:00
Jason Gunthorpe
81655d3c4a RDMA/mlx4: Use ib_umem_num_dma_blocks()
For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size() use
ib_umem_num_dma_blocks() inside the function, it is just some weird static
default.

All other places are just using it with PAGE_SIZE, switch to
ib_umem_num_dma_blocks().

As this is the last call site, remove ib_umem_num_count().

Link: https://lore.kernel.org/r/15-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-11 10:24:54 -03:00
Jason Gunthorpe
a665aca89a RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()
ib_umem_num_pages() should only be used by things working with the SGL in
CPU pages directly.

Drivers building DMA lists should use the new ib_num_dma_blocks() which
returns the number of blocks rdma_umem_for_each_block() will return.

To make this general for DMA drivers requires a different implementation.
Computing DMA block count based on umem->address only works if the
requested page size is < PAGE_SIZE and/or the IOVA == umem->address.

Instead the number of DMA pages should be computed in the IOVA address
space, not umem->address. Thus the IOVA has to be stored inside the umem
so it can be used for these calculations.

For now set it to umem->address by default and fix it up if
ib_umem_find_best_pgsz() was called. This allows drivers to be converted
to ib_umem_num_dma_blocks() safely.

Link: https://lore.kernel.org/r/6-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-11 10:24:53 -03:00
Jason Gunthorpe
3361c29e92 RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz()
The calculation in rdma_find_pg_bit() is fairly complicated, and the
function is never called anywhere else. Inline a simpler version into
ib_umem_find_best_pgsz()

Link: https://lore.kernel.org/r/3-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 15:33:17 -03:00
Jason Gunthorpe
10c75ccb54 RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
rdma_for_each_block() makes assumptions about how the SGL is constructed
that don't work if the block size is below the page size used to to build
the SGL.

The rules for umem SGL construction require that the SG's all be PAGE_SIZE
aligned and we don't encode the actual byte offset of the VA range inside
the SGL using offset and length. So rdma_for_each_block() has no idea
where the actual starting/ending point is to compute the first/last block
boundary if the starting address should be within a SGL.

Fixing the SGL construction turns out to be really hard, and will be the
subject of other patches. For now block smaller pages.

Fixes: 4a35339958 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/2-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 15:33:17 -03:00