Commit Graph

56 Commits

Author SHA1 Message Date
David S. Miller
89b1698c93 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:55:32 -07:00
Ka-Cheong Poon
e65d4d9633 rds: Remove IPv6 dependency
This patch removes the IPv6 dependency from RDS.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:32:35 -07:00
Avinash Repaka
9e630bcb77 RDS: RDMA: Fix the NULL-ptr deref in rds_ib_get_mr
Registration of a memory region(MR) through FRMR/fastreg(unlike FMR)
needs a connection/qp. With a proxy qp, this dependency on connection
will be removed, but that needs more infrastructure patches, which is a
work in progress.

As an intermediate fix, the get_mr returns EOPNOTSUPP when connection
details are not populated. The MR registration through sendmsg() will
continue to work even with fast registration, since connection in this
case is formed upfront.

This patch fixes the following crash:
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 4244 Comm: syzkaller468044 Not tainted 4.16.0-rc6+ #361
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:rds_ib_get_mr+0x5c/0x230 net/rds/ib_rdma.c:544
RSP: 0018:ffff8801b059f890 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8801b07e1300 RCX: ffffffff8562d96e
RDX: 000000000000000d RSI: 0000000000000001 RDI: 0000000000000068
RBP: ffff8801b059f8b8 R08: ffffed0036274244 R09: ffff8801b13a1200
R10: 0000000000000004 R11: ffffed0036274243 R12: ffff8801b13a1200
R13: 0000000000000001 R14: ffff8801ca09fa9c R15: 0000000000000000
FS:  00007f4d050af700(0000) GS:ffff8801db300000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4d050aee78 CR3: 00000001b0d9b006 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __rds_rdma_map+0x710/0x1050 net/rds/rdma.c:271
 rds_get_mr_for_dest+0x1d4/0x2c0 net/rds/rdma.c:357
 rds_setsockopt+0x6cc/0x980 net/rds/af_rds.c:347
 SYSC_setsockopt net/socket.c:1849 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1828
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4456d9
RSP: 002b:00007f4d050aedb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000006dac3c RCX: 00000000004456d9
RDX: 0000000000000007 RSI: 0000000000000114 RDI: 0000000000000004
RBP: 00000000006dac38 R08: 00000000000000a0 R09: 0000000000000000
R10: 0000000020000380 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fffbfb36d6f R14: 00007f4d050af9c0 R15: 0000000000000005
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 cc 01 00 00 4c 8b bb 80 04 00 00
48
b8 00 00 00 00 00 fc ff df 49 8d 7f 68 48 89 fa 48 c1 ea 03 <80> 3c 02
00 0f
85 9c 01 00 00 4d 8b 7f 68 48 b8 00 00 00 00 00
RIP: rds_ib_get_mr+0x5c/0x230 net/rds/ib_rdma.c:544 RSP:
ffff8801b059f890
---[ end trace 7e1cea13b85473b0 ]---

Reported-by: syzbot+b51c77ef956678a65834@syzkaller.appspotmail.com
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-26 14:03:07 -07:00
Ka-Cheong Poon
b7ff8b1036 rds: Extend RDS API for IPv6 support
There are many data structures (RDS socket options) used by RDS apps
which use a 32 bit integer to store IP address. To support IPv6,
struct in6_addr needs to be used. To ensure backward compatibility, a
new data structure is introduced for each of those data structures
which use a 32 bit integer to represent an IP address. And new socket
options are introduced to use those new structures. This means that
existing apps should work without a problem with the new RDS module.
For apps which want to use IPv6, those new data structures and socket
options can be used. IPv4 mapped address is used to represent IPv4
address in the new data structures.

v4: Revert changes to SO_RDS_TRANSPORT

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 21:17:44 -07:00
Ka-Cheong Poon
eee2fa6ab3 rds: Changing IP address internal representation to struct in6_addr
This patch changes the internal representation of an IP address to use
struct in6_addr.  IPv4 address is stored as an IPv4 mapped address.
All the functions which take an IP address as argument are also
changed to use struct in6_addr.  But RDS socket layer is not modified
such that it still does not accept IPv6 address from an application.
And RDS layer does not accept nor initiate IPv6 connections.

v2: Fixed sparse warnings.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 21:17:44 -07:00
Avinash Repaka
b1fb67fa50 RDS: IB: Initialize max_items based on underlying device attributes
Use max_1m_mrs/max_8k_mrs while setting max_items, as the former
variables are set based on the underlying device attributes.

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 21:16:33 -07:00
Reshetova, Elena
50d61ff789 net, rds: convert rds_ib_device.refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:17 +01:00
Sowmini Varadhan
0cb43965d4 RDS: split out connection specific state from rds_connection to rds_conn_path
In preparation for multipath RDS, split the rds_connection
structure into a base structure, and a per-path struct rds_conn_path.
The base structure tracks information and locks common to all
paths. The workqs for send/recv/shutdown etc are tracked per
rds_conn_path. Thus the workq callbacks now work with rds_conn_path.

This commit allows for one rds_conn_path per rds_connection, and will
be extended into multiple conn_paths in  subsequent commits.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:41 -07:00
Bhaktipriya Shridhar
231edca97f RDS: IB: Remove deprecated create_workqueue
alloc_workqueue replaces deprecated create_workqueue().

Since the driver is infiniband which can be used as block device and the
workqueue seems involved in regular operation of the device, so a
dedicated workqueue has been used  with WQ_MEM_RECLAIM set to guarantee
forward progress under memory pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 22:52:28 -07:00
Avinash Repaka
1659185fb4 RDS: IB: Support Fastreg MR (FRMR) memory registration mode
Fastreg MR(FRMR) is another method with which one can
register memory to HCA. Some of the newer HCAs supports only fastreg
mr mode, so we need to add support for it to have RDS functional
on them.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:19 -05:00
santosh.shilimkar@oracle.com
db42753adb RDS: IB: add mr reused stats
Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:19 -05:00
santosh.shilimkar@oracle.com
490ea5967b RDS: IB: move FMR code to its own file
No functional change.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:18 -05:00
santosh.shilimkar@oracle.com
a69365a39c RDS: IB: create struct rds_ib_fmr
Keep fmr related filed in its own struct. Fastreg MR structure
will be added to the union.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:18 -05:00
santosh.shilimkar@oracle.com
f6df683f32 RDS: IB: Re-organise ibmr code
No functional changes. This is in preperation towards adding
fastreg memory resgitration support.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:18 -05:00
Santosh Shilimkar
0676651323 RDS: IB: split mr pool to improve 8K messages performance
8K message sizes are pretty important usecase for RDS current
workloads so we make provison to have 8K mrs available from the pool.
Based on number of SG's in the RDS message, we pick a pool to use.

Also to make sure that we don't under utlise mrs when say 8k messages
are dominating which could lead to 8k pull being exhausted, we fall-back
to 1m pool till 8k pool recovers for use.

This helps to at least push ~55 kB/s bidirectional data which
is a nice improvement.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:02 -07:00
Santosh Shilimkar
67161e250a RDS: IB: mark rds_ib_fmr_wq static
Fix below warning by marking rds_ib_fmr_wq static

net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. Should it be static?

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:02 -07:00
Santosh Shilimkar
26139dc1db RDS: IB: use already available pool handle from ibmr
rds_ib_mr already keeps the pool handle which it associates
with. Lets use that instead of round about way of fetching
it from rds_ib_device.

No functional change.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:02 -07:00
Santosh Shilimkar
2e1d6b813a RDS: IB: fix the rds_ib_fmr_wq kick call
RDS IB mr pool has its own workqueue 'rds_ib_fmr_wq', so we need
to use queue_delayed_work() to kick the work. This was hurting
the performance since pool maintenance was less often triggered
from other path.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:01 -07:00
Santosh Shilimkar
59fe460674 RDS: use kfree_rcu in rds_ib_remove_ipaddr
synchronize_rcu() slowing down un-necessarily the socket shutdown
path. It is used just kfree() the ip addresses in rds_ib_remove_ipaddr()
which is perfect usecase for kfree_rcu();

So lets use that to gain some speedup.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-09-30 12:43:24 -04:00
santosh.shilimkar@oracle.com
2724121419 RDS: remove superfluous from rds_ib_alloc_fmr()
Memory allocated for 'ibmr' uses kzalloc_node() which already
initialises the memory to zero. There is no need to do
memset() 0 on that memory.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com
ef5217a6e2 RDS: flush the FMR pool less often
FMR flush is an expensive and time consuming operation. Reduce the
frequency of FMR pool flush by 50% so that more FMR work gets accumulated
for more efficient flushing.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com
ad1d7dc0d7 RDS: push FMR pool flush work to its own worker
RDS FMR flush operation and also it races with connect/reconect
which happes a lot with RDS. FMR flush being on common rds_wq aggrevates
the problem. Lets push RDS FMR pool flush work to its own worker.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
Wengang Wang
6116c2030f RDS: fix fmr pool dirty_count
In rds_ib_flush_mr_pool(), dirty_count accounts the clean ones
which is wrong. This can lead to a negative dirty count value.

Lets fix it.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com
5c240fa2ab RDS: Fix assertion level from fatal to warning
Fix the asserion level since its not fatal and can be hit
in normal execution paths. There is no need to take the
system down.

We keep the WARN_ON() to detect the condition if we get
here with bad pages.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:31 -07:00
santosh.shilimkar@oracle.com
e1f475a738 RDS: don't update ip address tables if the address hasn't changed
If the ip address tables hasn't changed, there is no need to remove
them only to be added back again.

Lets fix it.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:29 -07:00
Wengang Wang
4fabb59449 rds: rds_ib_device.refcount overflow
Fixes: 3e0249f9c0 ("RDS/IB: add refcount tracking to struct rds_ib_device")

There lacks a dropping on rds_ib_device.refcount in case rds_ib_alloc_fmr
failed(mr pool running out). this lead to the refcount overflow.

A complain in line 117(see following) is seen. From vmcore:
s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448.
That is the evidence the mr pool is used up. so rds_ib_alloc_fmr is very likely
to return ERR_PTR(-EAGAIN).

115 void rds_ib_dev_put(struct rds_ib_device *rds_ibdev)
116 {
117         BUG_ON(atomic_read(&rds_ibdev->refcount) <= 0);
118         if (atomic_dec_and_test(&rds_ibdev->refcount))
119                 queue_work(rds_wq, &rds_ibdev->free_work);
120 }

fix is to drop refcount when rds_ib_alloc_fmr failed.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14 13:20:11 -04:00
Christoph Lameter
903ceff7ca net: Replace get_cpu_var through this_cpu_ptr
Replace uses of get_cpu_var for address calculation through this_cpu_ptr.

Cc: netdev@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:47 -04:00
Huang Ying
1bc144b625 net, rds, Replace xlist in net/rds/xlist.h with llist
The functionality of xlist and llist is almost same.  This patch
replace xlist with llist to avoid code duplication.

Known issues: don't know how to test this, need special hardware?

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Andy Grover <andy.grover@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 15:36:32 -04:00
Tejun Heo
c534a107e8 rds/ib: use system_wq instead of rds_ib_fmr_wq
With cmwq, there's no reason to use dedicated rds_ib_fmr_wq - it's not
in the memory reclaim path and the maximum number of concurrent work
items is bound by the number of devices.  Drop it and use system_wq
instead.  This rds_ib_fmr_init/exit() noops.  Both removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Grover <andy.grover@oracle.com>
2011-02-01 11:42:43 +01:00
stephen hemminger
ff51bf8415 rds: make local functions/variables static
The RDS protocol has lots of functions that should be
declared static. rds_message_get/add_version_extension is
removed since it defined but never used.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 04:26:39 -07:00
Dan Carpenter
aef3ea33e8 rds: spin_lock_irq() is not nestable
This is basically just a cleanup.  IRQs were disabled on the previous
line so we don't need to do it again here.  In the current code IRQs
would get turned on one line earlier than intended.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-19 11:59:44 -07:00
Zach Brown
ea819867b7 RDS/IB: protect the list of IB devices
The RDS IB device list wasn't protected by any locking.  Traversal in
both the get_mr and FMR flushing paths could race with additon and
removal.

List manipulation is done with RCU primatives and is protected by the
write side of a rwsem.  The list traversal in the get_mr fast path is
protected by a rcu read critical section.  The FMR list traversal is
more problematic because it can block while traversing the list.  We
protect this with the read side of the rwsem.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:16:44 -07:00
Chris Mason
8576f374ac RDS: flush fmrs before allocating new ones
Flushing FMRs is somewhat expensive, and is currently kicked off when
the interrupt handler notices that we are getting low.  The result of
this is that FMR flushing only happens from the interrupt cpus.

This spreads the load more effectively by triggering flushes just before
we allocate a new FMR.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:16:42 -07:00
Zach Brown
ef87b7ea39 RDS: remove __init and __exit annotation
The trivial amount of memory saved isn't worth the cost of dealing with section
mismatches.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:16:39 -07:00
Zach Brown
515e079dab RDS/IB: create a work queue for FMR flushing
This patch moves the FMR flushing work in to its own mult-threaded work queue.
This is to maintain performance in preparation for returning the main krdsd
work queue back to a single threaded work queue to avoid deep-rooted
concurrency bugs.

This is also good because it further separates FMRs, which might be removed
some day, from the rest of the code base.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:16:34 -07:00
Zach Brown
8aeb1ba663 RDS/IB: destroy connections on rmmod
IB connections were not being destroyed during rmmod.

First, recently IB device removal callback was changed to disconnect
connections that used the removing device rather than destroying them.  So
connections with devices during rmmod were not being destroyed.

Second, rds_ib_destroy_nodev_conns() was being called before connections are
disassociated with devices.  It would almost never find connections in the
nodev list.

We first get rid of rds_ib_destroy_conns(), which is no longer called, and
refactor the existing caller into the main body of the function and get rid of
the list and lock wrappers.

Then we call rds_ib_destroy_nodev_conns() *after* ib_unregister_client() has
removed the IB device from all the conns and put the conns on the nodev list.

The result is that IB connections are destroyed by rmmod.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:16:33 -07:00
Andy Grover
c9455d9996 RDS: whitespace 2010-09-08 18:15:32 -07:00
Chris Mason
7a0ff5dbdd RDS: use delayed work for the FMR flushes
Using a delayed work queue helps us make sure a healthy number of FMRs
have queued up over the limit.  It makes for a large improvement in RDMA
iops.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:30 -07:00
Chris Mason
6fa70da608 rds: recycle FMRs through lockless lists
FRM allocation and recycling is performance critical and fairly lock
intensive.  The current code has a per connection lock that all
processes bang on and it becomes a major bottleneck on large systems.

This changes things to use a number of cmpxchg based lists instead,
allowing us to go through the whole FMR lifecycle without locking inside
RDS.

Zach Brown pointed out that our usage of cmpxchg for xlist removal is
racey if someone manages to remove and add back an FMR struct into the list
while another CPU can see the FMR's address at the head of the list.

The second CPU might assume the list hasn't changed when in fact any
number of operations might have happened in between the deletion and
reinsertion.

This commit maintains a per cpu count of CPUs that are currently
in xlist removal, and establishes a grace period to make sure that
nobody can see an entry we have just removed from the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:28 -07:00
Zach Brown
3e0249f9c0 RDS/IB: add refcount tracking to struct rds_ib_device
The RDS IB client .remove callback used to free the rds_ibdev for the given
device unconditionally.  This could race other users of the struct.  This patch
adds refcounting so that we only free the rds_ibdev once all of its users are
done.

Many rds_ibdev users are tied to connections.  We give the connection a
reference and change these users to reference the device in the connection
instead of looking it up in the IB client data.  The only user of the IB client
data remaining is the first lookup of the device as connections are built up.

Incrementing the reference count of a device found in the IB client data could
race with final freeing so we use an RCU grace period to make sure that freeing
won't happen until those lookups are done.

MRs need the rds_ibdev to get at the pool that they're freed in to.  They exist
outside a connection and many MRs can reference different devices from one
socket, so it was natural to have each MR hold a reference.  MR refs can be
dropped from interrupt handlers and final device teardown can block so we push
it off to a work struct.  Pool teardown had to be fixed to cancel its pending
work instead of deadlocking waiting for all queued work, including itself, to
finish.

MRs get their reference from the global device list, which gets a reference.
It is left unprotected by locks and remains racy.  A simple global lock would
be a significant bottleneck.  More scalable (complicated) locking should be
done carefully in a later patch.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:15:17 -07:00
Chris Mason
38a4e5e613 rds: Use RCU for the bind lookup searches
The RDS bind lookups are somewhat expensive in terms of CPU
time and locking overhead.  This commit changes them into a
faster RCU based hash tree instead of the rbtrees they were using
before.

On large NUMA systems it is a significant improvement.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:08 -07:00
Andy Grover
e4c52c98e0 RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()
Allocate send/recv rings in memory that is node-local to the HCA.
This significantly helps performance.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:14:06 -07:00
Andy Grover
4a81802b5e RDS/IB: Remove unused variable in ib_remove_addr()
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:29 -07:00
Chris Mason
764f2dd92f rds: rcu-ize rds_ib_get_device()
rds_ib_get_device is called very often as we turn an
ip address into a corresponding device structure.  It currently
take a global spinlock as it walks different lists to find active
devices.

This commit changes the lists over to RCU, which isn't very complex
because they are not updated very often at all.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:12:28 -07:00
Andy Grover
15133f6e67 RDS: Implement atomic operations
Implement a CMSG-based interface to do FADD and CSWP ops.

Alter send routines to handle atomic ops.

Add atomic counters to stats.

Add xmit_atomic() to struct rds_transport

Inline rds_ib_send_unmap_rdma into unmap_rm

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:41 -07:00
Andy Grover
21f79afa5f RDS: fold rdma.h into rds.h
RDMA is now an intrinsic part of RDS, so it's easier to just have
a single header.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:37 -07:00
Andy Grover
9e2effba2c RDS: Fix BUG_ONs to not fire when in a tasklet
in_interrupt() is true in softirqs. The BUG_ONs are supposed
to check for if irqs are disabled, so we should use
BUG_ON(irqs_disabled()) instead, duh.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:07:31 -07:00
David S. Miller
871039f02f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/stmmac/stmmac_main.c
	drivers/net/wireless/wl12xx/wl1271_cmd.c
	drivers/net/wireless/wl12xx/wl1271_main.c
	drivers/net/wireless/wl12xx/wl1271_spi.c
	net/core/ethtool.c
	net/mac80211/scan.c
2010-04-11 14:53:53 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Andy Grover
561c7df63e RDS: Do not call set_page_dirty() with irqs off
set_page_dirty() unconditionally re-enables interrupts, so
if we call it with irqs off, they will be on after the call,
and that's bad. This patch moves the call after we've re-enabled
interrupts in send_drop_to(), so it's safe.

Also, add BUG_ONs to let us know if we ever do call set_page_dirty
with interrupts off.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:17:01 -07:00