linux_dsm_epyc7002/fs
Tejun Heo b2efa05265 block, cfq: unlink cfq_io_context's immediately
A cic is the association between an io_context and a request_queue.
It is linked from both ioc and q and should be destroyed when either one
goes away.  As ioc and q both have their own locks, locking becomes a
bit complex - both orders work for removal from one but not from the
other.

Currently, cfq tries to circumvent this locking order issue with RCU.
ioc->lock nests inside queue_lock, but the radix tree and cics are
also protected by RCU, allowing either side to walk its list without
grabbing a lock.

This rather unconventional use of RCU quickly devolves into extremely
fragile convolution.  For example, the following fault comes from cfqd
going away too soon after ioc and q exits raced.

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:
 [   88.503444]
 Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
 RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
 ...
 Call Trace:
  [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
  [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
  [<ffffffff81389130>] exit_io_context+0x100/0x140
  [<ffffffff81098a29>] do_exit+0x579/0x850
  [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
  [<ffffffff81098de7>] sys_exit_group+0x17/0x20
  [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b

The only real hot path here is cic lookup during request
initialization, and avoiding extra locking there requires only very
confined use of RCU.  This patch makes cic removal from both ioc and
request_queue perform double-locking and unlink immediately.

* From q side, the change is almost trivial as ioc->lock nests inside
  queue_lock.  It just needs to grab each ioc->lock as it walks
  cic_list and unlink it.

* From ioc side, it's a bit more difficult because of the inverted lock
  order.  ioc needs its lock to walk its cic_list but can't grab the
  matching queue_lock while holding it, so it needs to perform an
  unlock-relock dance.

  Unlinking is now done wholly from put_io_context(), and the fast
  path is optimized by using the queue_lock the caller already holds,
  which is by far the most common case.  If the ioc accessed multiple
  devices, it tries with trylock.  In the unlikely case of fast path
  failure, it falls back to the full double-locking dance from a
  workqueue.
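
The two removal paths above can be sketched in userspace C with
pthread mutexes.  This is only an illustration under stated
assumptions, not the kernel code: struct queue, struct ioctx, struct
cic and all function names are hypothetical stand-ins, and the slow
path relocks inline where the patch defers it to a workqueue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for request_queue, io_context and cic. */
struct queue { pthread_mutex_t lock; int nr_cics; };
struct ioctx { pthread_mutex_t lock; int nr_cics; };
struct cic   { struct queue *q; struct ioctx *ioc; bool linked; };

/* Caller holds both locks.  Safe to call on an already unlinked cic. */
static void cic_unlink_locked(struct cic *c)
{
    if (!c->linked)
        return;
    assert(c->q->nr_cics > 0 && c->ioc->nr_cics > 0);
    c->linked = false;
    c->q->nr_cics--;
    c->ioc->nr_cics--;
}

/* Queue side: the natural order, queue lock outside, ioc lock inside. */
static void unlink_from_queue_side(struct cic *c)
{
    pthread_mutex_lock(&c->q->lock);
    pthread_mutex_lock(&c->ioc->lock);
    cic_unlink_locked(c);
    pthread_mutex_unlock(&c->ioc->lock);
    pthread_mutex_unlock(&c->q->lock);
}

/*
 * ioc side: the ioc lock is taken first, so the queue lock can only be
 * tried.  On failure, drop the ioc lock and retake both in the natural
 * order.  Returns true when the trylock fast path succeeded.
 */
static bool unlink_from_ioc_side(struct cic *c)
{
    struct queue *q = c->q;
    struct ioctx *ioc = c->ioc;
    bool fast = true;

    pthread_mutex_lock(&ioc->lock);
    if (pthread_mutex_trylock(&q->lock) != 0) {
        fast = false;
        pthread_mutex_unlock(&ioc->lock);   /* unlock ... */
        pthread_mutex_lock(&q->lock);       /* ... and relock in order */
        pthread_mutex_lock(&ioc->lock);
    }
    /* The other side may have unlinked while no lock was held. */
    cic_unlink_locked(c);
    pthread_mutex_unlock(&ioc->lock);
    pthread_mutex_unlock(&q->lock);
    return fast;
}
```

Because the unlink helper rechecks the linked flag under both locks,
whichever side loses the race simply finds nothing left to do.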

Double-locking isn't the prettiest thing in the world but it's *far*
simpler and more understandable than the RCU trick without adding any
meaningful overhead.

This still leaves a lot of now unnecessary RCU logic.  Future patches
will trim it.

-v2: Vivek pointed out that cic->q was being dereferenced after
     cic->release() was called.  Updated to use local variable @this_q
     instead.
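
The v2 fix follows a general pattern: read any field you still need
into a local before invoking a release callback that may free the
object.  A minimal userspace sketch, assuming hypothetical struct
layouts and function names (only the local name this_q comes from the
note above):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel structures. */
struct queue { int released; };
struct cic {
    struct queue *q;
    void (*release)(struct cic *c);
};

/* Release callback that frees the cic, so c->q is invalid afterwards. */
static void cic_release(struct cic *c)
{
    c->q->released++;
    free(c);
}

/* Save c->q in a local first, then release, then use only the local. */
static struct queue *release_cic(struct cic *c)
{
    struct queue *this_q = c->q;

    c->release(c);      /* c must not be dereferenced past this point */
    return this_q;
}
```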

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:39 +01:00
..
9p filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
adfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
affs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
afs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
autofs4 filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
befs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
bfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
btrfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2011-12-08 13:18:59 -08:00
cachefiles kill useless checks for sb->s_op == NULL 2011-07-20 01:44:21 -04:00
ceph ceph: initialize root dentry 2011-11-11 09:50:17 -08:00
cifs cifs: check for NULL last_entry before calling cifs_save_resume_key 2011-12-08 22:04:47 -06:00
coda filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
configfs doc: fix broken references 2011-09-27 18:08:04 +02:00
cramfs cramfs: get_cramfs_inode() returns ERR_PTR() on failure 2011-07-17 23:22:02 -04:00
debugfs debugfs: Fix a comment mistake 2011-08-22 17:41:48 -07:00
devpts filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
dlm Merge branch 'for-3.1' of git://linux-nfs.org/~bfields/linux 2011-07-25 22:49:19 -07:00
ecryptfs eCryptfs: Extend array bounds for all filename chars 2011-11-23 15:43:53 -06:00
efs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
exofs Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
exportfs
ext2 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-11-02 11:41:01 -07:00
ext3 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-11-02 11:41:01 -07:00
ext4 ext4: fix racy use-after-free in ext4_end_io_dio() 2011-11-24 19:22:24 -05:00
fat filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
freevxfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
fscache FS-Cache: Fix __fscache_uncache_all_inode_pages()'s outer loop 2011-07-21 10:59:16 -07:00
fuse Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
gfs2 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
hfs hfs: add sanity check for file name length 2011-11-15 14:29:42 -02:00
hfsplus filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
hostfs Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-11-02 11:41:01 -07:00
hpfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
hppfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
hugetlbfs filesystems: add missing nlink wrappers 2011-11-02 12:53:43 +01:00
isofs Merge branch 'akpm' (Andrew's incoming - part two) 2011-11-02 16:07:27 -07:00
jbd jbd/jbd2: validate sb->s_first in journal_get_superblock() 2011-11-01 19:04:59 -04:00
jbd2 jbd2: Unify log messages in jbd2 code 2011-11-01 19:09:18 -04:00
jffs2 Merge git://git.infradead.org/mtd-2.6 2011-11-07 09:11:16 -08:00
jfs Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
lockd SUNRPC: Replace svc_addr_u by sockaddr_storage 2011-09-14 08:21:48 -04:00
logfs Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
minix minixfs: kill manual hweight(), simplify 2011-11-19 11:13:28 -05:00
ncpfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
nfs Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs 2011-11-22 08:54:15 -08:00
nfs_common
nfsd Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
nilfs2 filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
nls
notify atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
ntfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
ocfs2 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 2011-12-01 14:55:34 -08:00
omfs omfs: fix (mode & S_IFDIR) abuse 2011-07-26 13:05:28 -04:00
openpromfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
partitions treewide: use __printf not __attribute__((format(printf,...))) 2011-10-31 17:30:54 -07:00
proc procfs: do not overflow get_{idle,iowait}_time for nohz 2011-12-09 07:50:29 -08:00
pstore pstore: pass allocated memory region back to caller 2011-11-17 12:58:07 -08:00
qnx4 filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
quota Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux 2011-11-06 19:02:23 -08:00
ramfs ramfs: remove module leftovers 2011-11-02 16:06:58 -07:00
reiserfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
romfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
squashfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next 2011-11-04 16:48:37 -07:00
sysfs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
sysv filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
ubifs Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 2011-11-07 08:52:19 -08:00
udf Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-11-02 11:41:01 -07:00
ufs filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
xfs xfs: fix the logspace waiting algorithm 2011-12-06 14:19:47 -06:00
aio.c aio: allocate kiocbs in batches 2011-11-02 16:07:03 -07:00
anon_inodes.c vfs: dont chain pipe/anon/socket on superblock s_inodes list 2011-07-26 12:57:09 -04:00
attr.c Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next 2011-08-09 10:31:03 +10:00
bad_inode.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
binfmt_aout.c
binfmt_elf_fdpic.c consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling 2011-07-20 01:43:10 -04:00
binfmt_elf.c binfmt_elf: fix PIE execution with randomization disabled 2011-11-02 16:06:58 -07:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c filesystems: add missing nlink wrappers 2011-11-02 12:53:43 +01:00
binfmt_script.c
binfmt_som.c
bio-integrity.c fs: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macros 2011-10-31 19:30:31 -04:00
bio.c bio: change some signed vars to unsigned 2011-11-16 09:21:50 +01:00
block_dev.c Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-block 2011-11-04 17:22:14 -07:00
buffer.c Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux 2011-11-06 19:02:23 -08:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c compat_ioctl: add compat handler for PPPIOCGL2TPSTATS 2011-08-07 22:24:41 -07:00
compat.c Cross Memory Attach 2011-10-31 17:30:44 -07:00
dcache.c fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API 2011-12-06 23:57:18 -05:00
dcookies.c oprofile, dcookies: Fix possible circular locking dependency 2011-05-31 16:33:35 +02:00
direct-io.c direct-io: merge direct_io_walker into __blockdev_direct_IO 2011-10-28 14:58:58 +02:00
drop_caches.c vmscan: change shrinker API by passing shrink_control struct 2011-05-25 08:39:26 -07:00
eventfd.c
eventpoll.c epoll: fix spurious lockdep warnings 2011-10-31 17:30:57 -07:00
exec.c oom: remove oom_disable_count 2011-10-31 17:30:45 -07:00
fcntl.c
fhandle.c
fifo.c
file_table.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
file.c
filesystems.c
fs_struct.c
fs-writeback.c writeback: Add a 'reason' to wb_writeback_work 2011-10-31 00:33:36 +08:00
generic_acl.c switch posix_acl_equiv_mode() to umode_t * 2011-08-01 02:10:06 -04:00
inode.c vfs: protect i_nlink 2011-11-02 12:53:43 +01:00
internal.h superblock: move pin_sb_for_writeback() to fs/super.c 2011-07-20 01:44:38 -04:00
ioctl.c
ioprio.c block, cfq: unlink cfq_io_context's immediately 2011-12-14 00:33:39 +01:00
Kconfig tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious. 2011-10-31 17:30:45 -07:00
Kconfig.binfmt
libfs.c filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
locks.c Merge branch 'for-3.2' of git://linux-nfs.org/~bfields/linux 2011-10-25 15:42:01 +02:00
Makefile fs/Makefile: Stupid typo breakage of exofs inclusion 2011-10-27 08:36:51 +02:00
mbcache.c vmscan: change shrinker API by passing shrink_control struct 2011-05-25 08:39:26 -07:00
mpage.c mm/fs: add hooks to support cleancache 2011-05-26 10:01:43 -06:00
namei.c VFS: we need to set LOOKUP_JUMPED on mountpoint crossing 2011-11-07 14:58:06 -08:00
namespace.c fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API 2011-12-06 23:57:18 -05:00
no-block.c
open.c leases: fix write-open/read-lease race 2011-10-28 14:59:00 +02:00
pipe.c fs/pipe.c: add ->statfs callback for pipefs 2011-10-31 17:30:51 -07:00
pnode.c
pnode.h
posix_acl.c vfs: pass all mask flags check_acl and posix_acl_permission 2011-10-28 14:58:54 +02:00
read_write.c Cross Memory Attach 2011-10-31 17:30:44 -07:00
read_write.h
readdir.c
select.c
seq_file.c fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API 2011-12-06 23:57:18 -05:00
signalfd.c
splice.c tmpfs: clone shmem_file_splice_read() 2011-07-25 20:57:11 -07:00
stack.c filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
stat.c readlinkat: ensure we return ENOENT for the empty pathname for normal lookups 2011-11-02 12:53:42 +01:00
statfs.c VFS: fix statfs() automounter semantics regression 2011-11-04 18:15:59 -07:00
super.c vfs: ignore error on forced remount 2011-11-02 12:53:42 +01:00
sync.c writeback: Add a 'reason' to wb_writeback_work 2011-10-31 00:33:36 +08:00
timerfd.c timerfd: Fix wakeup of processes when timer is cancelled on clock change 2011-06-14 11:46:14 +02:00
utimes.c
xattr_acl.c
xattr.c evm: evm_inode_post_removexattr 2011-07-18 12:29:43 -04:00