linux_dsm_epyc7002/drivers/md
Douglas Anderson 9ea61cac0b dm bufio: avoid sleeping while holding the dm_bufio lock
We've seen in-field reports showing _lots_ (18 in one case, 41 in
another) of tasks all sitting there blocked on:

  mutex_lock+0x4c/0x68
  dm_bufio_shrink_count+0x38/0x78
  shrink_slab.part.54.constprop.65+0x100/0x464
  shrink_zone+0xa8/0x198

In the two cases analyzed, we see one task that looks like this:

  Workqueue: kverityd verity_prefetch_io

  __switch_to+0x9c/0xa8
  __schedule+0x440/0x6d8
  schedule+0x94/0xb4
  schedule_timeout+0x204/0x27c
  schedule_timeout_uninterruptible+0x44/0x50
  wait_iff_congested+0x9c/0x1f0
  shrink_inactive_list+0x3a0/0x4cc
  shrink_lruvec+0x418/0x5cc
  shrink_zone+0x88/0x198
  try_to_free_pages+0x51c/0x588
  __alloc_pages_nodemask+0x648/0xa88
  __get_free_pages+0x34/0x7c
  alloc_buffer+0xa4/0x144
  __bufio_new+0x84/0x278
  dm_bufio_prefetch+0x9c/0x154
  verity_prefetch_io+0xe8/0x10c
  process_one_work+0x240/0x424
  worker_thread+0x2fc/0x424
  kthread+0x10c/0x114

...and that looks to be the one holding the mutex.

The problem has been reproduced on fairly easily:
0. Be running Chrome OS w/ verity enabled on the root filesystem
1. Pick test patch: http://crosreview.com/412360
2. Install launchBalloons.sh and balloon.arm from
     http://crbug.com/468342
   ...that's just a memory stress test app.
3. On a 4GB rk3399 machine, run
     nice ./launchBalloons.sh 4 900 100000
   ...that tries to eat 4 * 900 MB of memory and keep accessing.
4. Login to the Chrome web browser and restore many tabs

With that, I've seen printouts like:
  DOUG: long bufio 90758 ms
...and stack trace always show's we're in dm_bufio_prefetch().

The problem is that we try to allocate memory with GFP_NOIO while
we're holding the dm_bufio lock.  Instead we should be using
GFP_NOWAIT.  Using GFP_NOIO can cause us to sleep while holding the
lock and that causes the above problems.

The current behavior explained by David Rientjes:

  It will still try reclaim initially because __GFP_WAIT (or
  __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO.  This is the cause of
  contention on dm_bufio_lock() that the thread holds.  You want to
  pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
  mutex that can be contended by a concurrent slab shrinker (if
  count_objects didn't use a trylock, this pattern would trivially
  deadlock).

This change significantly increases responsiveness of the system while
in this state.  It makes a real difference because it unblocks kswapd.
In the bug report analyzed, kswapd was hung:

   kswapd0         D ffffffc000204fd8     0    72      2 0x00000000
   Call trace:
   [<ffffffc000204fd8>] __switch_to+0x9c/0xa8
   [<ffffffc00090b794>] __schedule+0x440/0x6d8
   [<ffffffc00090bac0>] schedule+0x94/0xb4
   [<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44
   [<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac
   [<ffffffc00090d9d8>] mutex_lock+0x4c/0x68
   [<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78
   [<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464
   [<ffffffc00030dbd8>] shrink_zone+0xa8/0x198
   [<ffffffc00030e578>] balance_pgdat+0x328/0x508
   [<ffffffc00030eb7c>] kswapd+0x424/0x51c
   [<ffffffc00023f06c>] kthread+0x10c/0x114
   [<ffffffc000203dd0>] ret_from_fork+0x10/0x40

By unblocking kswapd memory pressure should be reduced.

Suggested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:04 -05:00
..
bcache block: export bio_free_pages to other modules 2016-09-22 07:48:03 -06:00
persistent-data dm block manager: make block locking optional 2016-11-14 15:17:47 -05:00
bitmap.c md/bitmap: fix wrong cleanup 2016-09-21 09:09:44 -07:00
bitmap.h md-cluster: sync bitmap when node received RESYNCING msg 2016-05-04 12:39:35 -07:00
dm-bio-prison.c block: add a bi_error field to struct bio 2015-07-29 08:55:15 -06:00
dm-bio-prison.h dm bio prison: add dm_cell_promote_or_release() 2015-05-29 14:19:06 -04:00
dm-bio-record.h
dm-bufio.c dm bufio: avoid sleeping while holding the dm_bufio lock 2016-12-08 14:13:04 -05:00
dm-bufio.h dm snapshot: use dm-bufio prefetch 2014-01-14 23:23:03 -05:00
dm-builtin.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-cache-block-types.h dm cache: revert "remove remainder of distinct discard block size" 2014-11-10 15:25:30 -05:00
dm-cache-metadata.c dm cache metadata: remove an extra newline in DMERR and code 2016-11-21 09:52:02 -05:00
dm-cache-metadata.h dm cache: make sure every metadata function checks fail_io 2016-03-10 17:12:12 -05:00
dm-cache-policy-cleaner.c dm cache: speed up writing of the hint array 2016-09-22 11:15:02 -04:00
dm-cache-policy-internal.h dm cache: speed up writing of the hint array 2016-09-22 11:15:02 -04:00
dm-cache-policy-smq.c dm cache policy smq: distribute entries to random levels when switching to smq 2016-09-22 11:15:03 -04:00
dm-cache-policy.c dm cache: add policy name to status output 2014-01-16 13:44:11 -05:00
dm-cache-policy.h dm cache: speed up writing of the hint array 2016-09-22 11:15:02 -04:00
dm-cache-target.c dm cache: add missing cache device name to DMERR in set_cache_mode() 2016-11-21 09:52:03 -05:00
dm-core.h dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-crypt.c dm crypt: constify crypt_iv_operations structures 2016-11-21 09:52:06 -05:00
dm-delay.c dm: rename target's per_bio_data_size to per_io_data_size 2016-02-22 22:34:37 -05:00
dm-era-target.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-exception-store.c - Revert a dm-multipath change that caused a regression for unprivledged 2015-11-04 21:19:53 -08:00
dm-exception-store.h dm snapshot: fix hung bios when copy error occurs 2016-01-08 20:03:05 -05:00
dm-flakey.c dm flakey: return -EINVAL on interval bounds error in flakey_ctr() 2016-11-21 09:52:07 -05:00
dm-io.c dm io: use bvec iterator helpers to implement .get_page and .next_page 2016-11-21 09:51:57 -05:00
dm-ioctl.c dm: allow bio-based table to be upgraded to bio-based with DAX support 2016-07-20 23:49:52 -04:00
dm-kcopyd.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-linear.c libnvdimm for 4.8 2016-07-28 17:38:16 -07:00
dm-log-userspace-base.c dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() 2015-10-31 19:06:00 -04:00
dm-log-userspace-transfer.c dm log userspace transfer: match wait_for_completion_timeout return type 2015-04-15 12:10:20 -04:00
dm-log-userspace-transfer.h
dm-log-writes.c Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block 2016-10-07 14:42:05 -07:00
dm-log.c dm log: fix unitialized bio operation flags 2016-08-24 21:55:05 -04:00
dm-mpath.c dm mpath: do not modify *__clone if blk_mq_alloc_request() fails 2016-11-21 09:52:10 -05:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h dm path selector: remove 'repeat_count' return from .select_path hook 2016-02-22 22:34:42 -05:00
dm-queue-length.c dm path selector: remove 'repeat_count' return from .select_path hook 2016-02-22 22:34:42 -05:00
dm-raid1.c dm mirror: use all available legs on multiple failures 2016-10-14 11:55:17 -04:00
dm-raid.c dm raid: correct error messages on old metadata validation 2016-11-21 09:52:05 -05:00
dm-region-hash.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-round-robin.c dm round robin: do not use this_cpu_ptr() without having preemption disabled 2016-08-15 09:23:14 -04:00
dm-rq.c dm rq: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments 2016-11-21 09:51:57 -05:00
dm-rq.h dm rq: introduce dm_mq_kick_requeue_list() 2016-09-15 11:16:05 -04:00
dm-service-time.c dm path selector: remove 'repeat_count' return from .select_path hook 2016-02-22 22:34:42 -05:00
dm-snap-persistent.c dm: use bio op accessors 2016-06-07 13:41:38 -06:00
dm-snap-transient.c dm snapshot: fix hung bios when copy error occurs 2016-01-08 20:03:05 -05:00
dm-snap.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-stats.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-stats.h dm stats: support precise timestamps 2015-06-17 12:40:40 -04:00
dm-stripe.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-switch.c dm switch: simplify conditional in alloc_region_table() 2015-10-31 19:06:06 -04:00
dm-sysfs.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-table.c dm table: simplify dm_table_determine_type() 2016-12-08 14:13:03 -05:00
dm-target.c libnvdimm for 4.8 2016-07-28 17:38:16 -07:00
dm-thin-metadata.c dm thin: fix a race condition between discarding and provisioning a block 2016-07-20 12:43:35 -04:00
dm-thin-metadata.h dm thin: fix a race condition between discarding and provisioning a block 2016-07-20 12:43:35 -04:00
dm-thin.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-uevent.c
dm-uevent.h
dm-verity-fec.c dm verity fec: fix block calculation 2016-07-01 23:29:08 -04:00
dm-verity-fec.h dm verity: add support for forward error correction 2015-12-10 10:39:03 -05:00
dm-verity-target.c dm verity: fix incorrect error message 2016-11-21 09:52:01 -05:00
dm-verity.h dm verity: add ignore_zero_blocks feature 2015-12-10 10:39:03 -05:00
dm-zero.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm.c - A couple DM raid and DM mirror fixes 2016-10-28 09:27:58 -07:00
dm.h dm: add infrastructure for DAX support 2016-07-20 23:49:49 -04:00
faulty.c MD: rename some functions 2016-01-20 13:52:20 -08:00
Kconfig dm block manager: make block locking optional 2016-11-14 15:17:47 -05:00
linear.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
linear.h
Makefile dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
md-cluster.c md-cluster: make resync lock also could be interruptted 2016-09-21 09:09:44 -07:00
md-cluster.h md-cluster: gather resync infos and enable recv_thread after bitmap is ready 2016-05-09 09:24:03 -07:00
md.c md: be careful not lot leak internal curr_resync value into metadata. -- (all) 2016-10-28 22:04:05 -07:00
md.h md: changes for MD_STILL_CLOSED flag 2016-09-21 09:09:44 -07:00
multipath.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
multipath.h
raid0.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
raid0.h block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
raid1.c raid1: handle read error also in readonly mode 2016-10-28 22:04:04 -07:00
raid1.h md-cluster: Use a small window for resync 2015-10-12 01:32:05 -05:00
raid5-cache.c raid5-cache: correct condition for empty metadata write 2016-10-28 22:04:03 -07:00
raid5.c Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2016-10-07 09:45:43 -07:00
raid5.h md/raid5: Convert to hotplug state machine 2016-09-06 18:30:23 +02:00
raid10.c RAID10: ignore discard error 2016-10-24 15:28:17 -07:00
raid10.h raid10: improve random reads performance 2016-07-19 15:20:28 -07:00