Commit Graph

6274 Commits

Author SHA1 Message Date
Artur Paszkiewicz
41d2d848e5 md: improve io stats accounting
Use generic io accounting functions to manage io stats. There was an
attempt to do this earlier in commit 18c0b223cf ("md: use generic io
stats accounting functions to simplify io stat accounting"), but it did
not include a call to generic_end_io_acct() and caused issues with
tracking in-flight IOs, so it was later removed in commit 74672d069b
("md: fix md io stats accounting broken").

This patch attempts to fix this by using both disk_start_io_acct() and
disk_end_io_acct(). To make it possible, a struct md_io is allocated for
every new md bio, which includes the io start_time. A new mempool is
introduced for this purpose. We override bio->bi_end_io with our own
callback and call disk_start_io_acct() before passing the bio to
md_handle_request(). When it completes, we call disk_end_io_acct() and
the original bi_end_io callback.

This adds correct statistics about in-flight IOs and IO processing time,
interpreted e.g. in iostat as await, svctm, aqu-sz and %util.

It also fixes a situation where too many IOs where reported if a bio was
re-submitted to the mddev, because io accounting is now performed only
on newly arriving bios.

Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-13 23:06:56 -07:00
Colin Ian King
9a5a85972c md: raid0/linear: fix dereference before null check on pointer mddev
Pointer mddev is being dereferenced with a test_bit call before mddev
is being null checked, this may cause a null pointer dereference. Fix
this by moving the null pointer checks to sanity check mddev before
it is dereferenced.

Addresses-Coverity: ("Dereference before null check")
Fixes: 62f7b1989c ("md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-13 23:06:56 -07:00
Jens Axboe
482c6b614a Linux 5.8-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8CYDYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGcQkH/2vOsPf79yWtsc7x
 hd2LpCPfrm7T1xlQcYcXbEbyRI8sqPmguixO8pRI1ePl2lBZ7KurfyeYgYZNGpFU
 t74Ph6A6dSWoCgO68Genm/SQuK8ic6o9n1Vr8tDsGDp5KlHWNaweq4JwHrsPmO1T
 cI0PR/ClAhLG8cQZ4x988Es5HTNGY17XK27e+M/zKYxSMGY2NRdJBGQIq964i5Q8
 2d9G0rtVCaVDzgjrLwaFm6RBu21Il7HV6KsBsacyTFiL1ywx2vnUHzeZQyvuJSOQ
 4YpLo9v4tBP10WHC50LRStZyO0qRwPVd/Yl7fL4R/CKsJT9H4uiwasVoEBVSL/k6
 CUn3JL0=
 =P/Vx
 -----END PGP SIGNATURE-----

Merge tag 'v5.8-rc4' into for-5.9/drivers

Merge in 5.8-rc4 for-5.9/block to setup for-5.9/drivers, to provide
a clean base and making the life for the NVMe changes easier.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v5.8-rc4': (732 commits)
  Linux 5.8-rc4
  x86/ldt: use "pr_info_once()" instead of open-coding it badly
  MIPS: Do not use smp_processor_id() in preemptible code
  MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
  .gitignore: Do not track `defconfig` from `make savedefconfig`
  io_uring: fix regression with always ignoring signals in io_cqring_wait()
  x86/ldt: Disable 16-bit segments on Xen PV
  x86/entry/32: Fix #MC and #DB wiring on x86_32
  x86/entry/xen: Route #DB correctly on Xen PV
  x86/entry, selftests: Further improve user entry sanity checks
  x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
  i2c: mlxcpld: check correct size of maximum RECV_LEN packet
  i2c: add Kconfig help text for slave mode
  i2c: slave-eeprom: update documentation
  i2c: eg20t: Load module automatically if ID matches
  i2c: designware: platdrv: Set class based on DMI
  i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
  mm/page_alloc: fix documentation error
  vmalloc: fix the owner argument for the new __vmalloc_node_range callers
  mm/cma.c: use exact_nid true to fix possible per-numa cma leak
  ...
2020-07-08 08:02:13 -06:00
Jens Axboe
b53ac8b891 dm: remove unused variable
Since merging the commit identified in Fixes below, we trigger this
compile time warning:

drivers/md/dm.c: In function ‘__map_bio’:
drivers/md/dm.c:1296:24: warning: unused variable ‘md’ [-Wunused-variable]
 1296 |  struct mapped_device *md = io->md;
       |                        ^~

Remove the 'md' variable.

Fixes: 5a6c35f9af ("block: remove direct_make_request")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 15:45:19 -06:00
Christoph Hellwig
e556f6ba10 block: remove the bd_queue field from struct block_device
Just use bd_disk->queue instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 08:08:20 -06:00
Christoph Hellwig
5a6c35f9af block: remove direct_make_request
Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig
ed00aabd5e block: rename generic_make_request to submit_bio_noacct
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig
c62b37d96b block: move ->make_request_fn to struct block_device_operations
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig
f695ca3886 block: remove the request_queue argument from blk_queue_split
The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig
c4a59c4e5d dm: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig
987a0ef88b bcache: stop setting ->queuedata
Nothing in bcache actually uses the ->queuedata field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig
4ef2c5c246 dm: use bio_uninit instead of bio_disassociate_blkg
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail.  Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:07 -06:00
Linus Torvalds
5e8eed279f - Quite a few DM zoned target fixes and a Zone append fix in DM core.
Considering the amount of dm-zoned changes that went in during the
   5.8 merge window these fixes are not that surprising.
 
 - A few DM writecache target fixes.
 
 - A fix to Documentation index to include DM ebs target docs.
 
 - Small cleanup to use struct_size() in DM core's retrieve_deps().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl72SdYTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWtzXB/4y42NJLrH8EXNW/bGIpPlexpbWRW29
 5z6V1VL/q/s9AWrDunLsvhMAtx38FjHJO6i8GQld/gKRtEbOMHfEItsyrQZqKoSr
 M1NzlthuwFnUS30Uq6i06/ihdH5DDEhTElLs7+Q7pvNUd+KiI0mTgM/y856x3s4p
 4OMG1qEi6AnwvYFyCgu/w7/LLBYRipis+j+cm9Y0cTZujh4QBTeeQdBsu++fit2G
 7U1jkddiQSlb8DrbziRt9JK5WrvR4WpQmP9bnK1SxZtKY/55ZNpuNmFnUmUCgUAY
 Y7CEafoElJ6XYTBzKgYSDqY/eutXSzaxn1euDWizDS4H7kdgDS+3uVT9
 =zq2g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Quite a few DM zoned target fixes and a Zone append fix in DM core.

   Considering the amount of dm-zoned changes that went in during the
   5.8 merge window these fixes are not that surprising.

 - A few DM writecache target fixes.

 - A fix to Documentation index to include DM ebs target docs.

 - Small cleanup to use struct_size() in DM core's retrieve_deps().

* tag 'for-5.8/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm writecache: add cond_resched to loop in persistent_memory_claim()
  dm zoned: Fix reclaim zone selection
  dm zoned: Fix random zone reclaim selection
  dm: update original bio sector on Zone Append
  dm zoned: Fix metadata zone size check
  docs: device-mapper: add dm-ebs.rst to an index file
  dm ioctl: use struct_size() helper in retrieve_deps()
  dm writecache: skip writecache_wait when using pmem mode
  dm writecache: correct uncommitted_block when discarding uncommitted entry
  dm zoned: assign max_io_len correctly
  dm zoned: fix uninitialized pointer dereference
2020-06-27 08:57:16 -07:00
Christoph Hellwig
15f73f5b3e blk-mq: move failure injection out of blk_mq_complete_request
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:57 -06:00
Mikulas Patocka
d35bd764e6 dm writecache: add cond_resched to loop in persistent_memory_claim()
Add cond_resched() to a loop that fills in the mapper memory area
because the loop can be executed many times.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:32:24 -04:00
Shin'ichiro Kawasaki
f2cd9a5e85 dm zoned: Fix reclaim zone selection
When dm zoned has multiple devices, random zones are never selected for
reclaim if all reserved sequential write zones are in use and no
sequential write required zones can be selected for reclaim. This can
lead to deadlocks as selecting a cache zone allows reclaiming a
sequential zone, ensuring forward progress.

Fix this by always defaulting to selecting a random zone when no
sequential write required zone can be selected.

[Damien: fix commit message]

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:29:39 -04:00
Damien Le Moal
3ee39573e5 dm zoned: Fix random zone reclaim selection
Commit 2094045fe5 ("dm zoned: prefer full zones for reclaim")
modified dmz_get_rnd_zone_for_reclaim() to add a search for the buffer
zone with the heaviest weight as an optimal candidate for reclaim. This
modification uses the zone pointer variabl "last" which is set only once
and never modified as zones are scanned, resulting in the search being
inefective. Furthermore, if the selected buffer zone at the end of the
search loop is active or already locked for reclaim,
dmz_get_rnd_zone_for_reclaim() returns NULL even if other random zones
with a lesser weight can be reclaimed.

To fix the search and to guarantee that reclaim can make forward
progress, fix dmz_get_rnd_zone_for_reclaim() loop to correctly find
the buffer zone with the heaviest weight using the variable maxw_z.
Also make sure to fallback to finding the first random zone that can
be reclaimed if this best candidate zone cannot be reclaimed.

While at it, also fix the device index check to consider only random
zones, ignoring cache zones belonging to the cache device if one is
used as that device does not have a reclaim process.

Fixes: 2094045fe5 ("dm zoned: prefer full zones for reclaim")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:28:23 -04:00
Johannes Thumshirn
415c79e13b dm: update original bio sector on Zone Append
Naohiro reported that issuing zone-append bios to a zoned block device
underneath a dm-linear device does not work as expected.

This because we forgot to reverse-map the sector the device wrote to the
original bio.

For zone-append bios, get the offset in the zone of the written sector
from the clone bio and add that to the original bio's sector position.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:25:58 -04:00
Shin'ichiro Kawasaki
b38c0ad57f dm zoned: Fix metadata zone size check
When dm zoned has multiple devices, metadata is on the cache device, not
in random zones of the zoned devices. Then the number of metadata zones
shall be checked with the number of cache zones, not random zones.

Fixes: 34f5affd04 ("dm zoned: separate random and cache zones")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:21:57 -04:00
Gustavo A. R. Silva
da8996250a dm ioctl: use struct_size() helper in retrieve_deps()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct dm_target_deps {
      ...
        __u64 dev[0];   /* out */
};

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:31:45 -04:00
Huaisheng Ye
a143e172b6 dm writecache: skip writecache_wait when using pmem mode
The array bio_in_progress is only used with ssd mode. So skip
writecache_wait_for_ios in writecache_discard when pmem mode.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:25:42 -04:00
Huaisheng Ye
39495b12ef dm writecache: correct uncommitted_block when discarding uncommitted entry
When uncommitted entry has been discarded, correct wc->uncommitted_block
for getting the exact number.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:25:41 -04:00
Hou Tao
7b23774867 dm zoned: assign max_io_len correctly
The unit of max_io_len is sector instead of byte (spotted through
code review), so fix it.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:25:34 -04:00
Damien Le Moal
c69cb1d17b dm zoned: fix uninitialized pointer dereference
Make sure that the local variable rzone in dmz_do_reclaim() is always
initialized before being used for printing debug messages.

Fixes: f97809aec5 ("dm zoned: per-device reclaim")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:13:08 -04:00
Coly Li
4b25bbf52a bcache: pr_info() format clean up in bcache_device_init()
scripts/checkpatch.pl reports following warning for patch
("bcache: check and adjust logical block size for backing devices"),
    WARNING: quoted string split across lines
    #146: FILE: drivers/md/bcache/super.c:896:
    +  pr_info("%s: sb/logical block size (%u) greater than page size "
    +	       "(%lu) falling back to device logical block size (%u)",

There are two things to fix up,
- The kernel message print should be in a single line.
- pr_info() won't automatically add new line since v5.8, a '\n' should
  be added.

This patch just does the above cleanup in bcache_device_init().

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Coly Li
ee4a36f414 bcache: use delayed kworker fo asynchronous devices registration
This patch changes the asynchronous registration kworker to a delayed
kworker. There is probability queue_work() queues the async registration
kworker to the same CPU (even though very little), then the process
which writing sysfs interface to reigster bcache device may won't return
immeidately. queue_delayed_work() in this patch will delay 10 jiffies
before insert the kworker to run queue, which makes sure the registering
process may always returns to user space in time.

Fixes: 9e23ccf8f0 ("bcache: asynchronous devices registration")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Mauricio Faria de Oliveira
dcacbc1242 bcache: check and adjust logical block size for backing devices
It's possible for a block driver to set logical block size to
a value greater than page size incorrectly; e.g. bcache takes
the value from the superblock, set by the user w/ make-bcache.

This causes a BUG/NULL pointer dereference in the path:

  __blkdev_get()
  -> set_init_blocksize() // set i_blkbits based on ...
     -> bdev_logical_block_size()
        -> queue_logical_block_size() // ... this value
  -> bdev_disk_changed()
     ...
     -> blkdev_readpage()
        -> block_read_full_page()
           -> create_page_buffers() // size = 1 << i_blkbits
              -> create_empty_buffers() // give size/take pointer
                 -> alloc_page_buffers() // return NULL
                 .. BUG!

Because alloc_page_buffers() is called with size > PAGE_SIZE,
thus it initializes head = NULL, skips the loop, return head;
then create_empty_buffers() gets (and uses) the NULL pointer.

This has been around longer than commit ad6bf88a6c ("block:
fix an integer overflow in logical block size"); however, it
increased the range of values that can trigger the issue.

Previously only 8k/16k/32k (on x86/4k page size) would do it,
as greater values overflow unsigned short to zero, and queue_
logical_block_size() would then use the default of 512.

Now the range with unsigned int is much larger, and users w/
the 512k value, which happened to be zero'ed previously and
work fine, started to hit this issue -- as the zero is gone,
and queue_logical_block_size() does return 512k (>PAGE_SIZE.)

Fix this by checking the bcache device's logical block size,
and if it's greater than page size, fallback to the backing/
cached device's logical page size.

This doesn't affect cache devices as those are still checked
for block/page size in read_super(); only the backing/cached
devices are not.

Apparently it's a regression from commit 2903381fce ("bcache:
Take data offset from the bdev superblock."), moving the check
into BCACHE_SB_VERSION_CDEV only. Now that we have superblocks
of backing devices out there with this larger value, we cannot
refuse to load them (i.e., have a similar check in _BDEV.)

Ideally perhaps bcache should use all values from the backing
device (physical/logical/io_min block size)? But for now just
fix the problematic case.

Test-case:

    # IMG=/root/disk.img
    # dd if=/dev/zero of=$IMG bs=1 count=0 seek=1G
    # DEV=$(losetup --find --show $IMG)
    # make-bcache --bdev $DEV --block 8k
      < see dmesg >

Before:

    # uname -r
    5.7.0-rc7

    [   55.944046] BUG: kernel NULL pointer dereference, address: 0000000000000000
    ...
    [   55.949742] CPU: 3 PID: 610 Comm: bcache-register Not tainted 5.7.0-rc7 #4
    ...
    [   55.952281] RIP: 0010:create_empty_buffers+0x1a/0x100
    ...
    [   55.966434] Call Trace:
    [   55.967021]  create_page_buffers+0x48/0x50
    [   55.967834]  block_read_full_page+0x49/0x380
    [   55.972181]  do_read_cache_page+0x494/0x610
    [   55.974780]  read_part_sector+0x2d/0xaa
    [   55.975558]  read_lba+0x10e/0x1e0
    [   55.977904]  efi_partition+0x120/0x5a6
    [   55.980227]  blk_add_partitions+0x161/0x390
    [   55.982177]  bdev_disk_changed+0x61/0xd0
    [   55.982961]  __blkdev_get+0x350/0x490
    [   55.983715]  __device_add_disk+0x318/0x480
    [   55.984539]  bch_cached_dev_run+0xc5/0x270
    [   55.986010]  register_bcache.cold+0x122/0x179
    [   55.987628]  kernfs_fop_write+0xbc/0x1a0
    [   55.988416]  vfs_write+0xb1/0x1a0
    [   55.989134]  ksys_write+0x5a/0xd0
    [   55.989825]  do_syscall_64+0x43/0x140
    [   55.990563]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [   55.991519] RIP: 0033:0x7f7d60ba3154
    ...

After:

    # uname -r
    5.7.0.bcachelbspgsz

    [   31.672460] bcache: bcache_device_init() bcache0: sb/logical block size (8192) greater than page size (4096) falling back to device logical block size (512)
    [   31.675133] bcache: register_bdev() registered backing device loop0

    # grep ^ /sys/block/bcache0/queue/*_block_size
    /sys/block/bcache0/queue/logical_block_size:512
    /sys/block/bcache0/queue/physical_block_size:8192

Reported-by: Ryan Finnie <ryan@finnie.org>
Reported-by: Sebastian Marsching <sebastian@marsching.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Zhiqiang Liu
be23e83733 bcache: fix potential deadlock problem in btree_gc_coalesce
coccicheck reports:
  drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

In btree_gc_coalesce func, if the coalescing process fails, we will goto
to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
Then, it will cause a deadlock when trying to acquire new_nodes[i]->
write_lock for freeing new_nodes[i] before return.

btree_gc_coalesce func details as follows:
	if alloc new_nodes[i] fails:
		goto out_nocoalesce;
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock)
	// main coalescing process
	for (i = nodes - 1; i > 0; --i)
		[snipped]
		if coalescing process fails:
			// Here, directly goto out_nocoalesce
			 // tag will cause a deadlock
			goto out_nocoalesce;
		[snipped]
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock)
	// coalesing succ, return
	return;
out_nocoalesce:
	btree_node_free(new_nodes[i])	// free new_nodes[i]
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock);
	// set flag for reuse
	clear_bit(BTREE_NODE_dirty, &ew_nodes[i]->flags);
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock);

To fix the problem, we add a new tag 'out_unlock_nocoalesce' for
releasing new_nodes[i]->write_lock before out_nocoalesce tag. If
coalescing process fails, we will go to out_unlock_nocoalesce tag
for releasing new_nodes[i]->write_lock before free new_nodes[i] in
out_nocoalesce tag.

(Coly Li helps to clean up commit log format.)

Fixes: 2a285686c1 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Masahiro Yamada
a7f7f6248d treewide: replace '---help---' in Kconfig files with 'help'
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following commend:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-14 01:57:21 +09:00
Linus Torvalds
b25c6644bf - Largest change for this cycle is the DM zoned target's metadata
version 2 feature that adds support for pairing regular block
   devices with a zoned device to ease performance impact associated
   with finite random zones of zoned device.  Changes came in 3
   batches: first prepared for and then added the ability to pair a
   single regular block device, second was a batch of fixes to improve
   zoned's reclaim heuristic, third removed the limitation of only
   adding a single additional regular block device to allow many
   devices.  Testing has shown linear scaling as more devices are
   added.
 
 - Add new emulated block size (ebs) target that emulates a smaller
   logical_block_size than a block device supports.  Primary use-case
   is to emulate "512e" devices that have 512 byte logical_block_size
   and 4KB physical_block_size.  This is useful to some legacy
   applications otherwise wouldn't be ablee to be used on 4K devices
   because they depend on issuing IO in 512 byte granularity.
 
 - Add discard interfaces to DM bufio.  First consumer of the interface
   is the dm-ebs target that makes heavy use of dm-bufio.
 
 - Fix DM crypt's block queue_limits stacking to not truncate
   logic_block_size.
 
 - Add Documentation for DM integrity's status line.
 
 - Switch DMDEBUG from a compile time config option to instead use
   dynamic debug via pr_debug.
 
 - Fix DM multipath target's hueristic for how it manages
   "queue_if_no_path" state internally.  DM multipath now avoids
   disabling "queue_if_no_path" unless it is actually needed (e.g. in
   response to configure timeout or explicit "fail_if_no_path"
   message).  This fixes reports of spurious -EIO being reported back
   to userspace application during fault tolerance testing with an NVMe
   backend.  Added various dynamic DMDEBUG messages to assist with
   debugging queue_if_no_path in the future.
 
 - Add a new DM multipath "Historical Service Time" Path Selector.
 
 - Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.
 
 - Improve DM writecache target performance by using explicit
   cache flushing for target's single-threaded usecase and a small
   cleanup to remove unnecessary test in persistent_memory_claim.
 
 - Other small cleanups in DM core, dm-persistent-data, and DM integrity.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl7alrgTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWl42B/9sBd+j60emy4Bliu/f3pd7SEkFrSXv
 K2jXicRFx4E5kO0aLK+fX65cOiq2vvLsDh8c++0TLXcD9q7oK0qxK9c8TCPq29Cx
 W2J2dwdjyyqqbr3/FZHYYM9KLOl5rsxJLygXwhJhQ2Gny44L7nVACrAXNzXIHJ3r
 f8xr+GLdF/jz7WLj8bwEDo3Bf8wvxDvl2ijqj7EceOhTutNE8xHQ6UcTPqTtozJy
 sNM8UQNk1L43DBAvXfrKZ+yQ5DYAKdXKJpV9C8qv5DEGbaEikuMrHddgO4KlDdp4
 VjPk9GSfPwGcJ4ecN8vgecZVGvh52ZU7OZ8qey/q0zqps74jeHTZQQyM
 =jRun
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - The largest change for this cycle is the DM zoned target's metadata
   version 2 feature that adds support for pairing regular block devices
   with a zoned device to ease the performance impact associated with
   finite random zones of zoned device.

   The changes came in three batches: the first prepared for and then
   added the ability to pair a single regular block device, the second
   was a batch of fixes to improve zoned's reclaim heuristic, and the
   third removed the limitation of only adding a single additional
   regular block device to allow many devices.

   Testing has shown linear scaling as more devices are added.

 - Add new emulated block size (ebs) target that emulates a smaller
   logical_block_size than a block device supports

   The primary use-case is to emulate "512e" devices that have 512 byte
   logical_block_size and 4KB physical_block_size. This is useful to
   some legacy applications that otherwise wouldn't be able to be used
   on 4K devices because they depend on issuing IO in 512 byte
   granularity.

 - Add discard interfaces to DM bufio. First consumer of the interface
   is the dm-ebs target that makes heavy use of dm-bufio.

 - Fix DM crypt's block queue_limits stacking to not truncate
   logic_block_size.

 - Add Documentation for DM integrity's status line.

 - Switch DMDEBUG from a compile time config option to instead use
   dynamic debug via pr_debug.

 - Fix DM multipath target's hueristic for how it manages
   "queue_if_no_path" state internally.

   DM multipath now avoids disabling "queue_if_no_path" unless it is
   actually needed (e.g. in response to configure timeout or explicit
   "fail_if_no_path" message).

   This fixes reports of spurious -EIO being reported back to userspace
   application during fault tolerance testing with an NVMe backend.
   Added various dynamic DMDEBUG messages to assist with debugging
   queue_if_no_path in the future.

 - Add a new DM multipath "Historical Service Time" Path Selector.

 - Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.

 - Improve DM writecache target performance by using explicit cache
   flushing for target's single-threaded usecase and a small cleanup to
   remove unnecessary test in persistent_memory_claim.

 - Other small cleanups in DM core, dm-persistent-data, and DM
   integrity.

* tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
  dm crypt: avoid truncating the logical block size
  dm mpath: add DM device name to Failing/Reinstating path log messages
  dm mpath: enhance queue_if_no_path debugging
  dm mpath: restrict queue_if_no_path state machine
  dm mpath: simplify __must_push_back
  dm zoned: check superblock location
  dm zoned: prefer full zones for reclaim
  dm zoned: select reclaim zone based on device index
  dm zoned: allocate zone by device index
  dm zoned: support arbitrary number of devices
  dm zoned: move random and sequential zones into struct dmz_dev
  dm zoned: per-device reclaim
  dm zoned: add metadata pointer to struct dmz_dev
  dm zoned: add device pointer to struct dm_zone
  dm zoned: allocate temporary superblock for tertiary devices
  dm zoned: convert to xarray
  dm zoned: add a 'reserved' zone flag
  dm zoned: improve logging messages for reclaim
  dm zoned: avoid unnecessary device recalulation for secondary superblock
  dm zoned: add debugging message for reading superblocks
  ...
2020-06-05 15:45:03 -07:00
Eric Biggers
64611a15ca dm crypt: avoid truncating the logical block size
queue_limits::logical_block_size got changed from unsigned short to
unsigned int, but it was forgotten to update crypt_io_hints() to use the
new type.  Fix it.

Fixes: ad6bf88a6c ("block: fix an integer overflow in logical block size")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:59 -04:00
Mike Snitzer
04867370ec dm mpath: add DM device name to Failing/Reinstating path log messages
When there are many DM multipath devices it really helps to have
additional context for which DM device a failed or reinstated path is
part of.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:58 -04:00
Mike Snitzer
4c3f48380f dm mpath: enhance queue_if_no_path debugging
Add more DMDEBUG that shows arguments passed and caller, and another
that shows state of related flags at end of queue_if_no_path().

Also add queue_if_no_path DMDEBUG to multipath_resume().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:57 -04:00
Mike Snitzer
553ec94cb4 dm mpath: restrict queue_if_no_path state machine
Do not allow saving disabled queue_if_no_path if already saved as
enabled; implies multiple suspends (which shouldn't ever happen).  Log
if this unlikely scenario is ever triggered.

Also, only write MPATHF_SAVED_QUEUE_IF_NO_PATH during presuspend or if
"fail_if_no_path" message.  MPATHF_SAVED_QUEUE_IF_NO_PATH is no longer
always modified, e.g.: even if queue_if_no_path()'s save_old_value
argument wasn't set.  This just implies a bit tighter control over
the management of MPATHF_SAVED_QUEUE_IF_NO_PATH.  Side-effect is
multipath_resume() doesn't reset MPATHF_QUEUE_IF_NO_PATH unless
MPATHF_SAVED_QUEUE_IF_NO_PATH was set (during presuspend); and at that
time the MPATHF_SAVED_QUEUE_IF_NO_PATH bit gets cleared.  So
MPATHF_SAVED_QUEUE_IF_NO_PATH's use is much more narrow in scope.

Last, but not least, do _not_ disable queue_if_no_path during noflush
suspend.  There is no need/benefit to saving off queue_if_no_path via
MPATHF_SAVED_QUEUE_IF_NO_PATH and clearing MPATHF_QUEUE_IF_NO_PATH for
noflush suspend -- by avoiding this needless queue_if_no_path flag
churn there is less potential for MPATHF_QUEUE_IF_NO_PATH to get lost.
Which avoids potential for IOs to be errored back up to userspace
during DM multipath's handling of path failures.

That said, this last change papers over a reported issue concerning
request-based dm-multipath's interaction with blk-mq, relative to
suspend and resume: multipath_endio is being called _before_
multipath_resume.  This should never happen if DM suspend's
blk_mq_quiesce_queue() + dm_wait_for_completion() is genuinely waiting
for all inflight blk-mq requests to complete.  Similarly:
drivers/md/dm.c:__dm_resume() clearly calls dm_table_resume_targets()
_before_ dm_start_queue()'s blk_mq_unquiesce_queue() is called.  If
the queue isn't even restarted until after multipath_resume(); the BIG
question that still needs answering is: how can multipath_end_io beat
multipath_resume in a race!?

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:56 -04:00
Mike Snitzer
a862e4e215 dm mpath: simplify __must_push_back
Remove micro-optimization that infers device is between presuspend and
resume (was done purely to avoid call to dm_noflush_suspending, which
isn't expensive anyway).

Remove flags argument since they are no longer checked.

And remove must_push_back_bio() since it was simply a call to
__must_push_back().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:55 -04:00
Hannes Reinecke
27d49ac1dd dm zoned: check superblock location
When specifying several devices the superblock location must be
checked to ensure the devices are specified in the correct order.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:54 -04:00
Hannes Reinecke
2094045fe5 dm zoned: prefer full zones for reclaim
Prefer full zones when selecting the next zone for reclaim.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:54 -04:00
Hannes Reinecke
69875d443b dm zoned: select reclaim zone based on device index
per-device reclaim should select zones on that device only.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:53 -04:00
Hannes Reinecke
22c1ef66c4 dm zoned: allocate zone by device index
When allocating a zone, pass in an indicator on which device the zone
should be allocated; this increases performance for a multi-device
setup because reclaim will now allocate zones on the device for which
reclaim is running.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:52 -04:00
Hannes Reinecke
4dba12881f dm zoned: support arbitrary number of devices
Remove the hard-coded limit of two devices and support an unlimited
number of additional zoned devices.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:51 -04:00
Hannes Reinecke
bd82fdabf1 dm zoned: move random and sequential zones into struct dmz_dev
Random and sequential zones should be part of the respective
device structure to make arbitration between devices possible.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:50 -04:00
Hannes Reinecke
f97809aec5 dm zoned: per-device reclaim
Instead of having one reclaim workqueue for the entire set we should
be allocating a reclaim workqueue per device; doing so will reduce
contention and should boost performance for a multi-device setup.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:50 -04:00
Hannes Reinecke
18979819b5 dm zoned: add metadata pointer to struct dmz_dev
Add a metadata pointer within struct dmz_dev and use it as argument
for blkdev_report_zones() instead of the metadata itself.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:49 -04:00
Hannes Reinecke
8f22272af7 dm zoned: add device pointer to struct dm_zone
Add a pointer, to the containing device, within struct dm_zone and
kill dmz_zone_to_dev().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:48 -04:00
Hannes Reinecke
5d2c74f3dd dm zoned: allocate temporary superblock for tertiary devices
Checking the tertiary superblock just consists of validating UUIDs,
crcs, and the generation number; it doesn't have contents which would
be required during the actual operation.

So allocate a temporary superblock when checking tertiary devices to
avoid having to store it together with the 'real' superblocks.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:47 -04:00
Hannes Reinecke
a92fbc446d dm zoned: convert to xarray
The zones array is getting really large, and large arrays tend to
wreak havoc with the CPU caches.  So convert it to xarray to become
more cache friendly.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix leak in dmz_insert
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:46 -04:00
Hannes Reinecke
aec67b4ffa dm zoned: add a 'reserved' zone flag
Instead of counting the number of reserved zones in dmz_free_zone(),
mark the zone as 'reserved' during allocation and simplify
dmz_free_zone().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:46 -04:00
Hannes Reinecke
c3ff479dde dm zoned: improve logging messages for reclaim
Instead of just reporting the errno, add some more verbose debugging
message in the reclaim path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:45 -04:00
Hannes Reinecke
1565929b87 dm zoned: avoid unnecessary device recalulation for secondary superblock
The secondary superblock must reside on the same device as the primary
superblock, so there is no need to re-calculate the device.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:44 -04:00
Hannes Reinecke
35d0c96e42 dm zoned: add debugging message for reading superblocks
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:43 -04:00