full blk-mq support to request-based DM.
- disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
- depends on some blk-mq changes from Jens' for-4.1/core branch so
that explains why this pull is built on linux-block.git
- Update DM to use name_to_dev_t() rather than open-coding a less
capable device parser.
- includes a couple small improvements to name_to_dev_t() that offer
stricter constraints that DM's code provided.
- Improvements to the dm-cache "mq" cache replacement policy.
- A DM crypt crypt_ctr() error path fix and an async crypto deadlock fix.
- A small efficiency improvement for DM crypt decryption by leveraging
immutable biovecs.
- Add error handling modes for corrupted blocks to DM verity.
- A new "log-writes" DM target from Josef Bacik that is meant for
file system developers to test file system integrity at particular
points in the life of a file system.
- A few DM log userspace cleanups and fixes.
- A few Documentation fixes (for thin, cache, crypt and switch).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVMGvhAAoJEMUj8QotnQNaOKgH/iuWyEZ+ndWbeEBZs9ZJPEgU
sRiKjSaPujkdB+78o+2DfAYjwCchqH1Oy781PqG85asTENVkcjD7LBPaIom/A2ml
Z9ZeCzAKbrtB3lZR8QAg4u7+90kkuZv1SeOPMWBZCEV4GYLEpV1J4/f+RgdEs1kT
upQcH1fFoQdyHGikwAKXf85Gmi4OrjVb19yoxu2xZnfwT2sCPq0okU4yBxb1B9mL
X/FcHUeLS9LJGewEqTD3AvWrk+J5pqj+1EeKY2kRxisp1065TaljYwetgR76Vnfb
o0J5eovJVL3AKRTYxvr5JABWD09xq9nG1hVBiuQMN5VmmvxjDKdbBPaKI4dZTt0=
=uVjp
-----END PGP SIGNATURE-----
Merge tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- the most extensive changes this cycle are the DM core improvements to
add full blk-mq support to request-based DM.
- disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
- depends on some blk-mq changes from Jens' for-4.1/core branch so
that explains why this pull is built on linux-block.git
- update DM to use name_to_dev_t() rather than open-coding a less
capable device parser.
- includes a couple small improvements to name_to_dev_t() that offer
stricter constraints that DM's code provided.
- improvements to the dm-cache "mq" cache replacement policy.
- a DM crypt crypt_ctr() error path fix and an async crypto deadlock
fix
- a small efficiency improvement for DM crypt decryption by leveraging
immutable biovecs
- add error handling modes for corrupted blocks to DM verity
- a new "log-writes" DM target from Josef Bacik that is meant for file
system developers to test file system integrity at particular points
in the life of a file system
- a few DM log userspace cleanups and fixes
- a few Documentation fixes (for thin, cache, crypt and switch)
* tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (34 commits)
dm crypt: fix missing error code return from crypt_ctr error path
dm crypt: fix deadlock when async crypto algorithm returns -EBUSY
dm crypt: leverage immutable biovecs when decrypting on read
dm crypt: update URLs to new cryptsetup project page
dm: add log writes target
dm table: use bool function return values of true/false not 1/0
dm verity: add error handling modes for corrupted blocks
dm thin: remove stale 'trim' message documentation
dm delay: use msecs_to_jiffies for time conversion
dm log userspace base: fix compile warning
dm log userspace transfer: match wait_for_completion_timeout return type
dm table: fall back to getting device using name_to_dev_t()
init: stricter checking of major:minor root= values
init: export name_to_dev_t and mark name argument as const
dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr
dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq
dm: add full blk-mq support to request-based DM
dm: impose configurable deadline for dm_request_fn's merge heuristic
dm sysfs: introduce ability to add writable attributes
dm: don't start current request if it would've merged with the previous
...
I suspect this doesn't show up for most anyone because software
algorithms typically don't have a sense of being too busy. However,
when working with the Freescale CAAM driver it will return -EBUSY on
occasion under heavy -- which resulted in dm-crypt deadlock.
After checking the logic in some other drivers, the scheme for
crypt_convert() and it's callback, kcryptd_async_done(), were not
correctly laid out to properly handle -EBUSY or -EINPROGRESS.
Fix this by using the completion for both -EBUSY and -EINPROGRESS. Now
crypt_convert()'s use of completion is comparable to
af_alg_wait_for_completion(). Similarly, kcryptd_async_done() follows
the pattern used in af_alg_complete().
Before this fix dm-crypt would lockup within 1-2 minutes running with
the CAAM driver. Fix was regression tested against software algorithms
on PPC32 and x86_64, and things seem perfectly happy there as well.
Signed-off-by: Ben Collins <ben.c@servergy.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Commit 003b5c571 ("block: Convert drivers to immutable biovecs")
stopped short of changing dm-crypt to leverage the fact that the biovec
array of a bio will no longer be modified.
Switch to using bio_clone_fast() when cloning bios for decryption after
read.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cryptsetup home page moved to GitLab.
Also remove link to abandonded Truecrypt page.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system. We capture
all write requests and associated data and log them to a separate device
for later replay. There is a userspace utility to do this replay. The
idea behind this is to give file system developers a tool to verify that
the file system is always consistent.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add device specific modes to dm-verity to specify how corrupted
blocks should be handled. The following modes are defined:
- DM_VERITY_MODE_EIO is the default behavior, where reading a
corrupted block results in -EIO.
- DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does
not block the read.
- DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted
block is discovered.
In addition, each mode sends a uevent to notify userspace of
corruption and to allow further recovery actions.
The driver defaults to previous behavior (DM_VERITY_MODE_EIO)
and other modes can be enabled with an additional parameter to
the verity table.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Converting milliseconds to jiffies by "val * HZ / 1000" is technically
OK but msecs_to_jiffies(val) is the cleaner solution and handles all
corner cases correctly.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This fixes up a compile warning [-Wunused-but-set-variable] - given the
comment in userspace_set_region_sync() the non-reporting of errors is
intentional so the return value can be dropped to make gcc happy.
Also, fix typo in comment.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Return type of wait_for_completion_timeout() is unsigned long not int.
An appropriately named unsigned long is added and the assignment fixed.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a device is used as the root filesystem, it can't be built
off of devices which are within the root filesystem (just like
command line arguments to root=). For this reason, Linux has a
pseudo-filesystem for root= and MD initialization (based on the
function name_to_dev_t) which handles different ways of specifying
devices including PARTUUID and major:minor.
Switch to using name_to_dev_t() in dm_get_device(). Rather than
having DM assume that all things which are not major:minor are paths in
an already-mounted filesystem, change dm_get_device() to first attempt
to look up the device in the filesystem, and if not found it will fall
back to using name_to_dev_t().
In terms of backwards compatibility, there are some cases where
behavior will be different:
- If you have a file in the current working directory named 1:2 and
you initialze DM there, then it will try to use that file rather
than the disk with that major:minor pair as a backing device.
- Similarly for other bdev types which name_to_dev_t() knows how to
interpret, the previous behavior was to repeatedly check for the
existence of the file (e.g., while waiting for rootfs to come up)
but the new behavior is to use the name_to_dev_t() interpretation.
For example, if you have a file named /dev/ubiblock0_0 which is
a symlink to /dev/sda3, but it is not yet present when DM starts
to initialize, then the name_to_dev_t() interpretation will take
precedence.
These incompatibilities would only show up in really strange setups
with bad practices so we shouldn't have to worry about them.
Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Request-based DM's blk-mq support defaults to off; but a user can easily
change the default using the dm_mod.use_blk_mq module/boot option.
Also, you can check what mode a given request-based DM device is using
with: cat /sys/block/dm-X/dm/use_blk_mq
This change enabled further cleanup and reduced work (e.g. the
md->io_pool and md->rq_pool isn't created if using blk-mq).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_mq_queue_rq() is in atomic context so care must be taken to not
sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations
and dm-mpath's call to blk_get_request(). In the future the bioset
allocations will hopefully go away (by removing support for partial
completions of bios in a cloned request).
Also prepare for supporting DM blk-mq ontop of old-style request_fn
device(s) if a new dm-mod 'use_blk_mq' parameter is set. The kthread
will still be used to queue work if blk-mq is used ontop of old-style
request_fn device(s).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit e5863d9ad ("dm: allocate requests in target when stacking on
blk-mq devices") served as the first step toward fully utilizing blk-mq
in request-based DM -- it enabled stacking an old-style (request_fn)
request_queue ontop of the underlying blk-mq device(s). That first step
didn't improve performance of DM multipath ontop of fast blk-mq devices
(e.g. NVMe) because the top-level old-style request_queue was severely
limited by the queue_lock.
The second step offered here enables stacking a blk-mq request_queue
ontop of the underlying blk-mq device(s). This unlocks significant
performance gains on fast blk-mq devices, Keith Busch tested on his NVMe
testbed and offered this really positive news:
"Just providing a performance update. All my fio tests are getting
roughly equal performance whether accessed through the raw block
device or the multipath device mapper (~470k IOPS). I could only push
~20% of the raw iops through dm before this conversion, so this latest
tree is looking really solid from a performance standpoint."
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Keith Busch <keith.busch@intel.com>
Otherwise, for sequential workloads, the dm_request_fn can allow
excessive request merging at the expense of increased service time.
Add a per-device sysfs attribute to allow the user to control how long a
request, that is a reasonable merge candidate, can be queued on the
request queue. The resolution of this request dispatch deadline is in
microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline:
echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline
The dm_request_fn's merge heuristic and associated extra accounting is
disabled by default (rq_based_seq_io_merge_deadline is 0).
This sysfs attribute is not applicable to bio-based DM devices so it
will only ever report 0 for them.
By allowing a request to remain on the queue it will block others
requests on the queue. But introducing a short dequeue delay has proven
very effective at enabling certain sequential IO workloads on really
fast, yet IOPS constrained, devices to build up slightly larger IOs --
yielding 90+% throughput improvements. Having precise control over the
time taken to wait for larger requests to build affords control beyond
that of waiting for certain IO sizes to accumulate (which would require
a deadline anyway). This knob will only ever make sense with sequential
IO workloads and the particular value used is storage configuration
specific.
Given the expected niche use-case for when this knob is useful it has
been deemed acceptable to expose this relatively crude method for
crafting optimal IO on specific storage -- especially given the solution
is simple yet effective. In the context of DM multipath, it is
advisable to tune this sysfs attribute to a value that offers the best
performance for the common case (e.g. if 4 paths are expected active,
tune for that; if paths fail then performance may be slightly reduced).
Alternatives were explored to have request-based DM autotune this value
(e.g. if/when paths fail) but they were quickly deemed too fragile and
complex to warrant further design and development time. If this problem
proves more common as faster storage emerges we'll have to look at
elevating a generic solution into the block core.
Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Request-based DM's dm_request_fn() is so fast to pull requests off the
queue that steps need to be taken to promote merging by avoiding request
processing if it makes sense.
If the current request would've merged with previous request let the
current request stay on the queue longer.
Suggested-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify
DM's use of a 100ms delay; such an extended delay is a liability when
there is reason to re-kick the queue.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
On really fast storage it can be beneficial to delay running the
request_queue to allow the elevator more opportunity to merge requests.
Otherwise, it has been observed that requests are being sent to
q->request_fn much quicker than is ideal on IOPS-bound backends.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The old dm_request() method used for q->make_request_fn had a branch for
request-based DM support but it isn't needed given that
dm_init_request_based_queue() sets it to the standard blk_queue_bio()
anyway.
Cleanup dm_init_md_queue() to be DM device-type agnostic and have
dm_setup_md_queue() properly finish queue setup based on DM device-type
(bio-based vs request-based).
A followup block patch can be made to remove the export for
blk_queue_bio() now that DM no longer calls it directly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since commit 20d0189b10
in v3.14-rc1 RAID0 has performed incorrect calculations
when the chunksize is not a power of 2.
This happens because "sector_div()" modifies its first argument, but
this wasn't taken into account in the patch.
So restore that first arg before re-using the variable.
Reported-by: Joe Landman <joe.landman@gmail.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 20d0189b10
Cc: stable@vger.kernel.org (3.14 and later).
Signed-off-by: NeilBrown <neilb@suse.de>
Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00
"
The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the
generic_start_io_acct to account the disk stats rather than the open code,
but it also introduced the increase to .in_flight[rw] which is needless to
md. So we re-use the open code here to fix it.
Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
DM multipath is the only caller of blk_lld_busy() -- which calls a
queue's lld_busy_fn hook. Request-based DM doesn't support stacking
multipath devices so there is no reason to register the lld_busy_fn hook
on a multipath device's queue using blk_queue_lld_busy().
As such, remove functions dm_lld_busy and dm_table_any_busy_target.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
__dm_get_module_param() could be useful for future DM module parameters
besides those related to "reserved_ios".
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Writeback takes out a lock on the cache block, so will increase the
latency for any concurrent io.
This patch works by placing 2 sentinel objects on each level of the
multiqueues. Every WRITEBACK_PERIOD the oldest sentinel gets moved to
the newest end of the queue level.
When looking for writeback work:
if less than 25% of the cache is clean:
we select the oldest object with the lowest hit count
otherwise:
we select the oldest object that is not past a writeback sentinel.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A sentinel object is placed on each level of the multiqueues. When an
object is hit it is requeued behind the sentinel. When the tick is
incremented we iterate through all objects behind the sentinel and
update the hit_count, then reposition the sentinel at the very back.
This saves memory by avoiding tracking the tick explicitly for every
struct entry object in the multiqueues.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
queue_shift_down() didn't adjust the hit_counts to the new levels, so it
just had the effect of scrambling levels.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Small optimisation, now queue_empty() doesn't need to walk all levels of
the multiqueue.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use a single slab cache to allocate a mempool for each dirty-log.
This _should_ eliminate DM's need for io_schedule_timeout() in
mempool_alloc(); so io_schedule() should be sufficient now.
Also, rename struct flush_entry to dm_dirty_log_flush_entry to allow
KMEM_CACHE() to create a meaningful global name for the slab cache.
Also, eliminate some holes in struct log_c by rearranging members.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
exposed by the bdi changes that were introduced during the 4.0 merge.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVExTfAAoJEMUj8QotnQNaMMAIAOT0IjDILfxuP/m6DLl+Qq89
SWrKWJrTDc4j8yjaehELH5Vm5Y8W2qlxQAmwritOEm+53HdODSv9I8WpG/6eCEkt
sx+zwQ0wRF0TK6wrqtXe44WU2Zgi9KTj31Y5nTuOAcEqR7S4JHYmbvuzFv/kSO7T
GusA+A1HAp0klGgqxT/mzjz61J/frOCDIXWGFwGPapD+6ZLcHfnCz+Ss4CWhdzvo
xCMXkeoCgvT5dGj5HQQ3RbucaeUIIeYsja2IT2h572GRiUkau+7qtg/YyBvdMnw+
mEcdMcIg+Higc9z3ZSFu/U/QxGwRF/AycXp4zziuiW8KHfRa2BA7+iyQ0odUHlA=
=Ikkj
-----END PGP SIGNATURE-----
Merge tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mike Snitzer:
"Fix DM core device cleanup regression -- due to a latent race that was
exposed by the bdi changes that were introduced during the 4.0 merge"
* tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix add_disk() NULL pointer due to race with free_dev()
Commit c4db59d31e ("fs: don't reassign dirty inodes to
default_backing_dev_info") exposed DM to a latent race in free_dev() vs
add_disk() in relation to management of the device's minor number.
Fix this by refactoring free_dev() to match cleanup order of the
alloc_dev() error path. Move cleanup of the gendisk, queue, and bdev
to _before_ the cleanup of the idr managed minor number.
Also, purely due to cleanup that fell out during the free_dev() audit:
- adjust dm_blk_close() to access the gendisk's private_data under
the _minor_lock spinlock.
- move __dm_destroy()'s dm_get_live_table() call out from under the
_minor_lock spinlock.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1202449
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Regression in recent patch causes crash on error path.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVQyiMDnsnt1WYoG5AQKV4A//SwSZQmnbSTDZ1H5DyERXBkNNKkyFkZCL
9jwl60HaVA5U0N/fP97ku/757QtODYjjHLzlkYqgiiqMBGPFFIKa78D6Su0Z8uJK
FtMJQaalhsWOCXHUz2R6DPHR1e9x2mlrJgqvnIbnbFXym1PZJ/fo3ccsLX3qqmNJ
F4oFYpy79+yMCUSTn8Sv9czzSE19k5XC/Hz14drWi5ox9zpZmpFbMbQdNaKFa2Uy
kphgEdvFNF9C2X+rJeeY91ofz/keSsZVL8HqxLThjxKXsZ+krpywsXa2nX7W00ya
zt49VK78PL3v2eIAgf1XaBYO3oyjyzEaVuAkFOktv/LuLKAbFbup2Yi5JtHzieor
NoaV9hfgCq1OqxkY8zI8hLcL6HS2vvk8vUXbb2Xqa6X5w6xuyjRMzzAyxqXuG4qm
16jAU2RvITfdwAeNacyzq/xAWBg+jUgeg+lO/Z71O0qS7chGUNzve+hwQOtoeHzP
fmEetiZa+gbOQEBZT3Ubsw4L9CqCaYVx7KOiE8Py5nbUVPMz2nF6jjt6WSPOgyWv
JrflLIoEE5l6QffnEzcJb3OR920TKZx17/87KvBwct9XH3R0+UgJLFggHrYQAiUA
54CnXVPl0MPq5vtROviX2ZWMThOk1s5HAsMhCBuTdqghtQQdNP7DWmGfg7ys5uUq
152cPd+xGw4=
=PM5i
-----END PGP SIGNATURE-----
Merge tag 'md/4.0-rc4-fix' of git://neil.brown.name/md
Pull bugfix for md from Neil Brown:
"One fix for md in 4.0-rc4
Regression in recent patch causes crash on error path"
* tag 'md/4.0-rc4-fix' of git://neil.brown.name/md:
md: fix problems with freeing private data after ->run failure.
- fix thin target to always zero-fill reads to unprovisioned blocks
- fix to interlock device destruction's suspend from internal suspends
- fix 2 snapshot exception store handover bugs
- fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVDIJ8AAoJEMUj8QotnQNaynEIAK4Q3mvOeI6PIwuC2iOWRglJ
C0xzTKMR6VDQeM/uMRLeiiqxdPs6b89OdSmJHKJYOrdsusjG0a4IBAO9vDayb6sW
AxOlbnpXdoiH3cqQ4dISJIfS9qM49qGM40CAflcLKYsGfLnstVLt+4l9HaWaRrHj
b2kAC+I6PDDybFlKD5KsMB56sjzhhqg/lnnwkY2omV4vLHf6RrhVhK5gcEPF7VN0
SJ5HKSNBHQnpYSEECHXkxGvZ/ZKTe9n7q0t6Q3nnTSUFTXS6esgTsK1q2AykADDR
3W1LkrAAFP31AWKPkUwCEe3C8Rk3rCsrq2H14woWX1/UxSXNPlMYCU50ak4TVmc=
=Svyk
-----END PGP SIGNATURE-----
Merge tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull devicemapper fixes from Mike Snitzer:
"A handful of stable fixes for DM:
- fix thin target to always zero-fill reads to unprovisioned blocks
- fix to interlock device destruction's suspend from internal
suspends
- fix 2 snapshot exception store handover bugs
- fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing"
* tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME
dm snapshot: suspend merging snapshot when doing exception handover
dm snapshot: suspend origin when doing exception handover
dm: hold suspend_lock while suspending device during device deletion
dm thin: fix to consistently zero-fill reads to unprovisioned blocks
If ->run() fails, it can either free the data structures it
allocated, or leave that task to ->free() which will be called
on failures.
However:
md.c calls ->free() even if ->private_data is NULL, which
causes problems in some personalities.
raid0.c frees the data, but doesn't clear ->private_data,
which will become a problem when we fix md.c
So better fix both these issues at once.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 5aa61f427e
URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381
Signed-off-by: NeilBrown <neilb@suse.de>
Since it's possible for the discard and write same queue limits to
change while the upper level command is being sliced and diced, fix up
both of them (a) to reject IO if the special command is unsupported at
the start of the function and (b) read the limits once and let the
commands error out on their own if the status happens to change.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The "dm snapshot: suspend origin when doing exception handover" commit
fixed a exception store handover bug associated with pending exceptions
to the "snapshot-origin" target.
However, a similar problem exists in snapshot merging. When snapshot
merging is in progress, we use the target "snapshot-merge" instead of
"snapshot-origin". Consequently, during exception store handover, we
must find the snapshot-merge target and suspend its associated
mapped_device.
To avoid lockdep warnings, the target must be suspended and resumed
without holding _origins_lock.
Introduce a dm_hold() function that grabs a reference on a
mapped_device, but unlike dm_get(), it doesn't crash if the device has
the DMF_FREEING flag set, it returns an error in this case.
In snapshot_resume() we grab the reference to the origin device using
dm_hold() while holding _origins_lock (_origins_lock guarantees that the
device won't disappear). Then we release _origins_lock, suspend the
device and grab _origins_lock again.
NOTE to stable@ people:
When backporting to kernels 3.18 and older, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
In the function snapshot_resume we perform exception store handover. If
there is another active snapshot target, the exception store is moved
from this target to the target that is being resumed.
The problem is that if there is some pending exception, it will point to
an incorrect exception store after that handover, causing a crash due to
dm-snap-persistent.c:get_exception()'s BUG_ON.
This bug can be triggered by repeatedly changing snapshot permissions
with "lvchange -p r" and "lvchange -p rw" while there are writes on the
associated origin device.
To fix this bug, we must suspend the origin device when doing the
exception store handover to make sure that there are no pending
exceptions:
- introduce _origin_hash that keeps track of dm_origin structures.
- introduce functions __lookup_dm_origin, __insert_dm_origin and
__remove_dm_origin that manipulate the origin hash.
- modify snapshot_resume so that it calls dm_internal_suspend_fast() and
dm_internal_resume_fast() on the origin device.
NOTE to stable@ people:
When backporting to kernels 3.12-3.18, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
When backporting to kernels older than 3.12, you need to pick functions
dm_internal_suspend and dm_internal_resume from the commit
fd2ed4d252.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
__dm_destroy() must take the suspend_lock so that its presuspend and
postsuspend calls do not race with an internal suspend.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
It was always intended that a read to an unprovisioned block will return
zeroes regardless of whether the pool is in read-only or read-write
mode. thin_bio_map() was inconsistent with its handling of such reads
when the pool is in read-only mode, it now properly zero-fills the bios
it returns in response to unprovisioned block reads.
Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA
and just allow the IO to be deferred to the worker which will result in
pool->process_bio() handling the IO (which already properly zero-fills
reads to unprovisioned blocks).
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18
it can now be used by md.
This ensure that writing to these sysfs attributes will never
block due to a memory allocation.
Such blocking could become a deadlock if mdmon is trying to
reconfigure an array after a failure prior to re-enabling writes.
Signed-off-by: NeilBrown <neilb@suse.de>
When we have more than 1 drive failure, it's possible we start
rebuild one drive while leaving another faulty drive in array.
To determine whether array will be optimal after building, current
code only check whether a drive is missing, which could potentially
lead to data corruption. This patch is to add checking Faulty flag.
Signed-off-by: NeilBrown <neilb@suse.de>
When a drive is marked write-mostly it should only be the
target of reads if there is no other option.
This behaviour was broken by
commit 9dedf60313
md/raid1: read balance chooses idlest disk for SSD
which causes a write-mostly device to be *preferred* is some cases.
Restore correct behaviour by checking and setting
best_dist_disk and best_pending_disk rather than best_disk.
We only need to test one of these as they are both changed
from -1 or >=0 at the same time.
As we leave min_pending and best_dist unchanged, any non-write-mostly
device will appear better than the write-mostly device.
Reported-by: Tomáš Hodek <tomas.hodek@volny.cz>
Reported-by: Dark Penguin <darkpenguin@yandex.ru>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: http://marc.info/?l=linux-raid&m=135982797322422
Fixes: 9dedf60313
Cc: stable@vger.kernel.org (3.6+)
to changes that enable effective use of an unbound workqueue across
all available CPUs. A large battery of tests were performed to
validate these changes, summary of results is available here:
https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html
- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
and a small fix to the dm-space-map-disk.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU51MsAAoJEMUj8QotnQNan8wH/3Oa2/ofbX8zNYtdDxFBK7W0
pHwVA2CH9yzDDi90X9GsQODmXerMrYf+SHo/RsQTpSnt5WI/4ZVP/1EoNGu6T2Yr
XiaKc/Jwo+ScxdhoHSNtyGL5vKeipX6clgz1wcd/4/UBcBAr0Vxj25s8Ta7naK2V
hZyWnt3MCGEDQ45NVBmLOU78Cl68LGf8JOUY0l3cd8ehQjGyR9a12GXwRktepslp
hw9xu8uYmE93fD+MIe/fUs7W/WvIVgFSskhswlvD3kAFpEAsMczjLGVSRQlBE90L
OK7v1HKxiIUJa17g3LmiCFa59ZjrnSHGpOOv9IPAFU/LzFIU47lcFwRFjgfB8po=
=8vFi
-----END PGP SIGNATURE-----
Merge tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper changes from Mike Snitzer:
- Significant dm-crypt CPU scalability performance improvements thanks
to changes that enable effective use of an unbound workqueue across
all available CPUs. A large battery of tests were performed to
validate these changes, summary of results is available here:
https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html
- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
and a small fix to the dm-space-map-disk.
* tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm snapshot: fix a possible invalid memory access on unload
dm: fix a race condition in dm_get_md
dm crypt: sort writes
dm crypt: add 'submit_from_crypt_cpus' option
dm crypt: offload writes to thread
dm crypt: remove unused io_pool and _crypt_io_pool
dm crypt: avoid deadlock in mempools
dm crypt: don't allocate pages for a partial request
dm crypt: use unbound workqueue for request processing
dm io: reject unsupported DISCARD requests with EOPNOTSUPP
dm mirror: do not degrade the mirror on discard error
dm space map disk: fix sm_disk_count_is_more_than_one()
Pull kconfig updates from Michal Marek:
"Yann E Morin was supposed to take over kconfig maintainership, but
this hasn't happened. So I'm sending a few kconfig patches that I
collected:
- Fix for missing va_end in kconfig
- merge_config.sh displays used if given too few arguments
- s/boolean/bool/ in Kconfig files for consistency, with the plan to
only support bool in the future"
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kconfig: use va_end to match corresponding va_start
merge_config.sh: Display usage if given too few arguments
kconfig: use bool instead of boolean for type definition attributes
When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero. Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.
pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and it
calls retry_origin_bios() which dereferences s->origin. These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
struture after it is freed.
This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.
dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.
There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.
To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org