Commit Graph

143 Commits

Author SHA1 Message Date
Joe Thornber
9153df7405 dm cache: fix leaking of deferred bio prison cells
There were two cases where dm_cell_visit_release() was being called,
which removes the cell from the prison's rbtree, but the callers didn't
also return the cell to the mempool.  Fix this by having them call
free_prison_cell().

This leak manifested as the 'kmalloc-96' slab growing until OOM.

Fixes: 651f5fa2a3 ("dm cache: defer whole cells")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-08-31 15:08:14 -04:00
Kent Overstreet
8ae126660f block: kill merge_bvec_fn() completely
As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
 dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:57 -06:00
Joe Thornber
e44b6a5a3c dm cache: move wake_waker() from free_migrations() to where it is needed
This stops spurious wake ups from calls to prealloc_free_structs().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:20 -04:00
Mike Snitzer
795e633a2d dm cache: fix device destroy hang due to improper prealloc_used accounting
Commit 665022d72f ("dm cache: avoid calls to prealloc_free_structs() if
possible") introduced a regression that caused the removal of a DM cache
device to hang in cache_postsuspend()'s call to wait_for_migrations()
with the following stack trace:

  [<ffffffff81651457>] schedule+0x37/0x80
  [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache]
  [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0
  [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod]
  [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod]
  [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod]
  [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod]
  [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod]
  [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod]
  [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10
  [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
  [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0
  [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100
  [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70
  [<ffffffff811fd689>] SyS_ioctl+0x79/0x90
  [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110
  [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by accounting for the call to prealloc_data_structs()
immediately _before_ the call as opposed to after.  This is needed
because it is possible to break out of the control loop after the call
to prealloc_data_structs() but before prealloc_used was set to true.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:09 -04:00
Mike Snitzer
3508e6590d Revert "dm cache: do not wake_worker() in free_migration()"
This reverts commit 386cb7cdee.

Taking the wake_worker() out of free_migration() will slow writeback
dramatically, and hence adaptability.

Say we have 10k blocks that need writing back, but are only able to
issue 5 concurrently due to the migration bandwidth: it's imperative
that we wake_worker() immediately after migration completion; waiting
for the next 1 second wake up (via do_waker) means it'll take a long
time to write that all back.

Reported-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:08 -04:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Mike Snitzer
665022d72f dm cache: avoid calls to prealloc_free_structs() if possible
If no work was performed then prealloc_data_structs() wasn't ever called
so there isn't any need to call prealloc_free_structs().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:08 -04:00
Mike Snitzer
e782eff591 dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs()
if the policy doesn't have any dirty blocks ready for writeback.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:07 -04:00
Mike Snitzer
386cb7cdee dm cache: do not wake_worker() in free_migration()
All methods that queue work call wake_worker() as you'd expect.
E.g. cell_defer, defer_bio, quiesce_migration (which is called by
writeback, promote, demote_then_promote, invalidate, discard, etc).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:06 -04:00
Mike Snitzer
255eac2005 dm cache: display 'needs_check' in status if it is set
There is currently no way to see that the needs_check flag has been set
in the metadata.  Display 'needs_check' in the cache status if it is set
in the cache metadata.

Also, update cache documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:50 -04:00
Joe Thornber
fba10109a4 dm cache: age and write back cache entries even without active IO
The policy tick() method is normally called from interrupt context.
Both the mq and smq policies do some bottom half work for the tick
method in their map functions.  However if no IO is going through the
cache, then that bottom half work doesn't occur.  With these policies
this means recently hit entries do not age and do not get written
back as early as we'd like.

Fix this by introducing a new 'can_block' parameter to the tick()
method.  When this is set the bottom half work occurs immediately.
'can_block' is set when the tick method is called every second by the
core target (not in interrupt context).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:01 -04:00
Mike Snitzer
b61d950962 dm cache: prefix all DMERR and DMINFO messages with cache device name
Having the DM device name associated with the ERR or INFO message is
very helpful.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:01 -04:00
Joe Thornber
028ae9f76f dm cache: add fail io mode and needs_check flag
If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode.  If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.

Once needs_check is set the cache device will not be allowed to
activate.  Activation requires write access to metadata.  Future work is
needed to add proper support for running the cache in read-only mode.

Once in fail-io mode the cache will report a status of "Fail".

Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:00 -04:00
Joe Thornber
88bf5184fa dm cache: wake the worker thread every time we free a migration object
When the cache is idle, writeback work was only being issued every
second.  With this change outstanding writebacks are streamed
constantly.  This offers a writeback performance improvement.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:00 -04:00
Joe Thornber
40775257b9 dm cache: boost promotion of blocks that will be overwritten
When considering whether to move a block to the cache we already give
preferential treatment to discarded blocks, since they are cheap to
promote (no read of the origin required since the data is junk).

The same is true of blocks that are about to be completely
overwritten, so we likewise boost their promotion chances.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:07 -04:00
Joe Thornber
651f5fa2a3 dm cache: defer whole cells
Currently individual bios are deferred to the worker thread if they
cannot be processed immediately (eg, a block is in the process of
being moved to the fast device).

This patch passes whole cells across to the worker.  This saves
reaquiring the cell, and also collects bios destined for the same block
together, which allows them to be mapped with a single look up to the
policy.  This reduces the overhead of using dm-cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:06 -04:00
Joe Thornber
451b9e0071 dm cache: pull out some bitset utility functions for reuse
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:05 -04:00
Joe Thornber
20f6814b94 dm cache: pass a new 'critical' flag to the policies when requesting writeback work
We only allow non critical writeback if the origin is idle.  It is up
to the policy to decide what writeback work is critical.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:04 -04:00
Joe Thornber
066dbaa386 dm cache: track IO to the origin device using io_tracker
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:04 -04:00
Joe Thornber
77289d3207 dm cache: add io_tracker
A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.

FIXME: rather than jiffes, may be best to use ktime_t (to support faster
devices).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:03 -04:00
Joe Thornber
fb4100ae7f dm cache: fix race when issuing a POLICY_REPLACE operation
There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.

This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.

Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-05-29 14:19:03 -04:00
Mike Snitzer
326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
Jens Axboe
c4cf5261f8 bio: skip atomic inc/dec of ->bi_remaining for non-chains
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.

Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:47 -06:00
Linus Torvalds
802ea9d864 - Most significant change this cycle is request-based DM now supports
stacking ontop of blk-mq devices.  This blk-mq support changes the
   model request-based DM uses for cloning a request to relying on
   calling blk_get_request() directly from the underlying blk-mq device.
   Early consumer of this code is Intel's emerging NVMe hardware; thanks
   to Keith Busch for working on, and pushing for, these changes.
 
 - A few other small fixes and cleanups across other DM targets.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJU3NRnAAoJEMUj8QotnQNavG0H/3yogMcHvKg9H+w0WmUQdwhN
 w99Wj3nkquAw2sm9yahKlAMBNY53iu/LHmC6/PaTpJetgdH7y1foTrRa0qjyeB2D
 DgNr8mOzxSxzX6CX9V8JMwqzky9XoG2IOt/7FeQQOpMqp4T1M2zgvbZtpl0lK/f3
 lNaNBFpl+47NbGssD/WbtfI4Yy3hX0u406yGmQN5DxRyGTWD2AFqpA76g2mp8vrp
 wmw259gPr4oLhj3pDc0GkuiVn59ZR2Zp+2gs0jD5uKlDL84VP/nE+WNB+ny1Mnmt
 cOg8Q+W6/OosL66MKBHNsF0QS6DXNo5UvsN9fHGa5IUJw7Tsa11ZEPKHZGEbQw4=
 =RiN2
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper changes from Mike Snitzer:

 - The most significant change this cycle is request-based DM now
   supports stacking ontop of blk-mq devices.  This blk-mq support
   changes the model request-based DM uses for cloning a request to
   relying on calling blk_get_request() directly from the underlying
   blk-mq device.

   An early consumer of this code is Intel's emerging NVMe hardware;
   thanks to Keith Busch for working on, and pushing for, these changes.

 - A few other small fixes and cleanups across other DM targets.

* tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues
  dm snapshot: remove unnecessary NULL checks before vfree() calls
  dm mpath: simplify failure path of dm_multipath_init()
  dm thin metadata: remove unused dm_pool_get_data_block_size()
  dm ioctl: fix stale comment above dm_get_inactive_table()
  dm crypt: update url in CONFIG_DM_CRYPT help text
  dm bufio: fix time comparison to use time_after_eq()
  dm: use time_in_range() and time_after()
  dm raid: fix a couple integer overflows
  dm table: train hybrid target type detection to select blk-mq if appropriate
  dm: allocate requests in target when stacking on blk-mq devices
  dm: prepare for allocating blk-mq clone requests in target
  dm: submit stacked requests in irq enabled context
  dm: split request structure out from dm_rq_target_io structure
  dm: remove exports for request-based interfaces without external callers
2015-02-12 16:36:31 -08:00
Manuel Schölling
0f30af98cb dm: use time_in_range() and time_after()
To be future-proof and for better readability the time comparisons are modified
to use time_in_range() and time_after() instead of plain, error-prone math.

Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:48 -05:00
Joe Thornber
a59db67656 dm cache: fix problematic dual use of a single migration count variable
Introduce a new variable to count the number of allocated migration
structures.  The existing variable cache->nr_migrations became
overloaded.  It was used to:

 i) track of the number of migrations in flight for the purposes of
    quiescing during suspend.

 ii) to estimate the amount of background IO occuring.

Recent discard changes meant that REQ_DISCARD bios are processed with
a migration.  Discards are not background IO so nr_migrations was not
incremented.  However this could cause quiescing to complete early.

(i) is now handled with a new variable cache->nr_allocated_migrations.
cache->nr_migrations has been renamed cache->nr_io_migrations.
cleanup_migration() is now called free_io_migration(), since it
decrements that variable.

Also, remove the unused cache->next_migration variable that got replaced
with with prealloc_structs a while ago.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-23 11:06:08 -05:00
Joe Thornber
f824a2af3d dm cache: fix spurious cell_defer when dealing with partial block at end of device
We never bother caching a partial block that is at the back end of the
origin device.  No cell ever gets locked, but the calling code was
assuming it was and trying to release it.

Now the code only releases if the cell has been set to a non NULL
value.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:13 -05:00
Joe Thornber
1e32134a5a dm cache: dirty flag was mistakenly being cleared when promoting via overwrite
If the incoming bio is a WRITE and completely covers a block then we
don't bother to do any copying for a promotion operation.  Once this is
done the cache block and origin block will be different, so we need to
set it to 'dirty'.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:12 -05:00
Joe Thornber
f29a3147e2 dm cache: only use overwrite optimisation for promotion when in writeback mode
Overwrite causes the cache block and origin blocks to diverge, which
is only allowed in writeback mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:12 -05:00
Joe Thornber
2bb812df63 dm cache: discard block size must be a multiple of cache block size
Otherwise the cache blocks may span two discard blocks, which we don't
handle when doing the discard lookup.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:11 -05:00
Joe Thornber
43c32bf2b0 dm cache: fix a harmless race when working out if a block is discarded
It is more correct to hold the cell before checking the discard state.
These flags are only used as hints to the policy so this change will
have negligable effect.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:11 -05:00
Joe Thornber
3e2e1c3098 dm cache: when reloading a discard bitset allow for a different discard block size
The discard block size can change if the origin changes size or if an
old DM cache is upgraded from using a discard block size that was equal
to cache block size.

To fix this an extent of discarded blocks is established for the purpose
of translating the old discard block size to the new in-core discard
block size and set bits.  The old (potentially huge) discard bitset is
left ondisk until it is re-written using the new in-core information on
the next successful DM cache shutdown.

Fixes: 7ae34e7778 ("dm cache: improve discard support")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:10 -05:00
Joe Thornber
2572629a13 dm cache: fix some issues with the new discard range support
Commit 7ae34e777 ("dm cache: improve discard support") needed to also:
- discontinue having DM core split the discard bios on cache block
  boundaries
- calculate the cache's discard_nr_blocks relative to the determined
  discard_block_size rather than using oblock_to_dblock()

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:09 -05:00
Joe Thornber
d1d9220cba dm cache: emit a warning message if there are a lot of cache blocks
Loading and saving millions of block mappings takes time.  We may as
well explain what's going on, and encourage people to use a larger
cache block size.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-12 20:14:59 -05:00
Joe Thornber
7ae34e7778 dm cache: improve discard support
Safely allow the discard blocksize to be larger than the cache blocksize
by using the bio prison's range locking support.  This also improves
discard performance considerly because larger discards are issued to the
dm-cache device.  The discard blocksize was always intended to be
greater than the cache blocksize.  But until now it wasn't implemented
safely.

Also, by safely restoring the ability to have discard blocksize larger
than cache blocksize we're able to significantly reduce the memory used
for the cache's discard bitset.  Before, with a small discard blocksize,
the discard bitset could get quite large because its size is a function
of the discard blocksize and the origin device's size.  For example,
previously, using a 32KB cache blocksize with a 40TB origin resulted in
1280MB of incore memory use for the discard bitset!  Now, the discard
blocksize is scaled up accordingly to ensure the discard bitset is
capped at 2**14 bits, or 16KB.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
08b184514f dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size"
This reverts commit d132cc6d9e because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
1bad9bc4ee dm cache: revert "remove remainder of distinct discard block size"
This reverts commit 64ab346a36 because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
5f274d8865 dm bio prison: introduce support for locking ranges of blocks
Ranges will be placed in the same cell if they overlap.

Range locking is a prerequisite for more efficient multi-block discard
support in both the cache and thin-provisioning targets.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
a195db2d29 dm bio prison: switch to using a red black tree
Previously it was using a fixed sized hash table.  There are times
when very many concurrent cells are held (such as when processing a very
large discard).  When this happens the hash table performance becomes
very poor.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Anssi Hannula
40aa978ecc dm cache: fix race causing dirty blocks to be marked as clean
When a writeback or a promotion of a block is completed, the cell of
that block is removed from the prison, the block is marked as clean, and
the clear_dirty() callback of the cache policy is called.

Unfortunately, performing those actions in this order allows an incoming
new write bio for that block to come in before clearing the dirty status
is completed and therefore possibly causing one of these two scenarios:

Scenario A:

Thread 1                      Thread 2
cell_defer()                  .
- cell removed from prison    .
- detained bios queued        .
.                             incoming write bio
.                             remapped to cache
.                             set_dirty() called,
.                               but block already dirty
.                               => it does nothing
clear_dirty()                 .
- block marked clean          .
- policy clear_dirty() called .

Result: Block is marked clean even though it is actually dirty. No
writeback will occur.

Scenario B:

Thread 1                      Thread 2
cell_defer()                  .
- cell removed from prison    .
- detained bios queued        .
clear_dirty()                 .
- block marked clean          .
.                             incoming write bio
.                             remapped to cache
.                             set_dirty() called
.                             - block marked dirty
.                             - policy set_dirty() called
- policy clear_dirty() called .

Result: Block is properly marked as dirty, but policy thinks it is clean
and therefore never asks us to writeback it.
This case is visible in "dmsetup status" dirty block count (which
normally decreases to 0 on a quiet device).

Fix these issues by calling clear_dirty() before calling cell_defer().
Incoming bios for that block will then be detained in the cell and
released only after clear_dirty() has completed, so the race will not
occur.

Found by inspecting the code after noticing spurious dirty counts
(scenario B).

Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-09-10 11:20:47 -04:00
Mike Snitzer
b02465308f dm cache: set minimum_io_size to cache's data block size
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the cache's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update cache_io_hints() to set both minimum_io_size and optimal_io_size
to the cache's data block size.  This fixes an issue where mkfs.xfs
would create more XFS Allocation Groups on cache volumes than on a
normal linear LV of comparable size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:36 -04:00
Mike Snitzer
895b47d798 dm cache metadata: use dm-space-map-metadata.h defined size limits
Commit 7d48935e cleaned up the persistent-data's space-map-metadata
limits by elevating them to dm-space-map-metadata.h.  Update
dm-cache-metadata to use these same limits.

The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the
sizeof the disk_bitmap_header.  So the supported maximum metadata size
is a bit smaller (reduced from 33423360 to 33292800 sectors).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-08-01 12:30:33 -04:00
Joe Thornber
304affaa88 dm cache: fail migrations in the do_worker error path
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:33 -04:00
Joe Thornber
8c081b52c6 dm cache: simplify deferred set reference count increments
Factor out inc_and_issue and inc_ds helpers to simplify deferred set
reference count increments.  Also cleanup cache_map to consistently call
cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED.

No functional change.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:32 -04:00
Anssi Hannula
44fa816bb7 dm cache: fix race affecting dirty block count
nr_dirty is updated without locking, causing it to drift so that it is
non-zero (either a small positive integer, or a very large one when an
underflow occurs) even when there are no actual dirty blocks.  This was
due to a race between the workqueue and map function accessing nr_dirty
in parallel without proper protection.

People were seeing under runs due to a race on increment/decrement of
nr_dirty, see: https://lkml.org/lkml/2014/6/3/648

Fix this by using an atomic_t for nr_dirty.

Reported-by: roma1390@gmail.com
Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-08-01 12:25:22 -04:00
Heinz Mauelshagen
f1daa838e8 dm cache: always split discards on cache block boundaries
The DM cache target cannot cope with discards that span multiple cache
blocks, so each discard bio that spans more than one cache block must
get split by the DM core.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.9+
2014-05-27 10:33:05 -04:00
Mike Snitzer
131cd131a9 dm cache: fix writethrough mode quiescing in cache_map
Commit 2ee57d5873 ("dm cache: add passthrough mode") inadvertently
removed the deferred set reference that was taken in cache_map()'s
writethrough mode support.  Restore taking this reference.

This issue was found with code inspection.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-05-01 16:14:24 -04:00
Joe Thornber
0596661f0a dm cache: fix a lock-inversion
When suspending a cache the policy is walked and the individual policy
hints written to the metadata via sync_metadata().  This led to this
lock order:

      policy->lock
        cache_metadata->root_lock

When loading the cache target the policy is populated while the metadata
lock is held:

      cache_metadata->root_lock
         policy->lock

Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
ensuring the cache_metadata root_lock is held whilst all the hints are
written, rather than being repeatedly locked while policy->lock is held
(as was the case with each callout that policy_walk_mappings() made to
the old save_hint() method).

Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
locking correctness") build option.  However, it is not clear how the
LOCKDEP reported paths can lead to a deadlock since the two paths,
suspending a target and loading a target, never occur at the same time.
But that doesn't mean the same lock-inversion couldn't have occurred
elsewhere.

Reported-by: Marian Csontos <mcsontos@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-04-04 14:53:05 -04:00
Heinz Mauelshagen
64ab346a36 dm cache: remove remainder of distinct discard block size
Discard block size not being equal to cache block size causes data
corruption by erroneously avoiding migrations in issue_copy() because
the discard state is being cleared for a group of cache blocks when it
should not.

Completely remove all code that enabled a distinction between the
cache block size and discard block size.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Mike Snitzer
d132cc6d9e dm cache: prevent corruption caused by discard_block_size > cache_block_size
If the discard block size is larger than the cache block size we will
not properly quiesce IO to a region that is about to be discarded.  This
results in a race between a cache migration where no copy is needed, and
a write to an adjacent cache block that's within the same large discard
block.

Workaround this by limiting the discard_block_size to cache_block_size.
Also limit the max_discard_sectors to cache_block_size.

A more comprehensive fix that introduces range locking support in the
bio_prison and proper quiescing of a discard range that spans multiple
cache blocks is already in development.

Reported-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org
2014-03-27 16:56:23 -04:00
Heinz Mauelshagen
e893fba90c dm cache: fix access beyond end of origin device
In order to avoid wasting cache space a partial block at the end of the
origin device is not cached.  Unfortunately, the check for such a
partial block at the end of the origin device was flawed.

Fix accesses beyond the end of the origin device that occured due to
attempted promotion of an undetected partial block by:

- initializing the per bio data struct to allow cache_end_io to work properly
- recognizing access to the partial block at the end of the origin device
- avoiding out of bounds access to the discard bitset

Otherwise, users can experience errors like the following:

 attempt to access beyond end of device
 dm-5: rw=0, want=20971520, limit=20971456
 ...
 device-mapper: cache: promotion failed; couldn't copy block

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-12 13:52:00 -04:00
Heinz Mauelshagen
8b9d966665 dm cache: fix truncation bug when copying a block to/from >2TB fast device
During demotion or promotion to a cache's >2TB fast device we must not
truncate the cache block's associated sector to 32bits.  The 32bit
temporary result of from_cblock() caused a 32bit multiplication when
calculating the sector of the fast device in issue_copy_real().

Use an intermediate 64bit type to store the 32bit from_cblock() to allow
for proper 64bit multiplication.

Here is an example of how this bug manifests on an ext4 filesystem:

 EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
 JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-12 13:49:27 -04:00
Heinz Mauelshagen
e0d849fad7 dm cache: fix truncation bug when mapping I/O to >2TB fast device
When remapping a block to the cache's fast device that is larger than
2TB we must not truncate the destination sector to 32bits.  The 32bit
temporary result of from_cblock() was being overflowed in
remap_to_cache() due to the logical left shift.

Use an intermediate 64bit type to store the 32bit from_cblock() result
to fix the overflow.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-02-28 09:23:02 -05:00
Mike Snitzer
80ae49aaed dm cache: do not add migration to completed list before unhooking bio
When completing an overwrite bio, in overwrite_endio(), the associated
migration should not be added to the 'completed_migrations' until the
bio's fields are restored with dm_unhook_bio().

Otherwise, do_worker() can race to process 'completed_migrations' before
dm_unhook_bio() -- so the bio's bi_end_io is incorrect.  This is
unlikely to cause any problems given the current code but should be
fixed on the basis of correctness.

Also, the cache's spinlock only needs to be held when manipulating the
'completed_migrations' list -- other changes don't need protection.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-02-17 11:00:05 -05:00
Mike Snitzer
c6eda5e81c dm cache: move hook_info into common portion of per_bio_data structure
Commit c9d28d5d ("dm cache: promotion optimisation for writes")
incorrectly placed the 'hook_info' member in the writethrough-only
portion of the per_bio_data structure.

Given that the overwrite optimization may be used for writeback the
'hook_info' member must be placed above the 'cache' member of the
per_bio_data structure.  Any members above 'cache' are available from
both writeback and writethrough modes' per_bio_data structure.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-02-17 11:00:05 -05:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Mike Snitzer
2e68c4e6ca dm cache: add policy name to status output
The cache's policy may have been established using the "default" alias,
which is currently the "mq" policy but the default policy may change in
the future.  It is useful to know exactly which policy is being used.

Add a 'real' member to the dm_cache_policy_type structure and have the
"default" dm_cache_policy_type point to the real "mq"
dm_cache_policy_type.  Update dm_cache_policy_get_name() to check if
real is set, if so report the name of the real policy (not the alias).

Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-16 13:44:11 -05:00
Mike Snitzer
6a388618f1 dm cache: add block sizes and total cache blocks to status output
Improve cache_status to emit:
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
...

Adding the block sizes allows for easier calculation of the overall size
of both the metadata and cache devices.  Adding <#total cache blocks>
provides useful context for how much of the cache is used.

Unfortunately these additions to the status will require updates to
users' scripts that monitor the cache status.  But these changes help
provide more comprehensive information about the cache device and will
simplify tools that are being developed to manage dm-cache devices --
because they won't need to issue 3 operations to cobble together the
information that we can easily provide via a single status ioctl.

While updating the status documentation in cache.txt spaces were
tabify'd.

Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-10 10:24:33 -05:00
Jens Axboe
b28bc9b38c Linux 3.13-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
 LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
 eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
 YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
 =/4/R
 -----END PGP SIGNATURE-----

Merge tag 'v3.13-rc6' into for-3.14/core

Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.

Fixup merge issues related to the immutable biovec changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/blk-flush.c
	fs/btrfs/check-integrity.c
	fs/btrfs/extent_io.c
	fs/btrfs/scrub.c
	fs/logfs/dev_bdev.c
2013-12-31 09:51:02 -07:00
Vincent Pelletier
088448007b dm cache: actually resize cache
Commit f494a9c6b1 ("dm cache: cache
shrinking support") broke cache resizing support.

dm_cache_resize() is called with cache->cache_size before it gets
updated to new_size, so it is a no-op.  But the dm-cache superblock is
updated with the new_size even though the backing dm-array is not
resized.  Fix this by passing the new_size to dm_cache_resize().

Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-12-10 16:35:15 -05:00
Mike Snitzer
8d30726912 dm cache: increment bi_remaining when bi_end_io is restored
Move the bio->bi_remaining increment into dm_unhook_bio() so the
overwrite_endio() handler works as expected.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-12-03 19:16:04 -07:00
Kent Overstreet
196d38bccf block: Generic bio chaining
This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.

Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
2013-11-23 22:33:56 -08:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Mike Snitzer
7b6b2bc98c dm cache: resolve small nits and improve Documentation
Document passthrough mode, cache shrinking, and cache invalidation.
Also, use strcasecmp() and hlist_unhashed().

Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-12 13:11:09 -05:00
Joe Thornber
65790ff919 dm cache: add cache block invalidation support
Cache block invalidation is removing an entry from the cache without
writing it back.  Cache blocks can be invalidated via the
'invalidate_cblocks' message, which takes an arbitrary number of cblock
ranges:
   invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.
   dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:51 -05:00
Joe Thornber
2ee57d5873 dm cache: add passthrough mode
"Passthrough" is a dm-cache operating mode (like writethrough or
writeback) which is intended to be used when the cache contents are not
known to be coherent with the origin device.  It behaves as follows:

* All reads are served from the origin device (all reads miss the cache)
* All writes are forwarded to the origin device; additionally, write
  hits cause cache block invalidates

This mode decouples cache coherency checks from cache device creation,
largely to avoid having to perform coherency checks while booting.  Boot
scripts can create cache devices in passthrough mode and put them into
service (mount cached filesystems, for example) without having to worry
about coherency.  Coherency that exists is maintained, although the
cache will gradually cool as writes take place.

Later, applications can perform coherency checks, the nature of which
will depend on the type of the underlying storage.  If coherency can be
verified, the cache device can be transitioned to writethrough or
writeback mode while still warm; otherwise, the cache contents can be
discarded prior to transitioning to the desired operating mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:49 -05:00
Joe Thornber
f494a9c6b1 dm cache: cache shrinking support
Allow a cache to shrink if the blocks being removed from the cache are
not dirty.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:45 -05:00
Joe Thornber
c9d28d5d09 dm cache: promotion optimisation for writes
If a write block triggers promotion and covers a whole block we can
avoid a copy.

Introduce dm_{hook,unhook}_bio to simplify saving and restoring bio
fields (bi_private is now used by overwrite).  Switch writethrough
support over to using these helpers too.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:26 -05:00
Heinz Mauelshagen
ffcbcb6720 dm cache: optimize commit_if_needed
Check commit_requested flag _before_ calling
dm_cache_changed_this_transaction() superfluously.

Also, be sure to set last_commit_jiffies _after_ dm_cache_commit()
completes.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:24 -05:00
Heinz Mauelshagen
2c2263c93f dm cache: log error message if dm_kcopyd_copy() fails
A migration failure should be logged (albeit limited).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:19 -05:00
Heinz Mauelshagen
80f659f3f5 dm cache: use cell_defer() boolean argument consistently
Fix a few cell_defer() calls that weren't passing a bool.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:19 -05:00
Mikulas Patocka
4cb3e1db21 dm cache: return -EINVAL if the user specifies unknown cache policy
Return -EINVAL when the specified cache policy is unknown rather than
returning -ENOMEM.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:18 -05:00
Joe Thornber
238f8363b6 dm cache: improve efficiency of quiescing flag management
Make the quiescing flag an atomic_t and stop protecting it with a spin
lock.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:19:59 -05:00
Joe Thornber
66cb1910df dm cache: fix a race condition between queuing new migrations and quiescing for a shutdown
The code that was trying to do this was inadequate.  The postsuspend
method (in ioctl context), needs to wait for the worker thread to
acknowledge the request to quiesce.  Otherwise the migration count may
drop to zero temporarily before the worker thread realises we're
quiescing.  In this case the target will be taken down, but the worker
thread may have issued a new migration, which will cause an oops when
it completes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
2013-11-09 17:55:50 -05:00
Joe Thornber
f8e5f01a32 dm cache: io destined for the cache device can now serve as tick bios
Previously only origin bios could trigger ticks, which meant if all
the io was destined for the cache no ticks were generated.  If no ticks
are generated then multiple hits, and movements in general, are
attributed to the same tick.

Only a stop gap fix, we need a better solution.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 17:55:49 -05:00
Mike Snitzer
c9ec5d7c7b dm cache: eliminate holes in cache structure
Reorder members in the cache structure to eliminate 6 out of 7 holes
(reclaiming 24 bytes).  Also, the 'worker' and 'waker' members no longer
straddle cachelines.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:14 -04:00
Mike Snitzer
f610937214 dm cache: fix stacking of geometry limits
Do not blindly override the queue limits (specifically io_min and
io_opt).  Allow traditional stacking of these limits if io_opt is a
factor of the cache's data block size.

Without this patch mkfs.xfs does not recognize the cache device's
provided limits as a useful geometry (e.g. raid) so these hints are
ignored.  This was due to setting io_min to a useless value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:14 -04:00
Mike Snitzer
0547304463 dm cache: add data block size limits to code and Documentation
Place upper bound on the cache's data block size (1GB).

Inform users that the data block size can't be any arbitrary number,
i.e. its value must be between 32KB and 1GB.  Also, it should be a
multiple of 32KB.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:13 -04:00
Mikulas Patocka
43aeaa2957 dm cache: fix arm link errors with inline
Use __always_inline to avoid a link failure with gcc 4.6 on ARM.
gcc 4.7 is OK.

It creates a function block_div.part.8, it references __udivdi3 and
__umoddi3 and it is never called. The references to __udivdi3 and
__umoddi3 cause a link failure.

Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Joe Thornber
2f14f4b51e dm cache: set config value
Share configuration option processing code between the dm cache
ctr and message functions.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Alasdair G Kergon
2c73c471fb dm cache: move config fns
Move process_config_option() in dm-cache-target.c to make the
next patch more readable.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Joe Thornber
8c5008fac4 dm cache: replace memcpy with struct assignment
Use struct assignment rather than memcpy in dm cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:18 +01:00
Joe Thornber
aeed1420a3 dm cache: fix typos in comments
Fix up some typos in dm-cache comments.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:18 +01:00
Joe Thornber
f8350daf7a dm cache: tune migration throttling
Tune the dm cache migration throttling.

i) Issue a tick every second, just in case there's no i/o going through.

ii) Drop the migration threshold right down to something suitable for
background work.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:16 +01:00
Wei Yongjun
fa4d683af3 dm cache: fix error return code in cache_create
Return -ENOMEM if memory allocation fails in cache_create
instead of 0 (to avoid NULL pointer dereference).

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:14 +01:00
Mike Snitzer
19b0092e26 dm cache: reduce bio front_pad size in writeback mode
A recent patch to fix the dm cache target's writethrough mode extended
the bio's front_pad to include a 1056-byte struct dm_bio_details.
Writeback mode doesn't need this, so this patch reduces the
per_bio_data_size to 16 bytes in this case instead of 1096.

The dm_bio_details structure was added in "dm cache: fix writes to
cache device in writethrough mode" which fixed commit e2e74d617e ("dm
cache: fix race in writethrough implementation").  In writeback mode
we avoid allocating the writethrough-specific members of the
per_bio_data structure (the dm_bio_details structure included).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-04-05 15:36:34 +01:00
Darrick J. Wong
b844fe6918 dm cache: fix writes to cache device in writethrough mode
The dm-cache writethrough strategy introduced by commit e2e74d617e
("dm cache: fix race in writethrough implementation") issues a bio to
the origin device, remaps and then issues the bio to the cache device.
This more conservative in-series approach was selected to favor
correctness over performance (of the previous parallel writethrough).
However, this in-series implementation that reuses the same bio to write
both the origin and cache device didn't take into account that the block
layer's req_bio_endio() modifies a completing bio's bi_sector and
bi_size.  So the new writethrough strategy needs to preserve these bio
fields, and restore them before submission to the cache device,
otherwise nothing gets written to the cache (because bi_size is 0).

This patch adds a struct dm_bio_details field to struct per_bio_data,
and uses dm_bio_record() and dm_bio_restore() to ensure the bio is
restored before reissuing to the cache device.  Adding such a large
structure to the per_bio_data is not ideal but we can improve this
later, for now correctness is the important thing.

This problem initially went unnoticed because the dm-cache test-suite
uses a linear DM device for the dm-cache device's origin device.
Writethrough worked as expected because DM submits a *clone* of the
original bio, so the original bio which was reused for the cache was
never touched.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-04-05 15:36:32 +01:00
Mike Snitzer
ea2dd8c1ed dm cache: policy ignore hints if generated by different version
When reading the dm cache metadata from disk, ignore the policy hints
unless they were generated by the same major version number of the same
policy module.

The hints are considered to be private data belonging to the specific
module that generated them and there is no requirement for them to make
sense to different versions of the policy that generated them.
Policy modules are all required to work fine if no previous hints are
supplied (or if existing hints are lost).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:28 +00:00
Joe Thornber
e2e74d617e dm cache: fix race in writethrough implementation
We have found a race in the optimisation used in the dm cache
writethrough implementation.  Currently, dm core sends the cache target
two bios, one for the origin device and one for the cache device and
these are processed in parallel.  This patch avoids the race by
changing the code back to a simpler (slower) implementation which
processes the two writes in series, one after the other, until we can
develop a complete fix for the problem.

When the cache is in writethrough mode it needs to send WRITE bios to
both the origin and cache devices.

Previously we've been implementing this by having dm core query the
cache target on every write to find out how many copies of the bio it
wants.  The cache will ask for two bios if the block is in the cache,
and one otherwise.

Then main problem with this is it's racey.  At the time this check is
made the bio hasn't yet been submitted and so isn't being taken into
account when quiescing a block for migration (promotion or demotion).
This means a single bio may be submitted when two were needed because
the block has since been promoted to the cache (catastrophic), or two
bios where only one is needed (harmless).

I really don't want to start entering bios into the quiescing system
(deferred_set) in the get_num_write_bios callback.  Instead this patch
simplifies things; only one bio is submitted by the core, this is
first written to the origin and then the cache device in series.
Obviously this will have a latency impact.

deferred_writethrough_bios is introduced to record bios that must be
later issued to the cache device from the worker thread.  This deferred
submission, after the origin bio completes, is required given that we're
in interrupt context (writethrough_endio).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:27 +00:00
Heinz Mauelshagen
b978440b8d dm cache: avoid calling policy destructor twice on error
If the cache policy's config values are not able to be set we must
set the policy to NULL after destroying it in create_cache_policy()
so we don't attempt to destroy it a second time later.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:26 +00:00
Heinz Mauelshagen
617a0b89da dm cache: detect cache_create failure
Return error if cache_create() fails.

A missing return check made cache_ctr continue even after an error in
cache_create() resulting in the cache object being destroyed.  So a
simple failure like an odd number of cache policy config value arguments
would result in an oops.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:26 +00:00
Joe Thornber
414dd67d50 dm cache: avoid 64 bit division on 32 bit
Squash various 32bit link errors.

  >> on i386:
  >> drivers/built-in.o: In function `is_discarded_oblock':
  >> dm-cache-target.c:(.text+0x1ea28e): undefined reference to `__udivdi3'
  ...

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:25 +00:00
Joe Thornber
c6b4fcbad0 dm: add cache target
Add a target that allows a fast device such as an SSD to be used as a
cache for a slower device such as a disk.

A plug-in architecture was chosen so that the decisions about which data
to migrate and when are delegated to interchangeable tunable policy
modules.  The first general purpose module we have developed, called
"mq" (multiqueue), follows in the next patch.  Other modules are
under development.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00