Commit Graph

3539 Commits

Author SHA1 Message Date
Nicholas Swenson
97d11a660f bcache: Fix heap_peek() macro
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:57 -08:00
Nicholas Swenson
9eb8ebeb24 bcache: Fix for can_attach_cache()
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:57 -08:00
Kent Overstreet
d24a6e1087 bcache: Fix dirty_data accounting
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.

NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes
necessary that got squashed in with other patches; the full patch against 3.10
is 408cc2f47eeac93a, available at:
  git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a46036..4a12b2f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree *b, struct bkey *insert,
 			if (KEY_START(k) > KEY_START(insert) + sectors_found)
 				goto check_failed;

-			if (KEY_PTRS(replace_key) != KEY_PTRS(k))
+			if (KEY_PTRS(k) != KEY_PTRS(replace_key) ||
+			    KEY_DIRTY(k) != KEY_DIRTY(replace_key))
 				goto check_failed;

 			/* skip past gen */
2013-12-16 14:22:16 -08:00
Kent Overstreet
ce2b3f595e bcache: Use uninterruptible sleep in writeback
We're just waiting on kthread_should_stop(), nothing else, so
interruptible sleep was wrong here.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:04:57 -08:00
Stefan Priebe
f665c0f852 bcache: kthread don't set writeback task to INTERUPTIBLE
at the beginning (schedule_timout_interuptible) and others
do his on their own

This prevents wrong load average calculation (load of 1 per thread)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:04:57 -08:00
Linus Torvalds
93e1585e2c A set of device-mapper fixes for 3.13.
A fix for possible memory corruption during DM table load, fix a
 possible leak of snapshot space in case of a crash, fix a possible
 deadlock due to a shared workqueue in the delay target, fix to
 initialize read-only module parameters that are used to export metrics
 for dm stats and dm bufio.
 
 Quite a few stable fixes were identified for both the thin-provisioning
 and caching targets as a result of increased regression testing using
 the device-mapper-test-suite (dmts).  The most notable of these are the
 reference counting fixes for the space map btree that is used by the
 dm-array interface -- without these the dm-cache metadata will leak,
 resulting in dm-cache devices running out of metadata blocks.  Also,
 some important fixes related to the thin-provisioning target's
 transition to read-only mode on error.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQEcBAABAgAGBQJSq2pVAAoJEMUj8QotnQNaAzEH/0qpz06SVd3+oTbsWeWxXwax
 B0zIZ4HPnaD9FDfSWN+6IdQ6Rkn51cspMBHPxWX0I8pmqOTg7MUM7wbaZZaKT3UW
 aDlibzG1O3zBGbkr+Qhbh89fJK5/3xnZXlK0hOyaNI2vz1rA6RThZ2hIeak9xQYr
 7USz2bjSMXchwebb0Z01CrRnOkhUFg6yUJURIEu4XFCJmDM/PM9zKogLGYKuZedK
 pZePnZhwgkLMn7f9l4lPHk3EcWeD4Nf9WR0lS4eKdbYsvMxm1QE7BpDlVJD0Asjg
 /SVOVkgygW18TINMHuw1K0DPdSCo5UZF2HRj1QUD/X4N2BSkio+E1qwF6kT4ltk=
 =0Ix+
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "A set of device-mapper fixes for 3.13.

  A fix for possible memory corruption during DM table load, fix a
  possible leak of snapshot space in case of a crash, fix a possible
  deadlock due to a shared workqueue in the delay target, fix to
  initialize read-only module parameters that are used to export metrics
  for dm stats and dm bufio.

  Quite a few stable fixes were identified for both the thin-
  provisioning and caching targets as a result of increased regression
  testing using the device-mapper-test-suite (dmts).  The most notable
  of these are the reference counting fixes for the space map btree that
  is used by the dm-array interface -- without these the dm-cache
  metadata will leak, resulting in dm-cache devices running out of
  metadata blocks.  Also, some important fixes related to the
  thin-provisioning target's transition to read-only mode on error"

* tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm array: fix a reference counting bug in shadow_ablock
  dm space map: disallow decrementing a reference count below zero
  dm stats: initialize read-only module parameter
  dm bufio: initialize read-only module parameters
  dm cache: actually resize cache
  dm cache: update Documentation for invalidate_cblocks's range syntax
  dm cache policy mq: fix promotions to occur as expected
  dm thin: allow pool in read-only mode to transition to read-write mode
  dm thin: re-establish read-only state when switching to fail mode
  dm thin: always fallback the pool mode if commit fails
  dm thin: switch to read-only mode if metadata space is exhausted
  dm thin: switch to read only mode if a mapping insert fails
  dm space map metadata: return on failure in sm_metadata_new_block
  dm table: fail dm_table_create on dm_round_up overflow
  dm snapshot: avoid snapshot space leak on crash
  dm delay: fix a possible deadlock due to shared workqueue
2013-12-13 13:22:22 -08:00
Joe Thornber
ed9571f0cf dm array: fix a reference counting bug in shadow_ablock
An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.

The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
2013-12-13 14:22:10 -05:00
Joe Thornber
5b564d80f8 dm space map: disallow decrementing a reference count below zero
The old behaviour, returning -EINVAL if a ref_count of 0 would be
decremented, was removed in commit f722063 ("dm space map: optimise
sm_ll_dec and sm_ll_inc").  To fix this regression we return an error
code from the mutator function pointer passed to sm_ll_mutate() and have
dec_ref_count() return -EINVAL if the old ref_count is 0.

Add a DMERR to reflect the potential seriousness of this error.

Also, add missing dm_tm_unlock() to sm_ll_mutate()'s error path.

With this fix the following dmts regression test now passes:
 dmtest run --suite cache -n /metadata_use_kernel/

The next patch fixes the higher-level dm-array code that exposed this
regression.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
2013-12-13 14:22:09 -05:00
Tejun Heo
324a56e16e kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly
kernfs has just been separated out from sysfs and we're already in
full conflict mode.  Nothing can make the situation any worse.  Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above

Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them.  All are sysfs_dirent renames and trivial.  While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.

This patch is strictly rename only and doesn't introduce any
functional difference.

- mic / gpio renames were missing.  Spotted by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 15:28:36 -08:00
Mikulas Patocka
76f5bee5c3 dm stats: initialize read-only module parameter
The module parameter stats_current_allocated_bytes in dm-mod is
read-only.  This parameter informs the user about memory
consumption.  It is not supposed to be changed by the user.

However, despite being read-only, this parameter can be set on
modprobe or insmod command line:
modprobe dm-mod stats_current_allocated_bytes=12345

The kernel doesn't expect that this variable can be non-zero at module
initialization and if the user sets it, it results in warning.

This patch initializes the variable in the module init routine, so
that user-supplied value is ignored.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
2013-12-10 19:13:21 -05:00
Mikulas Patocka
4cb57ab4a2 dm bufio: initialize read-only module parameters
Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.

However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345

The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.

This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.2+
2013-12-10 19:13:20 -05:00
Vincent Pelletier
088448007b dm cache: actually resize cache
Commit f494a9c6b1 ("dm cache: cache
shrinking support") broke cache resizing support.

dm_cache_resize() is called with cache->cache_size before it gets
updated to new_size, so it is a no-op.  But the dm-cache superblock is
updated with the new_size even though the backing dm-array is not
resized.  Fix this by passing the new_size to dm_cache_resize().

Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-12-10 16:35:15 -05:00
Joe Thornber
af95e7a69b dm cache policy mq: fix promotions to occur as expected
Micro benchmarks that repeatedly issued IO to a single block were
failing to cause a promotion from the origin device to the cache.  Fix
this by not updating the stats during map() if -EWOULDBLOCK will be
returned.

The mq policy will only update stats, consider migration, etc, once per
tick period (a unit of time established between dm-cache core and the
policies).

When the IO thread calls the policy's map method, if it would like to
migrate the associated block it returns -EWOULDBLOCK, the IO then gets
handed over to a worker thread which handles the migration.  The worker
thread calls map again, to check the migration is still needed (avoids a
race among other things).  *BUT*, before this fix, if we were still in
the same tick period the stats were already updated by the previous map
call -- so the migration would no longer be requested.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-12-10 16:35:14 -05:00
Joe Thornber
9b7aaa64f9 dm thin: allow pool in read-only mode to transition to read-write mode
A thin-pool may be in read-only mode because the pool's data or metadata
space was exhausted.  To allow for recovery, by adding more space to the
pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE
mode.  Otherwise, running out of space will render the pool permanently
read-only.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-12-10 16:35:13 -05:00
Joe Thornber
5383ef3a92 dm thin: re-establish read-only state when switching to fail mode
If the thin-pool transitioned to fail mode and the thin-pool's table
were reloaded for some reason: the new table's default pool mode would
be read-write, though it will transition to fail mode during resume.

When the pool mode transitions directly from PM_WRITE to PM_FAIL we need
to re-establish the intermediate read-only state in both the metadata
and persistent-data block manager (as is usually done with the normal
pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-12-10 16:35:12 -05:00
Joe Thornber
020cc3b5e2 dm thin: always fallback the pool mode if commit fails
Rename commit_or_fallback() to commit().  Now all previous calls to
commit() will trigger the pool mode to fallback if the commit fails.

Also, check the error returned from commit() in alloc_data_block().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-12-10 16:35:12 -05:00
Mike Snitzer
4a02b34e0c dm thin: switch to read-only mode if metadata space is exhausted
Switch the thin pool to read-only mode in alloc_data_block() if
dm_pool_alloc_data_block() fails because the pool's metadata space is
exhausted.

Differentiate between data and metadata space in messages about no
free space available.

This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/

The quantity of errors logged in this case must be reduced.

before patch:

device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
<snip ... these repeat for a _very_ long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: commit failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode

after patch:

device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2013-12-10 16:35:04 -05:00
Joe Thornber
fafc7a815e dm thin: switch to read only mode if a mapping insert fails
Switch the thin pool to read-only mode when dm_thin_insert_block() fails
since there is little reason to expect the cause of the failure to be
resolved without further action by user space.

This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/

The quantity of errors logged in this case must be reduced.

before patch:

device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
<snip ... these repeat for a long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode

after patch:

device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-12-10 16:34:29 -05:00
Mike Snitzer
f62b6b8f49 dm space map metadata: return on failure in sm_metadata_new_block
Commit 2fc48021f4 ("dm persistent
metadata: add space map threshold callback") introduced a regression
to the metadata block allocation path that resulted in errors being
ignored.  This regression was uncovered by running the following
device-mapper-test-suite test:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/

The ignored error codes in sm_metadata_new_block() could crash the
kernel through use of either the dm-thin or dm-cache targets, e.g.:

device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
general protection fault: 0000 [#1] SMP
...
Workqueue: dm-thin do_worker [dm_thin_pool]
task: ffff880035ce2ab0 ti: ffff88021a054000 task.ti: ffff88021a054000
RIP: 0010:[<ffffffffa0331385>]  [<ffffffffa0331385>] metadata_ll_load_ie+0x15/0x30 [dm_persistent_data]
RSP: 0018:ffff88021a055a68  EFLAGS: 00010202
RAX: 003fc8243d212ba0 RBX: ffff88021a780070 RCX: ffff88021a055a78
RDX: ffff88021a055a78 RSI: 0040402222a92a80 RDI: ffff88021a780070
RBP: ffff88021a055a68 R08: ffff88021a055ba4 R09: 0000000000000010
R10: 0000000000000000 R11: 00000002a02e1000 R12: ffff88021a055ad4
R13: 0000000000000598 R14: ffffffffa0338470 R15: ffff88021a055ba4
FS:  0000000000000000(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f467c0291b8 CR3: 0000000001a0b000 CR4: 00000000000007e0
Stack:
 ffff88021a055ab8 ffffffffa0332020 ffff88021a055b30 0000000000000001
 ffff88021a055b30 0000000000000000 ffff88021a055b18 0000000000000000
 ffff88021a055ba4 ffff88021a055b98 ffff88021a055ae8 ffffffffa033304c
Call Trace:
 [<ffffffffa0332020>] sm_ll_lookup_bitmap+0x40/0xa0 [dm_persistent_data]
 [<ffffffffa033304c>] sm_metadata_count_is_more_than_one+0x8c/0xc0 [dm_persistent_data]
 [<ffffffffa0333825>] dm_tm_shadow_block+0x65/0x110 [dm_persistent_data]
 [<ffffffffa0331b00>] sm_ll_mutate+0x80/0x300 [dm_persistent_data]
 [<ffffffffa0330e60>] ? set_ref_count+0x10/0x10 [dm_persistent_data]
 [<ffffffffa0331dba>] sm_ll_inc+0x1a/0x20 [dm_persistent_data]
 [<ffffffffa0332270>] sm_disk_new_block+0x60/0x80 [dm_persistent_data]
 [<ffffffff81520036>] ? down_write+0x16/0x40
 [<ffffffffa001e5c4>] dm_pool_alloc_data_block+0x54/0x80 [dm_thin_pool]
 [<ffffffffa001b23c>] alloc_data_block+0x9c/0x130 [dm_thin_pool]
 [<ffffffffa001c27e>] provision_block+0x4e/0x180 [dm_thin_pool]
 [<ffffffffa001fe9a>] ? dm_thin_find_block+0x6a/0x110 [dm_thin_pool]
 [<ffffffffa001c57a>] process_bio+0x1ca/0x1f0 [dm_thin_pool]
 [<ffffffff8111e2ed>] ? mempool_free+0x8d/0xa0
 [<ffffffffa001d755>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
 [<ffffffffa001d911>] do_worker+0x51/0x60 [dm_thin_pool]
 [<ffffffff81067872>] process_one_work+0x182/0x3b0
 [<ffffffff81068c90>] worker_thread+0x120/0x3a0
 [<ffffffff81068b70>] ? manage_workers+0x160/0x160
 [<ffffffff8106eb2e>] kthread+0xce/0xe0
 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # v3.10+
2013-12-10 16:34:28 -05:00
Mikulas Patocka
5b2d06576c dm table: fail dm_table_create on dm_round_up overflow
The dm_round_up function may overflow to zero.  In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().

This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-12-10 16:34:27 -05:00
Mikulas Patocka
230c83afdd dm snapshot: avoid snapshot space leak on crash
There is a possible leak of snapshot space in case of crash.

The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.

For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250

Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260

Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.

This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.

Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
  to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-12-10 16:34:25 -05:00
Linus Torvalds
5ee540613d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes for the current series. It contains:

   - A fix for a use-after-free of a request in blk-mq.  From Ming Lei

   - A fix for a blk-mq bug that could attempt to dereference a NULL rq
     if allocation failed

   - Two xen-blkfront small fixes

   - Cleanup of submit_bio_wait() type uses in the kernel, unifying
     that.  From Kent

   - A fix for 32-bit blkg_rwstat reading.  I apologize for this one
     looking mangled in the shortlog, it's entirely my fault for missing
     an empty line between the description and body of the text"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: fix use-after-free of request
  blk-mq: fix dereference of rq->mq_ctx if allocation fails
  block: xen-blkfront: Fix possible NULL ptr dereference
  xen-blkfront: Silence pfn maybe-uninitialized warning
  block: submit_bio_wait() conversions
  Update of blkg_stat and blkg_rwstat may happen in bh context
2013-12-05 15:33:27 -08:00
Mike Snitzer
8d30726912 dm cache: increment bi_remaining when bi_end_io is restored
Move the bio->bi_remaining increment into dm_unhook_bio() so the
overwrite_endio() handler works as expected.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-12-03 19:16:04 -07:00
Wei Yongjun
08239ca2a0 bcache: fix sparse non static symbol warning
Fixes the following sparse warning:

drivers/md/bcache/btree.c:2220:5: warning:
 symbol 'btree_insert_fn' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-28 17:05:58 -08:00
NeilBrown
6d183de407 md/raid5: fix newly-broken locking in get_active_stripe.
commit 566c09c534 raid5: relieve lock contention in get_active_stripe()

modified the locking in get_active_stripe() reducing the range
protected by the (highly contended) device_lock.
Unfortunately it reduced the range too much opening up some races.

One race can occur if get_priority_stripe runs between the
test on sh->count and device_lock being taken.
This will mean that sh->lru is not empty while get_active_stripe
thinks ->count is zero resulting in a 'BUG' firing.

Another race happens if __release_stripe is called immediately
after sh->count is tested and found to be non-zero.  If STRIPE_HANDLE
is not set, get_active_stripe should increment ->active_stripes
when it increments ->count from 0, but as it didn't think it was 0,
it doesn't.

Extending device_lock to cover the test on sh->count close these
races.

While we are here, fix the two BUG tests:
 -If count is zero, then lru really must not be empty, or we've
  lock the stripe_head somehow - no other tests are relevant.
 -STRIPE_ON_RELEASE_LIST is completely independent of ->lru so
  testing it is pointless.

Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com>
Reviewed-by: Shaohua Li <shli@kernel.org>
Fixes: 566c09c534
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-28 11:00:15 +11:00
NeilBrown
142d44c310 md: test mddev->flags more safely in md_check_recovery.
commit 7a0a5355cb md: Don't test all of mddev->flags at once.
made most tests on mddev->flags safer, but missed one.

When
commit 260fa034ef md: avoid deadlock when dirty buffers during md_stop.
added MD_STILL_CLOSED, this caused md_check_recovery to misbehave.
It can think there is something to do but find nothing.  This can
lead to the md thread spinning during array shutdown.

https://bugzilla.kernel.org/show_bug.cgi?id=65721

Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 260fa034ef
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-28 11:00:08 +11:00
NeilBrown
0c775d5208 md/raid5: fix new memory-reference bug in alloc_thread_groups.
In alloc_thread_groups, worker_groups is a pointer to an array,
not an array of pointers.
So
   worker_groups[i]
is wrong.  It should be
   &(*worker_groups)[i]

Found-by: coverity
Fixes: 60aaf93385
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-28 11:00:04 +11:00
Kent Overstreet
c170bbb45f block: submit_bio_wait() conversions
It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-24 16:33:41 -07:00
Kent Overstreet
20d0189b10 block: Introduce new bio_split()
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.

Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.

(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>
2013-11-23 22:33:57 -08:00
Kent Overstreet
ee67891bf1 block: Rename bio_split() -> bio_pair_split()
This is prep work for introducing a more general bio_split().

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Sage Weil <sage@inktank.com>
2013-11-23 22:33:56 -08:00
Kent Overstreet
196d38bccf block: Generic bio chaining
This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.

Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
2013-11-23 22:33:56 -08:00
Kent Overstreet
e90abc8ec3 block: Remove bi_idx hacks
Now that drivers have been converted to the new bvec_iter primitives,
there's no need to trim the bvec before we submit it; and we can't trim
it once we start sharing bvecs.

It used to be that passing a partially completed bio (i.e. one with
nonzero bi_idx) to generic_make_request() was a dangerous thing -
various drivers would choke on such things. But with immutable biovecs
and our new bio splitting that shares the biovecs, submitting partially
completed bios has to work (and should work, now that all the drivers
have been completed to the new primitives)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
2013-11-23 22:33:55 -08:00
Kent Overstreet
1c3b13e64c dm: Refactor for new bio cloning/splitting
We need to convert the dm code to the new bvec_iter primitives which
respect bi_bvec_done; they also allow us to drastically simplify dm's
bio splitting code.

Also, it's no longer necessary to save/restore the bvec array anymore -
driver conversions for immutable bvecs are done, so drivers should never
be modifying it.

Also kill bio_sector_offset(), dm was the only user and it doesn't make
much sense anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
2013-11-23 22:33:55 -08:00
Kent Overstreet
59d276fe02 block: Add bio_clone_fast()
bio_clone() just got more expensive - however, most users of bio_clone()
don't actually need to modify the biovec. If they aren't modifying the
biovec, and they can guarantee that the original bio isn't freed before
the clone (also true in most cases), we can just point the clone at the
original bio's biovec.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-23 22:33:54 -08:00
Kent Overstreet
003b5c5719 block: Convert drivers to immutable biovecs
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
2013-11-23 22:33:51 -08:00
Kent Overstreet
458b76ed2f block: Kill bio_segments()/bi_vcnt usage
When we start sharing biovecs, keeping bi_vcnt accurate for splits is
going to be error prone - and unnecessary, if we refactor some code.

So bio_segments() has to go - but most of the existing users just needed
to know if the bio had multiple segments, which is easier - add a
bio_multiple_segments() for them.

(Two of the current uses of bio_segments() are going to go away in a
couple patches, but the current implementation of bio_segments() is
unsafe as soon as we start doing driver conversions for immutable
biovecs - so implement a dumb version for bisectability, it'll go away
in a couple patches)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23 22:33:51 -08:00
Kent Overstreet
7988613b0e block: Convert bio_for_each_segment() to bvec_iter
More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
2013-11-23 22:33:49 -08:00
Kent Overstreet
a4ad39b1d1 block: Convert bio_iovec() to bvec_iter
For immutable biovecs, we'll be introducing a new bio_iovec() that uses
our new bvec iterator to construct a biovec, taking into account
bvec_iter->bi_bvec_done - this patch updates existing users for the new
usage.

Some of the existing users really do need a pointer into the bvec array
- those uses are all going to be removed, but we'll need the
functionality from immutable to remove them - so for now rename the
existing bio_iovec() -> __bio_iovec(), and it'll be removed in a couple
patches.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23 22:33:49 -08:00
Kent Overstreet
75d5d81565 dm: Use bvec_iter for dm_bio_record()
This patch doesn't itself have any functional changes, but immutable
biovecs are going to add a bi_bvec_done member to bi_iter, which will
need to be saved too here.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
2013-11-23 22:33:48 -08:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Kent Overstreet
ed9c47bebe bcache: Kill unaligned bvec hack
Bcache has a hack to avoid cloning the biovec if it's all full pages -
but with immutable biovecs coming this won't be necessary anymore.

For now, we remove the special case and always clone the bvec array so
that the immutable biovec patches are simpler.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-23 22:33:47 -08:00
Kent Overstreet
33879d4512 block: submit_bio_wait() conversions
It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
2013-11-23 22:33:38 -08:00
Linus Torvalds
6d6e352c80 md update for 3.13.
Mostly optimisations and obscure bug fixes.
  - raid5 gets less lock contention
  - raid1 gets less contention between normal-io and resync-io
    during resync.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUovzDznsnt1WYoG5AQJ1pQ//bDuXadoJ5dwjWjVxFOKoQ9j/9joEI0yH
 XTApD3ADKckdBc4TSLOIbCNLW1Pbe23HlOI/GjCiJ/7mePL3OwHd7Fx8Rfq3BubV
 f7NgjVwu8nwYD0OXEZsshImptEtrbYwQdy+qlKcHXcZz1MUfR+Egih3r/ouTEfEt
 FNq/6MpyN0IKSY82xP/jFZgesBucgKz/YOUIbwClxm7UiyISKvWQLBIAfLB3dyI3
 HoEdEzQX6I56Rw0mkSUG4Mk+8xx/8twxL+yqEUqfdJREWuB56Km8kl8y/e465Nk0
 ZZg6j/TrslVEwbEeVMx0syvYcaAWFZ4X2jdKfo1lI0g9beZp7H1GRF8yR1s2t/h4
 g/vb55MEN++4LPaE9ut4z7SG2yLyGkZgFTzTjyq5of+DFL0cayO7wXxbgpcD7JYf
 Doef/OSa6csKiGiJI48iQa08Bolmz9ZWzZQXhAthKfFQ9Rv+GEtIAi4kLR8EZPbu
 0/FL1ylYNUY9O7p0g+iy9Kcoc+xW36I95pPZf8pO8GFcXTjyuCCBVh/SNvFZZHPl
 3xk3aZJknAEID8VrVG2IJPkeDI8WK8YxmpU/nARCoytn07Df6Ye8jGvLdR8pL3lB
 TIZV6eRY4yciB8LtoK9Kg4XTmOMhBtjt4c3znkljp98vhOQQb/oHN+BXMGcwqvr9
 fk0KGrg31VA=
 =8RCg
 -----END PGP SIGNATURE-----

Merge tag 'md/3.13' of git://neil.brown.name/md

Pull md update from Neil Brown:
 "Mostly optimisations and obscure bug fixes.
   - raid5 gets less lock contention
   - raid1 gets less contention between normal-io and resync-io during
     resync"

* tag 'md/3.13' of git://neil.brown.name/md:
  md/raid5: Use conf->device_lock protect changing of multi-thread resources.
  md/raid5: Before freeing old multi-thread worker, it should flush them.
  md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
  UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
  raid1: Rewrite the implementation of iobarrier.
  raid1: Add some macros to make code clearly.
  raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
  raid1: Add a field array_frozen to indicate whether raid in freeze state.
  md: Convert use of typedef ctl_table to struct ctl_table
  md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
  md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
  md: fix some places where mddev_lock return value is not checked.
  raid5: Retry R5_ReadNoMerge flag when hit a read error.
  raid5: relieve lock contention in get_active_stripe()
  raid5: relieve lock contention in get_active_stripe()
  wait: add wait_event_cmd()
  md/raid5.c: add proper locking to error path of raid5_start_reshape.
  md: fix calculation of stacking limits on level change.
  raid5: Use slow_path to release stripe when mddev->thread is null
2013-11-20 13:05:25 -08:00
majianpeng
60aaf93385 md/raid5: Use conf->device_lock protect changing of multi-thread resources.
When we change group_thread_cnt from sysfs entry, it can OOPS.

The kernel messages are:
[  135.299021] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[  135.299107] PGD 0
[  135.299122] Oops: 0000 [#1] SMP
[  135.299144] Modules linked in: netconsole e1000e ptp pps_core
[  135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
[  135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
[  135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
[  135.299283] RIP: 0010:[<ffffffff815188ab>]  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[  135.299323] RSP: 0018:ffff8800b77a5c48  EFLAGS: 00010002
[  135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
[  135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
[  135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
[  135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
[  135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
[  135.299479] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
[  135.299510] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
[  135.299559] Stack:
[  135.299570]  ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
[  135.299611]  000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
[  135.299654]  000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
[  135.299696] Call Trace:
[  135.299711]  [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
[  135.299733]  [<ffffffff81518f88>] raid5d+0x4c8/0x680
[  135.299756]  [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
[  135.299781]  [<ffffffff81524c9f>] md_thread+0x11f/0x170
[  135.299804]  [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
[  135.299826]  [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
[  135.299850]  [<ffffffff81069656>] kthread+0xc6/0xd0
[  135.299871]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[  135.299899]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
[  135.299923]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[  135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
[  135.300005] RIP  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[  135.300005]  RSP <ffff8800b77a5c48>
[  135.300005] CR2: 0000000000000000
[  135.300005] ---[ end trace 504854e5bb7562ed ]---
[  135.300005] Kernel panic - not syncing: Fatal exception

This is because raid5d() can be running when the multi-thread
resources are changed via system. We see need to provide locking.

mddev->device_lock is suitable, but we cannot simple call
alloc_thread_groups under this lock as we cannot allocate memory
while holding a spinlock.
So change alloc_thread_groups() to allocate and return the data
structures, then raid5_store_group_thread_cnt() can take the lock
while updating the pointers to the data structures.

This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
stable series.

Fixes: b721420e87
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
2013-11-19 15:19:18 +11:00
majianpeng
d206dcfa98 md/raid5: Before freeing old multi-thread worker, it should flush them.
When changing group_thread_cnt from sysfs entry, the kernel can oops.

The kernel messages are:
[  740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
[  740.961476] PGD b9013067 PUD b651e067 PMD 0
[  740.961503] Oops: 0000 [#1] SMP
[  740.961525] Modules linked in: netconsole e1000e ptp pps_core
[  740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
[  740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
[  740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
[  740.961673] RIP: 0010:[<ffffffff81062570>]  [<ffffffff81062570>] process_one_work+0x30/0x500
[  740.961708] RSP: 0018:ffff88013a247e08  EFLAGS: 00010086
[  740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
[  740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
[  740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
[  740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
[  740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
[  740.961861] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[  740.961893] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
[  740.962001] Stack:
[  740.962001]  0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
[  740.962001]  ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
[  740.962001]  ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
[  740.962001] Call Trace:
[  740.962001]  [<ffffffff810639c6>] worker_thread+0x206/0x3f0
[  740.962001]  [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
[  740.962001]  [<ffffffff81069656>] kthread+0xc6/0xd0
[  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[  740.962001]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
[  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[  740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
[  740.962001] RIP  [<ffffffff81062570>] process_one_work+0x30/0x500
[  740.962001]  RSP <ffff88013a247e08>
[  740.962001] CR2: 0000000000000008
[  740.962001] ---[ end trace 39181460000748de ]---
[  740.962001] Kernel panic - not syncing: Fatal exception

This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
A worker is queued to handle them.
But before calling raid5_do_work, raid5d handles those
stripes making conf->active_stripe = 0.
So mddev_suspend() can return.
We might then free old worker resources before the queued
raid5_do_work() handled them.  When it runs, it crashes.

	raid5d()		raid5_store_group_thread_cnt()
	queue_work		mddev_suspend()
				handle_strips
				active_stripe=0
				free(old worker resources)
	process_one_work
	raid5_do_work

To avoid this, we should only flush the worker resources before freeing them.

This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
stable series.

Cc: stable@vger.kernel.org (3.12)
Fixes: b721420e87
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
2013-11-19 15:19:18 +11:00
majianpeng
e59aa23f4c md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
For R5_ReadNoMerge,it mean this bio can't merge with other bios or
request.It used REQ_FLUSH to achieve this. But REQ_NOMERGE can do the
same work.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:18 +11:00
majianpeng
79ef3a8aa1 raid1: Rewrite the implementation of iobarrier.
There is an iobarrier in raid1 because of contention between normal IO and
resync IO.  It suspends all normal IO when resync/recovery happens.

However if normal IO is out side the resync window, there is no contention.
So this patch changes the barrier mechanism to only block IO that
could contend with the resync that is currently happening.

We partition the whole space into five parts.
|---------|-----------|------------|----------------|-------|
        start   next_resync   start_next_window    end_window

start + RESYNC_WINDOW = next_resync
next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
start_next_window + NEXT_NORMALIO_DISTANCE = end_window

Firstly we introduce some concepts:

1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
      same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
      So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
      and start_next_window.  It also indicates the distance between
      start_next_window and end_window.
      It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
      this turned out not to be optimal.
3 - next_resync: the next sector at which we will do sync IO.
4 - start: a position which is at most RESYNC_WINDOW before
      next_resync.
5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
      beyond next_resync.  Normal-io after this position doesn't need to
      wait for resync-io to complete.
6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
      next_resync.  This also doesn't need to wait, but is counted
      differently.
7 - current_window_requests:  the count of normalIO between
      start_next_window and end_window.
8 - next_window_requests: the count of normalIO after end_window.

NormalIO will be partitioned into four types:

NormIO1:  the end sector of bio is smaller or equal the start
NormIO2:  the start sector of bio larger or equal to end_window
NormIO3:  the start sector of bio larger or equal to
          start_next_window.
NormIO4:  the location between start_next_window and end_window

|--------|-----------|--------------------|----------------|-------------|
    | start   |   next_resync   |  start_next_window   |  end_window |
 NormIO1   NormIO4            NormIO4                NormIO3      NormIO2

For NormIO1, we don't need any io barrier.
For NormIO4, we used a similar approach to the original iobarrier
    mechanism.  The normalIO and resyncIO must be kept separate.
For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
    and "next_window_requests". They indicate the count of active
    requests in the two window.
    For these, we don't wait for resync io to complete.

For resync action, if there are NormIO4s, we must wait for it.
If not, we can proceed.
But if resync action reaches start_next_window and
current_window_requests > 0 (that is there are NormIO3s), we must
wait until the current_window_requests becomes zero.
When current_window_requests becomes zero,  start_next_window also
moves forward. Then current_window_requests will replaced by
next_window_requests.

There is a problem which when and how to change from NormIO2 to
NormIO3.  Only then can sync action progress.

We add a field in struct r1conf "start_next_window".

A: if start_next_window == MaxSector, it means there are no NormIO2/3.
   So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
B: if current_window_requests == 0 && next_window_requests != 0, it
   means start_next_window move to end_window

There is another problem which how to differentiate between
old NormIO2(now it is NormIO3) and NormIO2.
For example, there are many bios which are NormIO2 and a bio which is
NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3.

We add a field in struct r1bio "start_next_window".
This is used to record the position conf->start_next_window when the call
to wait_barrier() is made in make_request().

In allow_barrier(), we check the conf->start_next_window.
If r1bio->stat_next_window == conf->start_next_window, it means
there is no transition between NormIO2 and NormIO3.
If r1bio->start_next_window != conf->start_next_window, it mean
there was a transition between NormIO2 and NormIO3.  There can only
have been one transition.  So it only means the bio is old NormIO2.

For one bio, there may be many r1bio's. So we make sure
all the r1bio->start_next_window are the same value.
If we met blocked_dev in make_request(), it must call allow_barrier
and wait_barrier. So the former and the later value of
conf->start_next_window will be change.
If there are many r1bio's with differnet start_next_window,
for the relevant bio, it depend on the last value of r1bio.
It will cause error. To avoid this, we must wait for previous r1bios
to complete.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:18 +11:00
majianpeng
8e005f7c02 raid1: Add some macros to make code clearly.
In a subsequent patch, we'll use some const parameters.
Using macros will make the code clearly.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:18 +11:00
majianpeng
07169fd478 raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
We used to use raise_barrier to suspend normal IO while we reconfigure
the array.  However raise_barrier will soon only suspend some normal
IO, not all.  So we need something else.
Change it to use freeze_array.
But freeze_array not only suspends normal io, it also suspends
resync io.
For the place where call raise_barrier for reconfigure, it isn't a
problem.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:18 +11:00
majianpeng
b364e3d048 raid1: Add a field array_frozen to indicate whether raid in freeze state.
Because the following patch will rewrite the content between normal IO
and resync IO. So we used a parameter to indicate whether raid is in freeze
array.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:18 +11:00
Joe Perches
82592c38a8 md: Convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:18 +11:00
NeilBrown
30b8feb730 md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
When raid5 recovery hits a fresh badblock, this badblock will flagged as unack
badblock until md_update_sb() is called.
But md_stop will take reconfig lock which means raid5d can't call
md_update_sb() in md_check_recovery(), the badblock will always
be unack, so raid5d thread enters an infinite loop and md_stop_write()
can never stop sync_thread. This causes deadlock.

To solve this, when STOP_ARRAY ioctl is issued and sync_thread is
running, we need set md->recovery FROZEN and INTR flags and wait for
sync_thread to stop before we (re)take reconfig lock.

This requires that raid5 reshape_request notices MD_RECOVERY_INTR
(which it probably should have noticed anyway) and stops waiting for a
metadata update in that case.

Reported-by: Jianpeng Ma <majianpeng@gmail.com>
Reported-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:17 +11:00
NeilBrown
c91abf5a35 md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
We currently use kthread_should_stop() in various places in the
sync/reshape code to abort early.
However some places set MD_RECOVERY_INTR but don't immediately call
md_reap_sync_thread() (and we will shortly get another one).
When this happens we are relying on md_check_recovery() to reap the
thread and that only happen when it finishes normally.
So MD_RECOVERY_INTR must lead to a normal finish without the
kthread_should_stop() test.

So replace all relevant tests, and be more careful when the thread is
interrupted not to acknowledge that latest step in a reshape as it may
not be fully committed yet.

Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
so we don't wait have to wait for the speed to drop before we can abort.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:17 +11:00
NeilBrown
29f097c4d9 md: fix some places where mddev_lock return value is not checked.
Sometimes we need to lock and mddev and cannot cope with
failure due to interrupt.
In these cases we should use mutex_lock, not mutex_lock_interruptible.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:19:17 +11:00
Bian Yu
edfa1f651e raid5: Retry R5_ReadNoMerge flag when hit a read error.
Because of block layer merge, one bio fails will cause other bios
which belongs to the same request fails, so raid5_end_read_request
will record all these bios as badblocks.
If retry request with R5_ReadNoMerge flag to avoid bios merge,
badblocks can only record sector which is bad exactly.

test:
hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
mdadm /dev/md0 -f /dev/sdd
mdadm /dev/md0 -r /dev/sdd
mdadm --zero-superblock /dev/sdd
mdadm /dev/md0 -a /dev/sdd

1. Without this patch:
cat /sys/block/md0/md/rd*/bad_blocks
299776 256
299776 256

2. With this patch:
cat /sys/block/md0/md/rd*/bad_blocks
300000 8
300000 8

Signed-off-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:18:24 +11:00
Shaohua Li
4bda556aea raid5: relieve lock contention in get_active_stripe()
track empty inactive list count, so md_raid5_congested() can use it to make
decision.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19 15:18:22 +11:00
Mikulas Patocka
718822c1c1 dm delay: fix a possible deadlock due to shared workqueue
The dm-delay target uses a shared workqueue for multiple instances.  This
can cause deadlock if two or more dm-delay targets are stacked on the top
of each other.

This patch changes dm-delay to use a per-instance workqueue.

Cc: stable@vger.kernel.org # 2.6.22+

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-18 11:23:21 -05:00
Linus Torvalds
9073e1a804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual earth-shaking, news-breaking, rocket science pile from
  trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
  doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
  doc: add missing files to timers/00-INDEX
  timekeeping: Fix some trivial typos in comments
  mm: Fix some trivial typos in comments
  irq: Fix some trivial typos in comments
  NUMA: fix typos in Kconfig help text
  mm: update 00-INDEX
  doc: Documentation/DMA-attributes.txt fix typo
  DRM: comment: `halve' -> `half'
  Docs: Kconfig: `devlopers' -> `developers'
  doc: typo on word accounting in kprobes.c in mutliple architectures
  treewide: fix "usefull" typo
  treewide: fix "distingush" typo
  mm/Kconfig: Grammar s/an/a/
  kexec: Typo s/the/then/
  Documentation/kvm: Update cpuid documentation for steal time and pv eoi
  treewide: Fix common typo in "identify"
  __page_to_pfn: Fix typo in comment
  Correct some typos for word frequency
  clk: fixed-factor: Fix a trivial typo
  ...
2013-11-15 16:47:22 -08:00
Linus Torvalds
f412f2c60b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull second round of block driver updates from Jens Axboe:
 "As mentioned in the original pull request, the bcache bits were pulled
  because of their dependency on the immutable bio vecs.  Kent re-did
  this part and resubmitted it, so here's the 2nd round of (mostly)
  driver updates for 3.13.  It contains:

 - The bcache work from Kent.

 - Conversion of virtio-blk to blk-mq.  This removes the bio and request
   path, and substitutes with the blk-mq path instead.  The end result
   almost 200 deleted lines.  Patch is acked by Asias and Christoph, who
   both did a bunch of testing.

 - A removal of bootmem.h include from Grygorii Strashko, part of a
   larger series of his killing the dependency on that header file.

 - Removal of __cpuinit from blk-mq from Paul Gortmaker"

* 'for-linus' of git://git.kernel.dk/linux-block: (56 commits)
  virtio_blk: blk-mq support
  blk-mq: remove newly added instances of __cpuinit
  bcache: defensively handle format strings
  bcache: Bypass torture test
  bcache: Delete some slower inline asm
  bcache: Use ida for bcache block dev minor
  bcache: Fix sysfs splat on shutdown with flash only devs
  bcache: Better full stripe scanning
  bcache: Have btree_split() insert into parent directly
  bcache: Move spinlock into struct time_stats
  bcache: Kill sequential_merge option
  bcache: Kill bch_next_recurse_key()
  bcache: Avoid deadlocking in garbage collection
  bcache: Incremental gc
  bcache: Add make_btree_freeing_key()
  bcache: Add btree_node_write_sync()
  bcache: PRECEDING_KEY()
  bcache: bch_(btree|extent)_ptr_invalid()
  bcache: Don't bother with bucket refcount for btree node allocations
  bcache: Debug code improvements
  ...
2013-11-15 16:33:41 -08:00
Christoph Hellwig
b89241e8cd llists: move llist_reverse_order from raid5 to llist.c
Make this useful helper available for other users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:22 +09:00
Wolfram Sang
16735d022f tree-wide: use reinit_completion instead of INIT_COMPLETION
Use this new function to make code more comprehensible, since we are
reinitialzing the completion, not initializing.

[akpm@linux-foundation.org: linux-next resyncs]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org> (personally at LCE13)
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:21 +09:00
Shaohua Li
566c09c534 raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.

The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.

With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.

The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.

If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.

One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.

This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 15:20:58 +11:00
NeilBrown
ba8805b973 md/raid5.c: add proper locking to error path of raid5_start_reshape.
If raid5_start_reshape errors out, we need to reset all the fields
that were updated (not just some), and need to use the seq_counter
to ensure make_request() doesn't use an inconsitent state.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 15:16:15 +11:00
NeilBrown
02e5f5c0a0 md: fix calculation of stacking limits on level change.
The various ->run routines of md personalities assume that the 'queue'
has been initialised by the blk_set_stacking_limits() call in
md_alloc().

However when the level is changed (by level_store()) the ->run routine
for the new level is called for an array which has already had the
stacking limits modified.  This can result in incorrect final
settings.

So call blk_set_stacking_limits() before ->run in level_store().

A specific consequence of this bug is that it causes
discard_granularity to be set incorrectly when reshaping a RAID4 to a
RAID0.

This is suitable for any -stable kernel since 3.3 in which
blk_set_stacking_limits() was introduced.

Cc: stable@vger.kernel.org (3.3+)
Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 15:16:15 +11:00
majianpeng
ad4068de49 raid5: Use slow_path to release stripe when mddev->thread is null
When release_stripe() is called in grow_one_stripe(), the
mddev->thread is null. So it will omit one wakeup this thread to
release stripe.
For this condition, use slow_path to release stripe.

Bug was introduced in 3.12

Cc: stable@vger.kernel.org (3.12+)
Fixes: 773ca82fa1
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 15:16:15 +11:00
Linus Torvalds
7f2dc5c4bc A set of device-mapper changes for 3.13.
Improve reliability of buffer allocations for dm messages with a small
 number of arguments, a couple path group initialization fixes for dm
 multipath, a fix for resizing a dm array, various fixes and
 optimizations for dm cache, a fix for device mapper's Kconfig menu
 indentation.
 
 Features added include:
 - dm crypt support for activating legacy CBC TrueCrypt containers
   (useful for forensics of these old TCRYPT containers)
 - reduced dm-cache memory requirements for each block in the cache
 - basic support for shrinking a dm-cache's cache (fast) device
 - most notably, dm-cache support for managing cache coherency when
   deploying dm-cache with sophisticated origin volumes (that support
   hardware snapshots and/or clustering): these changes come in the form
   of a new passthrough operation mode and a cache block invalidation
   interface.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSgt+QAAoJEMUj8QotnQNapcEIALC6U1rmw08PRMSanqg4/aVu
 pTahzPtai9jXchQV6q5XsglJryrhD9MoNqrZgHd2drdnmEKTKfVX+/iCXGiE4hQ5
 I5QUZf5myEXSd60pCgZwNam+VHMuAuSPQW6LWqRTJjDLHixGF+AoHZGxkEsYgj6M
 p686OOpga1nmT2w072xLIh9z2tsv/tm+UN7GSbyklM+/1ItcXxq+/J8rsuth7IqT
 k0I60jexq+Q3OaYuJY7vxhdE7PhBCw1fGmtuCcjekqsSVpAdCgDz3FFOEZmyXcUs
 YLFE3GcclYQpIPjNjVGTLDFHdoIMWdKiibs/ScBUtegqxWvqP7c87YFhbL+VHDM=
 =lLxo
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper changes from Mike Snitzer:
 "A set of device-mapper changes for 3.13.

  Improve reliability of buffer allocations for dm messages with a small
  number of arguments, a couple path group initialization fixes for dm
  multipath, a fix for resizing a dm array, various fixes and
  optimizations for dm cache, a fix for device mapper's Kconfig menu
  indentation.

  Features added include:
   - dm crypt support for activating legacy CBC TrueCrypt containers
     (useful for forensics of these old TCRYPT containers)
   - reduced dm-cache memory requirements for each block in the cache
   - basic support for shrinking a dm-cache's cache (fast) device
   - most notably, dm-cache support for managing cache coherency when
     deploying dm-cache with sophisticated origin volumes (that support
     hardware snapshots and/or clustering): these changes come in the
     form of a new passthrough operation mode and a cache block
     invalidation interface"

* tag 'dm-3.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (32 commits)
  dm cache: resolve small nits and improve Documentation
  dm cache: add cache block invalidation support
  dm cache: add remove_cblock method to policy interface
  dm cache policy mq: reduce memory requirements
  dm cache metadata: check the metadata version when reading the superblock
  dm cache: add passthrough mode
  dm cache: cache shrinking support
  dm cache: promotion optimisation for writes
  dm cache: be much more aggressive about promoting writes to discarded blocks
  dm cache policy mq: implement writeback_work() and mq_{set,clear}_dirty()
  dm cache: optimize commit_if_needed
  dm space map disk: optimise sm_disk_dec_block
  MAINTAINERS: add reference to device-mapper's linux-dm.git tree
  dm: fix Kconfig menu indentation
  dm: allow remove to be deferred
  dm table: print error on preresume failure
  dm crypt: add TCW IV mode for old CBC TCRYPT containers
  dm crypt: properly handle extra key string in initialization
  dm cache: log error message if dm_kcopyd_copy() fails
  dm cache: use cell_defer() boolean argument consistently
  ...
2013-11-14 12:35:48 +09:00
Linus Torvalds
0910c0bdf7 Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block
Pull block IO core updates from Jens Axboe:
 "This is the pull request for the core changes in the block layer for
  3.13.  It contains:

   - The new blk-mq request interface.

     This is a new and more scalable queueing model that marries the
     best part of the request based interface we currently have (which
     is fully featured, but scales poorly) and the bio based "interface"
     which the new drivers for high IOPS devices end up using because
     it's much faster than the request based one.

     The bio interface has no block layer support, since it taps into
     the stack much earlier.  This means that drivers end up having to
     implement a lot of functionality on their own, like tagging,
     timeout handling, requeue, etc.  The blk-mq interface provides all
     these.  Some drivers even provide a switch to select bio or rq and
     has code to handle both, since things like merging only works in
     the rq model and hence is faster for some workloads.  This is a
     huge mess.  Conversion of these drivers nets us a substantial code
     reduction.  Initial results on converting SCSI to this model even
     shows an 8x improvement on single queue devices.  So while the
     model was intended to work on the newer multiqueue devices, it has
     substantial improvements for "classic" hardware as well.  This code
     has gone through extensive testing and development, it's now ready
     to go.  A pull request is coming to convert virtio-blk to this
     model will be will be coming as well, with more drivers scheduled
     for 3.14 conversion.

   - Two blktrace fixes from Jan and Chen Gang.

   - A plug merge fix from Alireza Haghdoost.

   - Conversion of __get_cpu_var() from Christoph Lameter.

   - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

   - A fix for a race between request completion and the timeout
     handling from Jeff Moyer.  This is what caused the merge conflict
     with blk-mq/core, in case you are looking at that.

   - A dm stacking fix from Mike Snitzer.

   - A code consolidation fix and duplicated code removal from Kent
     Overstreet.

   - A handful of block bug fixes from Mikulas Patocka, fixing a loop
     crash and memory corruption on blk cg.

   - Elevator switch bug fix from Tomoki Sekiyama.

  A heads-up that I had to rebase this branch.  Initially the immutable
  bio_vecs had been queued up for inclusion, but a week later, it became
  clear that it wasn't fully cooked yet.  So the decision was made to
  pull this out and postpone it until 3.14.  It was a straight forward
  rebase, just pruning out the immutable series and the later fixes of
  problems with it.  The rest of the patches applied directly and no
  further changes were made"

* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: Do not call sector_div() with a 64-bit divisor
  kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
  block: Consolidate duplicated bio_trim() implementations
  block: Use rw_copy_check_uvector()
  block: Enable sysfs nomerge control for I/O requests in the plug list
  block: properly stack underlying max_segment_size to DM device
  elevator: acquire q->sysfs_lock in elevator_change()
  elevator: Fix a race in elevator switching and md device initialization
  block: Replace __get_cpu_var uses
  bdi: test bdi_init failure
  block: fix a probe argument to blk_register_region
  loop: fix crash if blk_alloc_queue fails
  blk-core: Fix memory corruption if blkcg_init_queue fails
  block: fix race between request completion and timeout handling
  blktrace: Send BLK_TN_PROCESS events to all running traces
  blk-mq: don't disallow request merges for req->special being set
  blk-mq: mq plug list breakage
  blk-mq: fix for flush deadlock
  ...
2013-11-14 12:08:14 +09:00
Mike Snitzer
7b6b2bc98c dm cache: resolve small nits and improve Documentation
Document passthrough mode, cache shrinking, and cache invalidation.
Also, use strcasecmp() and hlist_unhashed().

Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-12 13:11:09 -05:00
Joe Thornber
65790ff919 dm cache: add cache block invalidation support
Cache block invalidation is removing an entry from the cache without
writing it back.  Cache blocks can be invalidated via the
'invalidate_cblocks' message, which takes an arbitrary number of cblock
ranges:
   invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.
   dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:51 -05:00
Joe Thornber
532906aa7f dm cache: add remove_cblock method to policy interface
Implement policy_remove_cblock() and add remove_cblock method to the mq
policy.  These methods will be used by the following cache block
invalidation patch which adds the 'invalidate_cblocks' message to the
cache core.

Also, update some comments in dm-cache-policy.h

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:50 -05:00
Joe Thornber
633618e335 dm cache policy mq: reduce memory requirements
Rather than storing the cblock in each cache entry, we allocate all
entries in an array and infer the cblock from the entry position.

Saves 4 bytes of memory per cache block.  In addition, this gives us an
easy way of looking up cache entries by cblock.

We no longer need to keep an explicit bitset to track which cblocks
have been allocated.  And no searching is needed to find free cblocks.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:50 -05:00
Joe Thornber
53d498198d dm cache metadata: check the metadata version when reading the superblock
Need to check the version to verify on-disk metadata is supported.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:49 -05:00
Joe Thornber
2ee57d5873 dm cache: add passthrough mode
"Passthrough" is a dm-cache operating mode (like writethrough or
writeback) which is intended to be used when the cache contents are not
known to be coherent with the origin device.  It behaves as follows:

* All reads are served from the origin device (all reads miss the cache)
* All writes are forwarded to the origin device; additionally, write
  hits cause cache block invalidates

This mode decouples cache coherency checks from cache device creation,
largely to avoid having to perform coherency checks while booting.  Boot
scripts can create cache devices in passthrough mode and put them into
service (mount cached filesystems, for example) without having to worry
about coherency.  Coherency that exists is maintained, although the
cache will gradually cool as writes take place.

Later, applications can perform coherency checks, the nature of which
will depend on the type of the underlying storage.  If coherency can be
verified, the cache device can be transitioned to writethrough or
writeback mode while still warm; otherwise, the cache contents can be
discarded prior to transitioning to the desired operating mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:49 -05:00
Joe Thornber
f494a9c6b1 dm cache: cache shrinking support
Allow a cache to shrink if the blocks being removed from the cache are
not dirty.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-11 11:37:45 -05:00
Kees Cook
c86949486d bcache: defensively handle format strings
Just to be safe, call the error reporting function with "%s" to avoid
any possible future format string leak.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:43 -08:00
Kent Overstreet
5ceaaad704 bcache: Bypass torture test
More testing ftw! Also, now verify mode doesn't break if you read dirty
data.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:43 -08:00
Kent Overstreet
098fb25498 bcache: Delete some slower inline asm
Never saw a profile of bset_search_tree() where it wasn't bottlenecked
on memory until I got my new Haswell machine, but when I tried it there
it was suddenly burning 20% of the cpu in the inner loop on shrd...

Turns out, the version of shrd that takes 64 bit operands has a 9 cycle
latency. hah.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:42 -08:00
Kent Overstreet
28935ab516 bcache: Use ida for bcache block dev minor
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:42 -08:00
Kent Overstreet
c4d951ddb6 bcache: Fix sysfs splat on shutdown with flash only devs
Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:41 -08:00
Kent Overstreet
48a915a87f bcache: Better full stripe scanning
The old scanning-by-stripe code burned too much CPU, this should be
better.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:41 -08:00
Kent Overstreet
17e21a9f24 bcache: Have btree_split() insert into parent directly
The flow control in btree_insert_node() was... fragile... before,
this'll use more stack (but since our btrees are never more than depth
1, that shouldn't matter) and it should be significantly clearer and
less fragile.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:40 -08:00
Kent Overstreet
65d22e911b bcache: Move spinlock into struct time_stats
Minor cleanup.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:40 -08:00
Kent Overstreet
8aee122071 bcache: Kill sequential_merge option
It never really made sense to expose this, so just kill it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:39 -08:00
Kent Overstreet
50310164bc bcache: Kill bch_next_recurse_key()
This dates from before the btree iterator, and now it's finally gone

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:39 -08:00
Kent Overstreet
bc9389eefe bcache: Avoid deadlocking in garbage collection
Not a complete fix - we could still deadlock if btree_insert_node() has
to split...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:38 -08:00
Kent Overstreet
a1f0358b2b bcache: Incremental gc
Big garbage collection rewrite; now, garbage collection uses the same
mechanisms as used elsewhere for inserting/updating btree node pointers,
instead of rewriting interior btree nodes in place.

This makes the code significantly cleaner and less fragile, and means we
can now make garbage collection incremental - it doesn't have to hold a
write lock on the root of the btree for the entire duration of garbage
collection.

This means that there's less of a latency hit for doing garbage
collection, which means we can gc more frequently (and do a better job
of reclaiming from the cache), and we can coalesce across more btree
nodes (improving our space efficiency).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:37 -08:00
Kent Overstreet
8835c1234d bcache: Add make_btree_freeing_key()
Refactoring, prep work for incremental garbage collection.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:37 -08:00
Kent Overstreet
f269af5a07 bcache: Add btree_node_write_sync()
More refactoring - mostly making the interfaces more explicit about what
we actually want to do.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:36 -08:00
Kent Overstreet
0eacac2203 bcache: PRECEDING_KEY()
btree_insert_key() was open coding this, this is just refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:36 -08:00
Kent Overstreet
d5cc66e957 bcache: bch_(btree|extent)_ptr_invalid()
Trying to treat btree pointers and leaf node pointers the same way was a
mistake - going to start being more explicit about the type of
key/pointer we're dealing with. This is the first part of that
refactoring; this patch shouldn't change any actual behaviour.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:35 -08:00
Kent Overstreet
3a3b6a4e07 bcache: Don't bother with bucket refcount for btree node allocations
The bucket refcount (dropped with bkey_put()) is only needed to prevent
the newly allocated bucket from being garbage collected until we've
added a pointer to it somewhere. But for btree node allocations, the
fact that we have btree nodes locked is enough to guard against races
with garbage collection.

Eventually the per bucket refcount is going to be replaced with
something specific to bch_alloc_sectors().

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:34 -08:00
Kent Overstreet
280481d06c bcache: Debug code improvements
Couple changes:
 * Consolidate bch_check_keys() and bch_check_key_order(), and move the
   checks that only check_key_order() could do to bch_btree_iter_next().

 * Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in
   when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file to
   flip on the EDEBUG checks at runtime.

 * Dropped an old not terribly useful check in rw_unlock(), and
   refactored/improved a some of the other debug code.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:34 -08:00
Kent Overstreet
e58ff15503 bcache: Fix bch_ptr_bad()
Previously, bch_ptr_bad() could return false when there was a pointer to
a nonexistant device... it only filtered out keys with PTR_CHECK_DEV
pointers.

This behaviour was intended for multiple cache device support; for that,
just because the device for one of the pointers has gone away doesn't
mean we want to filter out the rest of the pointers.

But we don't yet explicitly filter/check individual pointers, so without
that this behaviour was wrong - a corrupt bkey with a bad device pointer
could cause us to deref a bad pointer. Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:33 -08:00
Kent Overstreet
81ab4190ac bcache: Pull on disk data structures out into a separate header
Now, the on disk data structures are in a header that can be exported to
userspace - and having them all centralized is nice too.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:33 -08:00
Kent Overstreet
2599b53b7b bcache: Move sector allocator to alloc.c
Just reorganizing things a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:32 -08:00
Kent Overstreet
220bb38c21 bcache: Break up struct search
With all the recent refactoring around struct btree op struct search has
gotten rather large.

But we can now easily break it up in a different way - we break out
struct btree_insert_op which is for inserting data into the cache, and
that's now what the copying gc code uses - struct search is now specific
to request.c

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:32 -08:00
Kent Overstreet
cc7b881921 bcache: Convert bch_btree_insert() to bch_btree_map_leaf_nodes()
Last of the btree_map() conversions. Main visible effect is
bch_btree_insert() is no longer taking a struct btree_op as an argument
anymore - there's no fancy state machine stuff going on, it's just a
normal function.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:31 -08:00
Kent Overstreet
6054c6d4da bcache: Don't use op->insert_collision
When we convert bch_btree_insert() to bch_btree_map_leaf_nodes(), we
won't be passing struct btree_op to bch_btree_insert() anymore - so we
need a different way of returning whether there was a collision (really,
a replace collision).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:30 -08:00
Kent Overstreet
1b207d80d5 bcache: Kill op->replace
This is prep work for converting bch_btree_insert to
bch_btree_map_leaf_nodes() - we have to convert all its arguments to
actual arguments. Bunch of churn, but should be straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:29 -08:00
Kent Overstreet
faadf0c965 bcache: Drop some closure stuff
With a the recent bcache refactoring, some of the closure code isn't
needed anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:10 -08:00
Kent Overstreet
b54d6934da bcache: Kill op->cl
This isn't used for waiting asynchronously anymore - so this is a fairly
trivial refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:09 -08:00
Kent Overstreet
c18536a72d bcache: Prune struct btree_op
Eventual goal is for struct btree_op to contain only what is necessary
for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:08 -08:00
Kent Overstreet
cc23196631 bcache: Clean up cache_lookup_fn
There was some looping in submit_partial_cache_hit() and
submit_partial_cache_hit() that isn't needed anymore - originally, we
wouldn't necessarily process the full hit or miss all at once because
when splitting the bio, we took into account the restrictions of the
device we were sending it to.

But, device bio size restrictions are now handled elsewhere, with a
wrapper around generic_make_request() - so that looping has been
unnecessary for awhile now and we can now do quite a bit of cleanup.

And if we trim the key we're reading from to match the subset we're
actually reading, we don't have to explicitly calculate bi_sector
anymore. Neat.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:08 -08:00
Kent Overstreet
2c1953e201 bcache: Convert bch_btree_read_async() to bch_btree_map_keys()
This is a fairly straightforward conversion, mostly reshuffling -
op->lookup_done goes away, replaced by MAP_DONE/MAP_CONTINUE. And the
code for handling cache hits and misses wasn't really btree code, so it
gets moved to request.c.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:07 -08:00
Kent Overstreet
df8e89701f bcache: Move some stuff to btree.c
With the new btree_map() functions, we don't need to export the stuff
needed for traversing the btree anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:07 -08:00
Kent Overstreet
48dad8baf9 bcache: Add btree_map() functions
Lots of stuff has been open coding its own btree traversal - which is
generally pretty simple code, but there are a few subtleties.

This adds new new functions, bch_btree_map_nodes() and
bch_btree_map_keys(), which do the traversal for you. Everything that's
open coding btree traversal now (with the exception of garbage
collection) is slowly going to be converted to these two functions;
being able to write other code at a higher level of abstraction  is a
big improvement w.r.t. overall code quality.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:06 -08:00
Kent Overstreet
5e6926daac bcache: Convert writeback to a kthread
This simplifies the writeback flow control quite a bit - previously, it
was conceptually two coroutines, refill_dirty() and read_dirty(). This
makes the code quite a bit more straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:05 -08:00
Kent Overstreet
72a44517f3 bcache: Convert gc to a kthread
We needed a dedicated rescuer workqueue for gc anyways... and gc was
conceptually a dedicated thread, just one that wasn't running all the
time. Switch it to a dedicated thread to make the code a bit more
straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:04 -08:00
Kent Overstreet
35fcd848d7 bcache: Convert bucket_wait to wait_queue_head_t
At one point we did do fancy asynchronous waiting stuff with
bucket_wait, but that's all gone (and bucket_wait is used a lot less
than it used to be). So use the standard primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:04 -08:00
Kent Overstreet
e8e1d4682c bcache: Convert try_wait to wait_queue_head_t
We never waited on c->try_wait asynchronously, so just use the standard
primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:03 -08:00
Kent Overstreet
0b93207abb bcache: Move keylist out of btree_op
Slowly working on pruning struct btree_op - the aim is for it to only
contain things that are actually necessary for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:02 -08:00
Kent Overstreet
a34a8bfd4e bcache: Refactor journalling flow control
Making things less asynchronous that don't need to be - bch_journal()
only has to block when the journal or journal entry is full, which is
emphatically not a fast path. So make it a normal function that just
returns when it finishes, to make the code and control flow easier to
follow.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:02 -08:00
Kent Overstreet
cdd972b164 bcache: Refactor read request code a bit
More refactoring, and renaming.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:01 -08:00
Kent Overstreet
84f0db03ea bcache: Refactor request_write()
Try to improve some of the naming a bit to be more consistent, and also
improve the flow of control in request_write() a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:00 -08:00
Kent Overstreet
c2f95ae2eb bcache: Clean up keylist code
More random refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:00 -08:00
Kent Overstreet
4f3d40147b bcache: Add explicit keylist arg to btree_insert()
Some refactoring - better to explicitly pass stuff around instead of
having it all in the "big bag of state", struct btree_op. Going to prune
struct btree_op quite a bit over time.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:59 -08:00
Kent Overstreet
e7c590eb63 bcache: Convert btree_insert_check_key() to btree_insert_node()
This was the main point of all this refactoring - now,
btree_insert_check_key() won't fail just because the leaf node happened
to be full.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:59 -08:00
Kent Overstreet
403b6cdeb1 bcache: Insert multiple keys at a time
We'll often end up with a list of adjacent keys to insert -
because bch_data_insert() may have to fragment the data it writes.

Originally, to simplify things and avoid having to deal with corner
cases bch_btree_insert() would pass keys from this list one at a time to
btree_insert_recurse() - mainly because the list of keys might span leaf
nodes, so it was easier this way.

With the btree_insert_node() refactoring, it's now a lot easier to just
pass down the whole list and have btree_insert_recurse() iterate over
leaf nodes until it's done.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:58 -08:00
Kent Overstreet
26c949f806 bcache: Add btree_insert_node()
The flow of control in the old btree insertion code was rather -
backwards; we'd recurse down the btree (in btree_insert_recurse()), and
then if we needed to split the keys to be inserted into the parent node
would be effectively returned up to btree_insert_recurse(), which would
notice there was more work to do and finish the insertion.

The main problem with this was that the full logic for btree insertion
could only be used by calling btree_insert_recurse; if you'd gotten to a
btree leaf some other way and had a key to insert, if it turned out that
node needed to be split you were SOL.

This inverts the flow of control so btree_insert_node() does _full_
btree insertion, including splitting - and takes a (leaf) btree node to
insert into as a parameter.

This means we can now _correctly_ handle cache misses - for cache
misses, we need to insert a fake "check" key into the btree when we
discover we have a cache miss - while we still have the btree locked.
Previously, if the btree node was full inserting a cache miss would just
fail.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:57 -08:00
Kent Overstreet
d6fd3b11ce bcache: Explicitly track btree node's parent
This is prep work for the reworked btree insertion code.

The way we set b->parent is ugly and hacky... the problem is, when
btree_split() or garbage collection splits or rewrites a btree node, the
parent changes for all its (potentially already cached) children.

I may change this later and add some code to look through the btree node
cache and find all our cached child nodes and change the parent pointer
then...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:57 -08:00
Kent Overstreet
8304ad4dc8 bcache: Remove unnecessary check in should_split()
Checking i->seq was redundant, because since ages ago we always
initialize the new bset when advancing b->written

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:56 -08:00
Kent Overstreet
2d679fc756 bcache: Stripe size isn't necessarily a power of two
Originally I got this right... except that the divides didn't use
do_div(), which broke 32 bit kernels. When I went to fix that, I forgot
that the raid stripe size usually isn't a power of two... doh

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:55 -08:00
Kent Overstreet
77c320eb46 bcache: Add on error panic/unregister setting
Works kind of like the ext4 setting, to panic or remount read only on
errors.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:55 -08:00
Kent Overstreet
49b1212dfa bcache: Use blkdev_issue_discard()
The old asynchronous discard code was really a relic from when all the
allocation code was asynchronous - now that allocation runs out of a
dedicated thread there's no point in keeping around all that complicated
machinery.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:54 -08:00
Kent Overstreet
dd9ec84da5 bcache: Fix a lockdep splat
bch_keybuf_del() takes a spinlock that can't be taken in interrupt context -
whoops. Fortunately, this code isn't enabled by default (you have to toggle a
sysfs thing).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:54 -08:00
Kent Overstreet
7857d5d470 bcache: Fix a journalling performance bug 2013-11-10 21:55:53 -08:00
Kent Overstreet
1fa8455deb bcache: Fix dirty_data accounting
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-11-10 21:55:27 -08:00
Joe Thornber
c9d28d5d09 dm cache: promotion optimisation for writes
If a write block triggers promotion and covers a whole block we can
avoid a copy.

Introduce dm_{hook,unhook}_bio to simplify saving and restoring bio
fields (bi_private is now used by overwrite).  Switch writethrough
support over to using these helpers too.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:26 -05:00
Joe Thornber
c86c30706c dm cache: be much more aggressive about promoting writes to discarded blocks
Previously these promotions only got priority if there were unused cache
blocks.  Now we give them priority if there are any clean blocks in the
cache.

The fio_soak_test in the device-mapper-test-suite now gives uniform
performance across subvolumes (~16 seconds).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:25 -05:00
Joe Thornber
01911c19be dm cache policy mq: implement writeback_work() and mq_{set,clear}_dirty()
There are now two multiqueues for in cache blocks.  A clean one and a
dirty one.

writeback_work comes from the dirty one.  Demotions come from the clean
one.

There are two benefits:
- Performance improvement, since demoting a clean block is a noop.
- The cache cleans itself when io load is light.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:25 -05:00
Heinz Mauelshagen
ffcbcb6720 dm cache: optimize commit_if_needed
Check commit_requested flag _before_ calling
dm_cache_changed_this_transaction() superfluously.

Also, be sure to set last_commit_jiffies _after_ dm_cache_commit()
completes.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:24 -05:00
Joe Thornber
40c57f475f dm space map disk: optimise sm_disk_dec_block
Don't waste time spotting blocks that have been allocated and then freed
in the same transaction.

The extra lookup is expensive, and I don't think it really gives us much.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:24 -05:00
Mikulas Patocka
5442851edb dm: fix Kconfig menu indentation
The option DM_LOG_USERSPACE is sub-option of DM_MIRROR, so place it
right after DM_MIRROR.  Doing so fixes various other Device mapper
targets/features to be properly nested under "Device mapper support".

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:22 -05:00
Mikulas Patocka
2c140a246d dm: allow remove to be deferred
This patch allows the removal of an open device to be deferred until
it is closed.  (Previously such a removal attempt would fail.)

The deferred remove functionality is enabled by setting the flag
DM_DEFERRED_REMOVE in the ioctl structure on DM_DEV_REMOVE or
DM_REMOVE_ALL ioctl.

On return from DM_DEV_REMOVE, the flag DM_DEFERRED_REMOVE indicates if
the device was removed immediately or flagged to be removed on close -
if the flag is clear, the device was removed.

On return from DM_DEV_STATUS and other ioctls, the flag
DM_DEFERRED_REMOVE is set if the device is scheduled to be removed on
closure.

A device that is scheduled to be deleted can be revived using the
message "@cancel_deferred_remove". This message clears the
DMF_DEFERRED_REMOVE flag so that the device won't be deleted on close.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:22 -05:00
Mike Snitzer
7833b08e18 dm table: print error on preresume failure
If preresume fails it is worth logging an error given that a device is
left suspended due to the failure.

This change was motivated by local preresume error logging that was
added to the cache target ("preresume failed").  Elevating this
target-agnostic context for the where the target-specific error occurred
relative to the DM core's callouts makes sense.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
2013-11-09 18:20:21 -05:00
Milan Broz
ed04d98169 dm crypt: add TCW IV mode for old CBC TCRYPT containers
dm-crypt can already activate TCRYPT (TrueCrypt compatible) containers
in LRW or XTS block encryption mode.

TCRYPT containers prior to version 4.1 use CBC mode with some additional
tweaks, this patch adds support for these containers.

This new mode is implemented using special IV generator named TCW
(TrueCrypt IV with whitening).  TCW IV only supports containers that are
encrypted with one cipher (Tested with AES, Twofish, Serpent, CAST5 and
TripleDES).

While this mode is legacy and is known to be vulnerable to some
watermarking attacks (e.g. revealing of hidden disk existence) it can
still be useful to activate old containers without using 3rd party
software or for independent forensic analysis of such containers.

(Both the userspace and kernel code is an independent implementation
based on the format documentation and it completely avoids use of
original source code.)

The TCW IV generator uses two additional keys: Kw (whitening seed, size
is always 16 bytes - TCW_WHITENING_SIZE) and Kiv (IV seed, size is
always the IV size of the selected cipher).  These keys are concatenated
at the end of the main encryption key provided in mapping table.

While whitening is completely independent from IV, it is implemented
inside IV generator for simplification.

The whitening value is always 16 bytes long and is calculated per sector
from provided Kw as initial seed, xored with sector number and mixed
with CRC32 algorithm.  Resulting value is xored with ciphertext sector
content.

IV is calculated from the provided Kiv as initial IV seed and xored with
sector number.

Detailed calculation can be found in the Truecrypt documentation for
version < 4.1 and will also be described on dm-crypt site, see:
http://code.google.com/p/cryptsetup/wiki/DMCrypt

The experimental support for activation of these containers is already
present in git devel brach of cryptsetup.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:20 -05:00
Milan Broz
da31a0787a dm crypt: properly handle extra key string in initialization
Some encryption modes use extra keys (e.g. loopAES has IV seed) which
are not used in block cipher initialization but are part of key string
in table constructor.

This patch adds an additional field which describes the length of the
extra key(s) and substracts it before real key encryption setting.

The key_size always includes the size, in bytes, of the key provided
in mapping table.

The key_parts describes how many parts (usually keys) are contained in
the whole key buffer.  And key_extra_size contains size in bytes of
additional keys part (this number of bytes must be subtracted because it
is processed by the IV generator).

| K1 | K2 | .... | K64 |      Kiv       |
|----------- key_size ----------------- |
|                      |-key_extra_size-|
|     [64 keys]        |  [1 key]       | => key_parts = 65

Example where key string contains main key K, whitening key
Kw and IV seed Kiv:

|     K       |   Kiv   |       Kw      |
|--------------- key_size --------------|
|             |-----key_extra_size------|
|  [1 key]    | [1 key] |     [1 key]   | => key_parts = 3

Because key_extra_size is calculated during IV mode setting, key
initialization is moved after this step.

For now, this change has no effect to supported modes (thanks to ilog2
rounding) but it is required by the following patch.

Also, fix a sparse warning in crypt_iv_lmk_one().

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:20 -05:00
Heinz Mauelshagen
2c2263c93f dm cache: log error message if dm_kcopyd_copy() fails
A migration failure should be logged (albeit limited).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:19 -05:00
Heinz Mauelshagen
80f659f3f5 dm cache: use cell_defer() boolean argument consistently
Fix a few cell_defer() calls that weren't passing a bool.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:19 -05:00
Mikulas Patocka
4cb3e1db21 dm cache: return -EINVAL if the user specifies unknown cache policy
Return -EINVAL when the specified cache policy is unknown rather than
returning -ENOMEM.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:18 -05:00
Joe Thornber
dd8b0c2096 dm cache metadata: return bool from __superblock_all_zeroes
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:17 -05:00
Joe Thornber
0184b44e32 dm cache policy mq: a few small fixes
Rename takeout_queue to concat_queue.

Fix a harmless bug in mq policies pop() function.  Currently pop()
always succeeds, with up coming changes this wont be the case.

Fix typo in comment above pre_cache_to_cache prototype.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:17 -05:00
Joe Thornber
3351937e4a dm cache policy: remove return from void policy_remove_mapping
No need to return from a void function.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:16 -05:00
Joe Thornber
238f8363b6 dm cache: improve efficiency of quiescing flag management
Make the quiescing flag an atomic_t and stop protecting it with a spin
lock.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:19:59 -05:00
Joe Thornber
66cb1910df dm cache: fix a race condition between queuing new migrations and quiescing for a shutdown
The code that was trying to do this was inadequate.  The postsuspend
method (in ioctl context), needs to wait for the worker thread to
acknowledge the request to quiesce.  Otherwise the migration count may
drop to zero temporarily before the worker thread realises we're
quiescing.  In this case the target will be taken down, but the worker
thread may have issued a new migration, which will cause an oops when
it completes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
2013-11-09 17:55:50 -05:00
Joe Thornber
f8e5f01a32 dm cache: io destined for the cache device can now serve as tick bios
Previously only origin bios could trigger ticks, which meant if all
the io was destined for the cache no ticks were generated.  If no ticks
are generated then multiple hits, and movements in general, are
attributed to the same tick.

Only a stop gap fix, we need a better solution.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 17:55:49 -05:00
Joe Thornber
99ba2ae4cd dm cache policy mq: protect residency method with existing mutex
It is safe to use a mutex in mq_residency() at this point since it is
only called from ioctl context.  But future-proof mq_residency() by
using might_sleep() to catch new contexts that cannot sleep.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 17:54:34 -05:00
Kent Overstreet
6678d83f18 block: Consolidate duplicated bio_trim() implementations
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:02:31 -07:00
Linus Torvalds
0324e74534 Driver Core / sysfs patches for 3.13-rc1
Here's the big driver core / sysfs update for 3.13-rc1.
 
 There's lots of dev_groups updates for different subsystems, as they all
 get slowly migrated over to the safe versions of the attribute groups
 (removing userspace races with the creation of the sysfs files.)  Also
 in here are some kobject updates, devres expansions, and the first round
 of Tejun's sysfs reworking to enable it to be used by other subsystems
 as a backend for an in-kernel filesystem.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlJ6xAMACgkQMUfUDdst+yk1kQCfcHXhfnrvFZ5J/mDP509IzhNS
 ddEAoLEWoivtBppNsgrWqXpD1vi4UMsE
 =JmVW
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core / sysfs patches from Greg KH:
 "Here's the big driver core / sysfs update for 3.13-rc1.

  There's lots of dev_groups updates for different subsystems, as they
  all get slowly migrated over to the safe versions of the attribute
  groups (removing userspace races with the creation of the sysfs
  files.) Also in here are some kobject updates, devres expansions, and
  the first round of Tejun's sysfs reworking to enable it to be used by
  other subsystems as a backend for an in-kernel filesystem.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
  sysfs: rename sysfs_assoc_lock and explain what it's about
  sysfs: use generic_file_llseek() for sysfs_file_operations
  sysfs: return correct error code on unimplemented mmap()
  mdio_bus: convert bus code to use dev_groups
  device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
  sysfs: separate out dup filename warning into a separate function
  sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
  sysfs: remove unused sysfs_get_dentry() prototype
  sysfs: honor bin_attr.attr.ignore_lockdep
  sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
  devres: restore zeroing behavior of devres_alloc()
  sysfs: fix sysfs_write_file for bin file
  input: gameport: convert bus code to use dev_groups
  input: serio: remove bus usage of dev_attrs
  input: serio: use DEVICE_ATTR_RO()
  i2o: convert bus code to use dev_groups
  memstick: convert bus code to use dev_groups
  tifm: convert bus code to use dev_groups
  virtio: convert bus code to use dev_groups
  ipack: convert bus code to use dev_groups
  ...
2013-11-07 11:42:15 +09:00
Joe Thornber
9c1d4de560 dm array: fix bug in growing array
Entries would be lost if the old tail block was partially filled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
2013-11-05 11:20:50 -05:00
Hannes Reinecke
b63349a7a5 dm mpath: requeue I/O during pg_init
When pg_init is running no I/O can be submitted to the underlying
devices, as the path priority etc might change.  When using queue_io for
this, requests will be piling up within multipath as the block I/O
scheduler just sees a _very fast_ device.  All of this queued I/O has to
be resubmitted from within multipathing once pg_init is done.

This approach has the problem that it's virtually impossible to
abort I/O when pg_init is running, and we're adding heavy load
to the devices after pg_init since all of the queued I/O needs to be
resubmitted _before_ any requests can be pulled off of the request queue
and normal operation continues.

This patch will requeue the I/O that triggers the pg_init call, and
return 'busy' when pg_init is in progress.  With these changes the block
I/O scheduler will stop submitting I/O during pg_init, resulting in a
quicker path switch and less I/O pressure (and memory consumption) after
pg_init.

Signed-off-by: Hannes Reinecke <hare@suse.de>
[patch header edited for clarity and typos by Mike Snitzer]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-05 11:20:34 -05:00
Shiva Krishna Merla
954a73d5d3 dm mpath: fix race condition between multipath_dtr and pg_init_done
Whenever multipath_dtr() is happening we must prevent queueing any
further path activation work.  Implement this by adding a new
'pg_init_disabled' flag to the multipath structure that denotes future
path activation work should be skipped if it is set.  By disabling
pg_init and then re-enabling in flush_multipath_work() we also avoid the
potential for pg_init to be initiated while suspending an mpath device.

Without this patch a race condition exists that may result in a kernel
panic:

1) If after pg_init_done() decrements pg_init_in_progress to 0, a call
   to wait_for_pg_init_completion() assumes there are no more pending path
   management commands.
2) If pg_init_required is set by pg_init_done(), due to retryable
   mode_select errors, then process_queued_ios() will again queue the
   path activation work.
3) If free_multipath() completes before activate_path() work is called a
   NULL pointer dereference like the following can be seen when
   accessing members of the recently destructed multipath:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
RIP: 0010:[<ffffffffa003db1b>]  [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath]
[<ffffffff81090ac0>] worker_thread+0x170/0x2a0
[<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40

[switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer]
Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com>
Tested-by: Speagle Andy <Andy.Speagle@netapp.com>
Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-10-31 21:39:47 -04:00
Mikulas Patocka
f36afb3957 dm: allocate buffer for messages with small number of arguments using GFP_NOIO
dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.

On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.

The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-10-31 13:55:45 -04:00
Shaohua Li
d47648fcf0 raid5: avoid finding "discard" stripe
SCSI discard will damage discard stripe bio setting, eg, some fields are
changed. If the stripe is reused very soon, we have wrong bios setting. We
remove discard stripe from hash list, so next time the strip will be fully
initialized.

Suitable for backport to 3.7+.

Cc: <stable@vger.kernel.org> (3.7+)
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24 13:00:24 +11:00
Shaohua Li
37c61ff31e raid5: set bio bi_vcnt 0 for discard request
SCSI layer will add new payload for discard request. If two bios are merged
to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse
SCSI and cause oops.

Suitable for backport to 3.7+

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
2013-10-24 12:57:36 +11:00
Bian Yu
905b0297a9 md: avoid deadlock when md_set_badblocks.
When operate harddisk and hit errors, md_set_badblocks is called after
scsi_restart_operations which already disabled the irq. but md_set_badblocks
will call write_sequnlock_irq and enable irq. so softirq can preempt the
current thread and that may cause a deadlock. I think this situation should
use write_sequnlock_irqsave/irqrestore instead.

I met the situation and the call trace is below:
[  638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010
[  638.921923]  lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0
[  638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37
[  638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013
[  638.927816]  ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007
[  638.929829]  ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8
[  638.931848]  ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8
[  638.933884] Call Trace:
[  638.935867]  <IRQ>  [<ffffffff8172ff85>] dump_stack+0x55/0x76
[  638.937878]  [<ffffffff81730030>] spin_dump+0x8a/0x8f
[  638.939861]  [<ffffffff81730056>] spin_bug+0x21/0x26
[  638.941836]  [<ffffffff81336de4>] do_raw_spin_lock+0xa4/0xc0
[  638.943801]  [<ffffffff8173f036>] _raw_spin_lock+0x66/0x80
[  638.945747]  [<ffffffff814a73ed>] ? scsi_device_unbusy+0x9d/0xd0
[  638.947672]  [<ffffffff8173fb1b>] ? _raw_spin_unlock+0x2b/0x50
[  638.949595]  [<ffffffff814a73ed>] scsi_device_unbusy+0x9d/0xd0
[  638.951504]  [<ffffffff8149ec47>] scsi_finish_command+0x37/0xe0
[  638.953388]  [<ffffffff814a75e8>] scsi_softirq_done+0xa8/0x140
[  638.955248]  [<ffffffff8130e32b>] blk_done_softirq+0x7b/0x90
[  638.957116]  [<ffffffff8104fddd>] __do_softirq+0xfd/0x330
[  638.958987]  [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[  638.960861]  [<ffffffff8174a5cc>] call_softirq+0x1c/0x30
[  638.962724]  [<ffffffff81004c7d>] do_softirq+0x8d/0xc0
[  638.964565]  [<ffffffff8105024e>] irq_exit+0x10e/0x150
[  638.966390]  [<ffffffff8174ad4a>] smp_apic_timer_interrupt+0x4a/0x60
[  638.968223]  [<ffffffff817499af>] apic_timer_interrupt+0x6f/0x80
[  638.970079]  <EOI>  [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[  638.971899]  [<ffffffff8173fa6a>] ? _raw_spin_unlock_irq+0x3a/0x50
[  638.973691]  [<ffffffff8173fa60>] ? _raw_spin_unlock_irq+0x30/0x50
[  638.975475]  [<ffffffff81562393>] md_set_badblocks+0x1f3/0x4a0
[  638.977243]  [<ffffffff81566e07>] rdev_set_badblocks+0x27/0x80
[  638.978988]  [<ffffffffa00d97bb>] raid5_end_read_request+0x36b/0x4e0 [raid456]
[  638.980723]  [<ffffffff811b5a1d>] bio_endio+0x1d/0x40
[  638.982463]  [<ffffffff81304ff3>] req_bio_endio.isra.65+0x83/0xa0
[  638.984214]  [<ffffffff81306b9f>] blk_update_request+0x7f/0x350
[  638.985967]  [<ffffffff81306ea1>] blk_update_bidi_request+0x31/0x90
[  638.987710]  [<ffffffff813085e0>] __blk_end_bidi_request+0x20/0x50
[  638.989439]  [<ffffffff8130862f>] __blk_end_request_all+0x1f/0x30
[  638.991149]  [<ffffffff81308746>] blk_peek_request+0x106/0x250
[  638.992861]  [<ffffffff814a62a9>] ? scsi_kill_request.isra.32+0xe9/0x130
[  638.994561]  [<ffffffff814a633a>] scsi_request_fn+0x4a/0x3d0
[  638.996251]  [<ffffffff813040a7>] __blk_run_queue+0x37/0x50
[  638.997900]  [<ffffffff813045af>] blk_run_queue+0x2f/0x50
[  638.999553]  [<ffffffff814a5750>] scsi_run_queue+0xe0/0x1c0
[  639.001185]  [<ffffffff814a7721>] scsi_run_host_queues+0x21/0x40
[  639.002798]  [<ffffffff814a2e87>] scsi_restart_operations+0x177/0x200
[  639.004391]  [<ffffffff814a4fe9>] scsi_error_handler+0xc9/0xe0
[  639.005996]  [<ffffffff814a4f20>] ? scsi_unjam_host+0xd0/0xd0
[  639.007600]  [<ffffffff81072f6b>] kthread+0xdb/0xe0
[  639.009205]  [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170
[  639.010821]  [<ffffffff81748cac>] ret_from_fork+0x7c/0xb0
[  639.012437]  [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170

This bug was introduce in commit  2e8ac30312
(the first time rdev_set_badblock was call from interrupt context),
so this patch is appropriate for 3.5 and subsequent kernels.

Cc: <stable@vger.kernel.org> (3.5+)
Signed-off-by: Bian Yu <bianyu@kedacom.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24 12:57:11 +11:00
Lukasz Dorau
61e4947c99 md: Fix skipping recovery for read-only arrays.
Since:
        commit 7ceb17e87b
        md: Allow devices to be re-added to a read-only array.

spares are activated on a read-only array. In case of raid1 and raid10
personalities it causes that not-in-sync devices are marked in-sync
without checking if recovery has been finished.

If a read-only array is degraded and one of its devices is not in-sync
(because the array has been only partially recovered) recovery will be skipped.

This patch adds checking if recovery has been finished before marking a device
in-sync for raid1 and raid10 personalities. In case of raid5 personality
such condition is already present (at raid5.c:6029).

Bug was introduced in 3.10 and causes data corruption.

Cc: stable@vger.kernel.org
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24 12:55:17 +11:00
Kent Overstreet
d4eddd42f5 bcache: Fixed incorrect order of arguments to bio_alloc_bioset()
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-23 07:55:36 +01:00
Greg Kroah-Hartman
a7204d72db Merge 3.12-rc6 into driver-core-next
We want these fixes here too.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-19 13:05:38 -07:00
Mikulas Patocka
e9c6a18264 dm snapshot: fix data corruption
This patch fixes a particular type of data corruption that has been
encountered when loading a snapshot's metadata from disk.

When we allocate a new chunk in persistent_prepare, we increment
ps->next_free and we make sure that it doesn't point to a metadata area
by further incrementing it if necessary.

When we load metadata from disk on device activation, ps->next_free is
positioned after the last used data chunk. However, if this last used
data chunk is followed by a metadata area, ps->next_free is positioned
erroneously to the metadata area. A newly-allocated chunk is placed at
the same location as the metadata area, resulting in data or metadata
corruption.

This patch changes the code so that ps->next_free skips the metadata
area when metadata are loaded in function read_exceptions.

The patch also moves a piece of code from persistent_prepare_exception
to a separate function skip_metadata to avoid code duplication.

CVE-2013-4299

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-10-16 03:17:47 +01:00
Michael Opdenacker
aa5e5dc2a8 treewide: fix "distingush" typo
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-10-14 15:38:33 +02:00
Kent Overstreet
2fe80d3bbf bcache: Fix a null ptr deref regression
Commit c0f04d88e4 ("bcache: Fix flushes in writeback mode") was fixing
a reported data corruption bug, but it seems some last minute
refactoring or rebasing introduced a null pointer deref.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-10 18:17:39 -07:00
Greg Kroah-Hartman
88502b9c0a Merge 3.12-rc3 into driver-core-next
We want the driver core and sysfs fixes in here to make merges and
development easier.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-29 18:29:23 -07:00
Tejun Heo
388975ccca sysfs: clean up sysfs_get_dirent()
The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention.  For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.

This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns().  This makes confusions a lot less
likely.

There are other interfaces which take @ns before @name.  They'll be
updated by following patches.

This patch doesn't introduce any functional changes.

v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
    error on module builds.  Reported by build test robot.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:33:18 -07:00
Linus Torvalds
e93dd910b9 A set of device-mapper fixes for 3.12.
A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error
 handling fixes for dm-multipath.  A fix for the thin provisioning target
 to not expose non-zero discard limits if discards are disabled.
 
 Lastly, add two DM module parameters which allow users to tune the
 emergency memory reserves that DM mainatins per device -- this helps fix
 a long-standing issue for dm-multipath.  The conservative default
 reserve for request-based dm-multipath devices (256) has proven
 problematic for users with many multipathed SCSI devices but relatively
 little memory.  To responsibly select a smaller value users should use
 the new nr_bios tracepoint info (via commit 75afb352 "block: Add nr_bios
 to block_rq_remap tracepoint") to determine the peak number of bios
 their workloads create.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSQMVHAAoJEMUj8QotnQNaOXgIAJS6/XJKMoHfiDJ9M+XD34rZ
 Uyr9TEnubX3DKCRBiY23MUcCQn3fx6BjCGv5/c8L4jQFIuLyDi2yatqpwXcbGSJh
 G/S/y6u0Axek+ew7TS80OFop4nblW6MoKnoh9/4N55Ofa+1WvKM4ERUGjHGbauyS
 TxmLQPToCFPLYRIOZ+imd6hQuIZ1+FFdJFvi7kY9O6Llx2sLD6fWi1iruBd/Da2H
 ByMX3biGN45mSpcBzRbSC/FkJ9CRIvT9n82BDPS0o3Tllt8NaVlEDaovB7h4ncc0
 bFuT2Z3Q38B9uZ8Lj0bqdGzv3kXMLCkLo6WhWjyUt84hmDPAzRpBwt60jUqWyZs=
 =bjVp
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper fixes from Mike Snitzer:
 "A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error
  handling fixes for dm-multipath.  A fix for the thin provisioning
  target to not expose non-zero discard limits if discards are disabled.

  Lastly, add two DM module parameters which allow users to tune the
  emergency memory reserves that DM mainatins per device -- this helps
  fix a long-standing issue for dm-multipath.  The conservative default
  reserve for request-based dm-multipath devices (256) has proven
  problematic for users with many multipathed SCSI devices but
  relatively little memory.  To responsibly select a smaller value users
  should use the new nr_bios tracepoint info (via commit 75afb352
  "block: Add nr_bios to block_rq_remap tracepoint") to determine the
  peak number of bios their workloads create"

* tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: add reserved_bio_based_ios module parameter
  dm: add reserved_rq_based_ios module parameter
  dm: lower bio-based mempool reservation
  dm thin: do not expose non-zero discard limits if discards disabled
  dm mpath: disable WRITE SAME if it fails
  dm-snapshot: fix performance degradation due to small hash size
  dm snapshot: workaround for a false positive lockdep warning
  dm stats: fix possible counter corruption on 32-bit systems
  dm mpath: do not fail path on -ENOSPC
2013-09-25 15:12:46 -07:00
Kent Overstreet
c0f04d88e4 bcache: Fix flushes in writeback mode
In writeback mode, when we get a cache flush we need to make sure we
issue a flush to the backing device.

The code for sending down an extra flush was wrong - by cloning the bio
we were probably getting flags that didn't make sense for a bare flush,
and also the old code was firing for FUA bios, for which we don't need
to send a flush to the backing device.

This was causing data corruption somehow - the mechanism was never
determined, but this patch fixes it for the users that were seeing it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
84786438ed bcache: Fix for handling overlapping extents when reading in a btree node
btree_sort_fixup() was overly clever, because it was trying to avoid
pulling a key off the btree iterator in more than one place.

This led to a really obscure bug where we'd break early from the loop in
btree_sort_fixup() if the current key overlapped with keys in more than
one older set, and the next key it overlapped with was zero size.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
a698e08c82 bcache: Fix a shrinker deadlock
GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
79e3dab90d bcache: Fix a dumb CPU spinning bug in writeback
schedule_timeout() != schedule_timeout_uninterruptible()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
1394d6761b bcache: Fix a flush/fua performance bug
bch_journal_meta() was missing the flush to make the journal write
actually go down (instead of waiting up to journal_delay_ms)...

Whoops

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
c2a4f3183a bcache: Fix a writeback performance regression
Background writeback works by scanning the btree for dirty data and
adding those keys into a fixed size buffer, then for each dirty key in
the keybuf writing it to the backing device.

When read_dirty() finishes and it's time to scan for more dirty data, we
need to wait for the outstanding writeback IO to finish - they still
take up slots in the keybuf (so that foreground writes can check for
them to avoid races) - without that wait, we'll continually rescan when
we'll be able to add at most a key or two to the keybuf, and that takes
locks that starves foreground IO.  Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Geert Uytterhoeven
61cbd250f8 bcache: Correct printf()-style format length modifier
Fix

  drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
  drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
c426c4fd46 bcache: Fix for when no journal entries are found
The journal replay code didn't handle this case, causing it to go into
an infinite loop...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Gabriel de Perthuis
aee6f1cfff bcache: Strip endline when writing the label through sysfs
sysfs attributes with unusual characters have crappy failure modes
in Squeeze (udev 164); later versions of udev are unaffected.

This should make these characters more unusual.

Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
6d9d21e35f bcache: Fix a dumb journal discard bug
That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Mike Snitzer
e8603136cb dm: add reserved_bio_based_ios module parameter
Allow user to change the number of IOs that are reserved by
bio-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_bio_based_ios

The default value is RESERVED_BIO_BASED_IOS (16).  The maximum allowed
value is RESERVED_MAX_IOS (1024).

Export dm_get_reserved_bio_based_ios() for use by DM targets and core
code.  Switch to sizing dm-io's mempool and bioset using DM core's
configurable 'reserved_bio_based_ios'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
2013-09-23 10:42:24 -04:00
Mike Snitzer
f47908269f dm: add reserved_rq_based_ios module parameter
Allow user to change the number of IOs that are reserved by
request-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_rq_based_ios

The default value is RESERVED_REQUEST_BASED_IOS (256).  The maximum
allowed value is RESERVED_MAX_IOS (1024).

Export dm_get_reserved_rq_based_ios() for use by DM targets and core
code.  Switch to sizing dm-mpath's mempool using DM core's configurable
'reserved_rq_based_ios'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
2013-09-23 10:42:24 -04:00
Mike Snitzer
6cfa58573f dm: lower bio-based mempool reservation
Bio-based device mapper processing doesn't need larger mempools (like
request-based DM does), so lower the number of reserved entries for
bio-based operation.  16 was already used for bio-based DM's bioset
but mistakenly wasn't used for it's _io_cache.

Formalize difference between bio-based and request-based defaults by
introducing RESERVED_BIO_BASED_IOS and RESERVED_REQUEST_BASED_IOS.

(based on older code from Mikulas Patocka)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
2013-09-23 10:42:23 -04:00
Mike Snitzer
b60ab990cc dm thin: do not expose non-zero discard limits if discards disabled
Fix issue where the block layer would stack the discard limits of the
pool's data device even if the "ignore_discard" pool feature was
specified.

The pool and thin device(s) still had discards disabled because the
QUEUE_FLAG_DISCARD request_queue flag wasn't set.  But to avoid user
confusion when "ignore_discard" is used: both the pool device and the
thin device(s) have zeroes for all discard limits.

Also, always set discard_zeroes_data_unsupported in targets because they
should never advertise the 'discard_zeroes_data' capability (even if the
pool's data device supports it).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-09-23 10:42:06 -04:00
Mike Snitzer
f84cb8a46a dm mpath: disable WRITE SAME if it fails
Workaround the SCSI layer's problematic WRITE SAME heuristics by
disabling WRITE SAME in the DM multipath device's queue_limits if an
underlying device disabled it.

The WRITE SAME heuristics, with both the original commit 5db44863b6
("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
WRITE SAME(10) even without successfully determining it is supported.
After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
for the device (by setting sdkp->device->no_write_same which results in
'max_write_same_sectors' in device's queue_limits to be set to 0).

When a device is stacked ontop of such a SCSI device any changes to that
SCSI device's queue_limits do not automatically propagate up the stack.
As such, a DM multipath device will not have its WRITE SAME support
disabled.  This causes the block layer to continue to issue WRITE SAME
requests to the mpath device which causes paths to fail and (if mpath IO
isn't configured to queue when no paths are available) it will result in
actual IO errors to the upper layers.

This fix doesn't help configurations that have additional devices
stacked ontop of the mpath device (e.g. LVM created linear DM devices
ontop).  A proper fix that restacks all the queue_limits from the bottom
of the device stack up will need to be explored if SCSI will continue to
use this model of optimistically allowing op codes and then disabling
them after they fail for the first time.

Before this patch:

EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
device-mapper: multipath: Failing path 8:112.
end_request: I/O error, dev dm-6, sector 4616
dm-6: WRITE SAME failed. Manually zeroing.
end_request: I/O error, dev dm-6, sector 4616
end_request: I/O error, dev dm-6, sector 5640
end_request: I/O error, dev dm-6, sector 6664
end_request: I/O error, dev dm-6, sector 7688
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
end_request: I/O error, dev dm-6, sector 524296
Aborting journal on device dm-6-8.
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.

# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
33553920

After this patch:

EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.

# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
0

It should be noted that WRITE SAME support wasn't enabled in DM
multipath until v3.10.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: stable@vger.kernel.org # 3.10+
2013-09-20 10:36:34 -04:00
Mikulas Patocka
60e356f381 dm-snapshot: fix performance degradation due to small hash size
LVM2, since version 2.02.96, creates origin with zero size, then loads
the snapshot driver and then loads the origin.  Consequently, the
snapshot driver sees the origin size zero and sets the hash size to the
lower bound 64.  Such small hash table causes performance degradation.

This patch changes it so that the hash size is determined by the size of
snapshot volume, not minimum of origin and snapshot size.  It doesn't
make sense to set the snapshot size significantly larger than the origin
size, so we do not need to take origin size into account when
calculating the hash size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-09-20 10:36:34 -04:00
Mikulas Patocka
5ea330a75b dm snapshot: workaround for a false positive lockdep warning
The kernel reports a lockdep warning if a snapshot is invalidated because
it runs out of space.

The lockdep warning was triggered by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()") in v3.5.

The warning is false positive.  The real cause for the warning is that
the lockdep engine treats different instances of md->lock as a single
lock.

This patch is a workaround - we use flush_workqueue instead of flush_work.
This code path is not performance sensitive (it is called only on
initialization or invalidation), thus it doesn't matter that we flush the
whole workqueue.

The real fix for the problem would be to teach the lockdep engine to treat
different instances of md->lock as separate locks.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.5+
2013-09-20 10:36:34 -04:00
Mikulas Patocka
bbf3f8cbdc dm stats: fix possible counter corruption on 32-bit systems
There was a deliberate race condition in dm_stat_for_entry() to avoid the
overhead of disabling and enabling interrupts.  The race could result in
some events not being counted on 64-bit architectures.

However, on 32-bit architectures, operations on long long variables are
not atomic, so the race condition could cause the counter to jump by 2^32.
Such jumps could be disruptive, so we need to do proper locking on 32-bit
architectures.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Alasdair G. Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-09-18 14:41:06 -04:00
Jun'ichi Nomura
cc9d3c382b dm mpath: do not fail path on -ENOSPC
Since ENOSPC is a target-side error, dm-mpath should just pass the error
information to upper layer instead of retrying itself with path failover.
Otherwise it will end up failing all paths down while path checkers find
all paths ok.

ENOSPC can now be returned from SCSI device after commit a9d6ceb8
("[SCSI] return ENOSPC on thin provisioning failure").

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-09-18 14:41:06 -04:00
Linus Torvalds
26935fb06e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 4 from Al Viro:
 "list_lru pile, mostly"

This came out of Andrew's pile, Al ended up doing the merge work so that
Andrew didn't have to.

Additionally, a few fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
  super: fix for destroy lrus
  list_lru: dynamically adjust node arrays
  shrinker: Kill old ->shrink API.
  shrinker: convert remaining shrinkers to count/scan API
  staging/lustre/libcfs: cleanup linux-mem.h
  staging/lustre/ptlrpc: convert to new shrinker API
  staging/lustre/obdclass: convert lu_object shrinker to count/scan API
  staging/lustre/ldlm: convert to shrinkers to count/scan API
  hugepage: convert huge zero page shrinker to new shrinker API
  i915: bail out earlier when shrinker cannot acquire mutex
  drivers: convert shrinkers to new count/scan API
  fs: convert fs shrinkers to new scan/count API
  xfs: fix dquot isolation hang
  xfs-convert-dquot-cache-lru-to-list_lru-fix
  xfs: convert dquot cache lru to list_lru
  xfs: rework buffer dispose list tracking
  xfs-convert-buftarg-lru-to-generic-code-fix
  xfs: convert buftarg LRU to generic code
  fs: convert inode and dentry shrinking to be node aware
  vmscan: per-node deferred work
  ...
2013-09-12 15:01:38 -07:00
Dave Chinner
7dc19d5aff drivers: convert shrinkers to new count/scan API
Convert the driver shrinkers to the new API.  Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.

FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out.  The amount of broken code I just encountered is
mind boggling.  I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, burying in mud up to it's neck and then run over repeatedly
with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers.  They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth.  And that doesn't even take into account the horrible,
broken code...

[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:32 -04:00
Linus Torvalds
7426d62871 Add the ability to collect I/O statistics on user-defined regions of a
device-mapper device.  This dm-stats code required the reintroduction of
 a div64_u64_rem() helper, but as a separate method that doesn't slow
 down div64_u64() -- especially on 32-bit systems.
 
 Allow the error target to replace request-based DM devices
 (e.g. multipath) in addition to bio-based DM devices.
 
 Various other small code fixes and improvements to thin-provisioning, DM
 cache and the DM ioctl interface.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQEcBAABAgAGBQJSLyNnAAoJEMUj8QotnQNaXVEIAKA1l43enaGiROBZEZXgAGUY
 1JUsnHES4ujyn/jtT39jPTQf9AW/rS4FUCrZiXG2aaNHXo7+7cdVoBHAiWc7mXad
 budBSqn47W7WDyFlQarKwsuYFcdLnqdnieRDMXQ1cN5dl4Rx61LclnsylQd4SSS0
 lznXkfOTquetDSuEPOuUHJDZufdacw3PpxWbTKGJld40fd7YZfGWQoG0ek1OeqqL
 fA30DTlYnkFyhheLCjFcDY6H55Rt7QpBWOUAa2XXLR6GLfk5iFK99autjWk2xTPT
 nppRwQrw9VH+HdW0jGLU+LRs1Y3nxwT9OBLWt9wav87Smdg/7jQAjwde9eKbO2k=
 =3ooH
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper updates from Mike Snitzer:
 "Add the ability to collect I/O statistics on user-defined regions of a
  device-mapper device.  This dm-stats code required the reintroduction
  of a div64_u64_rem() helper, but as a separate method that doesn't
  slow down div64_u64() -- especially on 32-bit systems.

  Allow the error target to replace request-based DM devices (e.g.
  multipath) in addition to bio-based DM devices.

  Various other small code fixes and improvements to thin-provisioning,
  DM cache and the DM ioctl interface"

* tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm stripe: silence a couple sparse warnings
  dm: add statistics support
  dm thin: always return -ENOSPC if no_free_space is set
  dm ioctl: cleanup error handling in table_load
  dm ioctl: increase granularity of type_lock when loading table
  dm ioctl: prevent rename to empty name or uuid
  dm thin: set pool read-only if breaking_sharing fails block allocation
  dm thin: prefix pool error messages with pool device name
  dm: allow error target to replace bio-based and request-based targets
  math64: New separate div64_u64_rem helper
  dm space map: optimise sm_ll_dec and sm_ll_inc
  dm btree: prefetch child nodes when walking tree for a dm_btree_del
  dm btree: use pop_frame in dm_btree_del to cleanup code
  dm cache: eliminate holes in cache structure
  dm cache: fix stacking of geometry limits
  dm thin: fix stacking of geometry limits
  dm thin: add data block size limits to Documentation
  dm cache: add data block size limits to code and Documentation
  dm cache: document metadata device is exclussive to a cache
  dm: stop using WQ_NON_REENTRANT
2013-09-10 13:06:15 -07:00
Linus Torvalds
4d7696f1b0 md update for v3.12
Headline item is multithreading for RAID5 so that more
 IO/sec can be supported on fast (SSD) devices.
 Also TILE-Gx SIMD suppor for RAID6 calculations and an
 assortment of bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUi6dRTnsnt1WYoG5AQIqMBAAm/XUEqfyBNUiTPHmIU/OyReOlfsp8A2o
 xtcmSzaCtIUz4btPszUrw3PShqnk+lXXX2AB0rp3PzfOgyNYXBRKbzOf3eGr2VEp
 L/Cm0iSWHqQ7V7MoV5ZrqtvuyJV1a7FK3a3VaoKaUk424o4sZ7P67t/YZAnTCP/i
 9wQoPeIOJ8YjZsaAQjzI3q7yRMRE8ytyBnF4NdgeMyr2p17w2e9pnmNfCTo4wnWs
 Nu2wPr2QCPQXr/FoIhdIVEy3kVatqH8qXG8Fw+5n07HJYxGCvQZLDuoOVDYyFeoW
 gnNq2MMgLZm/7Nzqd1bN+QQZuBCd5JL4VJ2G4vLfYrn3ZSdSysrVKQXFKYG3Gkua
 1KP4Pv0hndAl4DtGbUk8CiZp6b+c5qeWvq+sO2NuhUGmumFMK2q4DJhITNexjmrs
 Eg4opnR8JMLDkYD6o52Ziu5KQR/q1PKRLj80eoVuqB2QQM5+NPb4s3k2WN+53lQD
 L9fH2alUxxSK+5R8ykk923QQ/XErMUwXaka+O/gGFAlYvaaW/GKTxFnKn/GIXAkc
 tKW88zB+zA5EZEFec+K43z1UjtGxMWsryvDN55ON2iV+LIZBISm7krroBeR55cyO
 +3tHlPsga0pO+9DdSm7hvZeWRrq5ZJTiZmL/e2FYygrC5tFAY0p+z49fK3e9Th13
 C85G7fg3yDY=
 =zLxh
 -----END PGP SIGNATURE-----

Merge tag 'md/3.12' of git://neil.brown.name/md

Pull md update from Neil Brown:
 "Headline item is multithreading for RAID5 so that more IO/sec can be
  supported on fast (SSD) devices.  Also TILE-Gx SIMD suppor for RAID6
  calculations and an assortment of bug fixes"

* tag 'md/3.12' of git://neil.brown.name/md:
  raid5: only wakeup necessary threads
  md/raid5: flush out all pending requests before proceeding with reshape.
  md/raid5: use seqcount to protect access to shape in make_request.
  raid5: sysfs entry to control worker thread number
  raid5: offload stripe handle to workqueue
  raid5: fix stripe release order
  raid5: make release_stripe lockless
  md: avoid deadlock when dirty buffers during md_stop.
  md: Don't test all of mddev->flags at once.
  md: Fix apparent cut-and-paste error in super_90_validate
  raid6/test: replace echo -e with printf
  RAID: add tilegx SIMD implementation of raid6
  md: fix safe_mode buglet.
  md: don't call md_allow_write in get_bitmap_file.
2013-09-10 13:03:41 -07:00
Mike Snitzer
7fff5e8f72 dm stripe: silence a couple sparse warnings
Eliminate the following sparse warnings:
drivers/md/dm-stripe.c:443:12: warning: symbol 'dm_stripe_init' was not declared. Should it be static?
drivers/md/dm-stripe.c:456:6: warning: symbol 'dm_stripe_exit' was not declared. Should it be static?

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-06 11:36:01 -04:00
Mikulas Patocka
fd2ed4d252 dm: add statistics support
Support the collection of I/O statistics on user-defined regions of
a DM device.  If no regions are defined no statistics are collected so
there isn't any performance impact.  Only bio-based DM devices are
currently supported.

Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.

The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds.  All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.

The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space.  At most, 1/4 of the overall system
memory may be allocated by DM statistics.  The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes

See Documentation/device-mapper/statistics.txt for more details.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-05 20:46:06 -04:00
Mike Snitzer
94563badaf dm thin: always return -ENOSPC if no_free_space is set
If pool has 'no_free_space' set it means a previous allocation already
determined the pool has no free space (and failed that allocation with
-ENOSPC).  By always returning -ENOSPC if 'no_free_space' is set, we do
not allow the pool to oscillate between allocating blocks and then not.

But a side-effect of this determinism is that if a user wants to be able
to allocate new blocks they'll need to reload the pool's table (to clear
the 'no_free_space' flag).  This reload will happen automatically if the
pool's data volume is resized.  But if the user takes action to free a
lot of space by deleting snapshot volumes, etc the pool will no longer
allow data allocations to continue without an intervening table reload.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-05 20:46:06 -04:00
Mike Snitzer
f11c1c5693 dm ioctl: cleanup error handling in table_load
Make use of common cleanup code.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-05 20:46:06 -04:00
Mike Snitzer
00c4fc3b1f dm ioctl: increase granularity of type_lock when loading table
Hold the mapped device's type_lock before calling populate_table() since
it is where the table's type is determined based on the specified
targets.  There is no need to allow concurrent table loads to race to
establish the table's targets or type.

This eliminates the need to grab the lock in dm_table_set_type().

Also verify that the type_lock is held in both dm_set_md_type() and
dm_get_md_type().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-05 20:46:06 -04:00
Alasdair Kergon
c2b0482462 dm ioctl: prevent rename to empty name or uuid
A device-mapper device must always have a name consisting of a non-empty
string.  If the device also has a uuid, this similarly must not be an
empty string.

The DM_DEV_CREATE ioctl enforces these rules when the device is created,
but this patch is needed to enforce them when DM_DEV_RENAME is used to
change the name or uuid.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
2013-09-05 20:46:06 -04:00
Mike Snitzer
d6fc204201 dm thin: set pool read-only if breaking_sharing fails block allocation
break_sharing() now handles an arbitrary alloc_data_block() error
the same way as provision_block(): marks pool read-only and errors the
cell.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-05 20:46:06 -04:00
Mike Snitzer
4fa5971a69 dm thin: prefix pool error messages with pool device name
Useful to know which pool is experiencing the error.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-05 20:46:05 -04:00
Mike Snitzer
169e2cc279 dm: allow error target to replace bio-based and request-based targets
It may be useful to switch a request-based table to the "error" target.
Enhance the DM core to allow a hybrid target_type which is capable of
handling either bios (via .map) or requests (via .map_rq).

Add a request-based map function (.map_rq) to the "error" target_type;
making it DM's first hybrid target.  Train dm_table_set_type() to prefer
the mapped device's established type (request-based or bio-based).  If
the mapped device doesn't have an established type default to making the
table with the hybrid target(s) bio-based.

Tested 'dmsetup wipe_table' to work on both bio-based and request-based
devices.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-05 20:46:05 -04:00
Linus Torvalds
f66c83d059 SCSI misc on 20130903
This patch set is a set of driver updates (ufs, zfcp, lpfc, mpt2/3sas,
 qla4xxx, qla2xxx [adding support for ISP8044 + other things]) we also have a
 new driver: esas2r which has a number of static checker problems, but which I
 expect to resolve over the -rc course of 3.12 under the new driver exception.
 We also have the error return updates that were discussed at LSF.
 
 Signed-off-by: James Bottomley <JBottomley@Parallels.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJSJfX5AAoJEDeqqVYsXL0M8u8H+gN65iA4YeNc3Eq9F6mliLfg
 JOIfn6GRz7ChbQ1ZZKdH/5xCOtzXphrkg7kRGmr9frsvYZ4X2c7W3xweQTA08gqP
 wPH7/xyPffPnUm/r+V+SV41pm39bEjmltknLwiF572a6iOoVYQpnmDjdZQKT0jU0
 QZEqI81+646m8edCnApLw3Tlsn2gBwHaDrkd55H2IQGTkOD016C0CQbM+cNMU440
 qdqDcfRWCsp1fhLo3JH2kWTx8BihhyfEYAFz4tZwuFdGGkRZxF20HwyzV0h3hZOG
 kZ2Gd1BFf0SybxOcESQmAukbcH5hyumX1Y7HMYKZbS2ubD4MCO1MO8UUtLXlxNc=
 =PDBQ
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull first round of SCSI updates from James Bottomley:
 "This patch set is a set of driver updates (ufs, zfcp, lpfc, mpt2/3sas,
  qla4xxx, qla2xxx [adding support for ISP8044 + other things]).

  We also have a new driver: esas2r which has a number of static checker
  problems, but which I expect to resolve over the -rc course of 3.12
  under the new driver exception.

  We also have the error return that were discussed at LSF"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (118 commits)
  [SCSI] sg: push file descriptor list locking down to per-device locking
  [SCSI] sg: checking sdp->detached isn't protected when open
  [SCSI] sg: no need sg_open_exclusive_lock
  [SCSI] sg: use rwsem to solve race during exclusive open
  [SCSI] scsi_debug: fix logical block provisioning support when unmap_alignment != 0
  [SCSI] scsi_debug: fix endianness bug in sdebug_build_parts()
  [SCSI] qla2xxx: Update the driver version to 8.06.00.08-k.
  [SCSI] qla2xxx: print MAC via %pMR.
  [SCSI] qla2xxx: Correction to message ids.
  [SCSI] qla2xxx: Correctly print out/in mailbox registers.
  [SCSI] qla2xxx: Add a new interface to update versions.
  [SCSI] qla2xxx: Move queue depth ramp down message to i/o debug level.
  [SCSI] qla2xxx: Select link initialization option bits from current operating mode.
  [SCSI] qla2xxx: Add loopback IDC-TIME-EXTEND aen handling support.
  [SCSI] qla2xxx: Set default critical temperature value in cases when ISPFX00 firmware doesn't provide it
  [SCSI] qla2xxx: QLAFX00 make over temperature AEN handling informational, add log for normal temperature AEN
  [SCSI] qla2xxx: Correct Interrupt Register offset for ISPFX00
  [SCSI] qla2xxx: Remove handling of Shutdown Requested AEN from qlafx00_process_aen().
  [SCSI] qla2xxx: Send all AENs for ISPFx00 to above layers.
  [SCSI] qla2xxx: Add changes in initialization for ISPFX00 cards with BIOS
  ...
2013-09-03 15:48:06 -07:00
Shaohua Li
bfc90cb093 raid5: only wakeup necessary threads
If there are not enough stripes to handle, we'd better not always
queue all available work_structs. If one worker can only handle small
or even none stripes, it will impact request merge and create lock
contention.

With this patch, the number of work_struct running will depend on
pending stripes number. Note: some statistics info used in the patch
are accessed without locking protection. This should doesn't matter,
we just try best to avoid queue unnecessary work_struct.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-09-02 10:31:29 +10:00
NeilBrown
4d77e3ba88 md/raid5: flush out all pending requests before proceeding with reshape.
Some requests - particularly 'discard' and 'read' are handled
differently depending on whether a reshape is active or not.

It is harmless to assume reshape is active if it isn't but wrong
to act as though reshape is not active when it is.

So when we start reshape - after making clear to all requests that
reshape has started - use mddev_suspend/mddev_resume to flush out all
requests.  This will ensure that no requests will be assuming the
absence of reshape once it really starts.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 16:58:44 +10:00
NeilBrown
c46501b2de md/raid5: use seqcount to protect access to shape in make_request.
make_request() access various shape parameters (raid_disks, chunk_size
etc) which might be changed by raid5_start_reshape().

If the later is called at and awkward time during the form, the wrong
stripe_head might be used.

So introduce a 'seqcount' and after finding a stripe_head make sure
there is no reason to expect that we got the wrong one.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 16:58:36 +10:00
Shaohua Li
b721420e87 raid5: sysfs entry to control worker thread number
Add a sysfs entry to control running workqueue thread number. If
group_thread_cnt is set to 0, we will disable workqueue offload handling of
stripes.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 16:56:52 +10:00
Shaohua Li
851c30c9ba raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.

raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.

To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.

My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.

Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.

In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.

The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.

This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.

Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.

Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 16:46:38 +10:00
Shaohua Li
d265d9dc1d raid5: fix stripe release order
patch "make release_stripe lockless" changes the order stripes are released.
Originally I thought block layer can take care of request merge, but it appears
there are still some requests not merged. It's easy to fix the order.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 16:36:26 +10:00
Shaohua Li
773ca82fa1 raid5: make release_stripe lockless
release_stripe still has big lock contention. We just add the stripe to a llist
without taking device_lock. We let the raid5d thread to do the real stripe
release, which must hold device_lock anyway. In this way, release_stripe
doesn't hold any locks.

The side effect is the released stripes order is changed. But sounds not a big
deal, stripes are never handled in order. And I thought block layer can already
do nice request merge, which means order isn't that important.

I kept the unplug release batch, which is unnecessary with this patch from lock
contention avoid point of view, and actually if we delete it, the stripe_head
release_list and lru can share storage. But the unplug release batch is also
helpful for request merge. We probably can delay wakeup raid5d till unplug, but
I'm still afraid of the case which raid5d is running.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 11:55:53 +10:00
NeilBrown
260fa034ef md: avoid deadlock when dirty buffers during md_stop.
When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.

However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.

So do_md_stop() calls sync_blockdev().  However at this point it is
holding ->reconfig_mutex.  So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex.  So we deadlock.

We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.

So introduce new flag MD_STILL_CLOSED.  Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it gets set before we lock against further
opens.

It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl.  Just don't do that.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27 16:45:00 +10:00
NeilBrown
7a0a5355cb md: Don't test all of mddev->flags at once.
mddev->flags is mostly used to record if an update of the
metadata is needed.  Sometimes the whole field is tested
instead of just the important bits.  This makes it difficult
to introduce more state bits.

So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27 16:28:23 +10:00
Dave Jones
c9ad020fec md: Fix apparent cut-and-paste error in super_90_validate
Setting a variable to itself probably wasn't the intention here.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27 16:06:17 +10:00
NeilBrown
275c51c4e3 md: fix safe_mode buglet.
Whe we set the safe_mode_timeout to a smaller value we trigger a timeout
immediately - otherwise the small value might not be honoured.
However if the previous timeout was 0 meaning "no timeout", we didn't.
This would mean that no timeout happens until the next write completes,
which could be a long time.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27 16:05:43 +10:00
NeilBrown
60559da4d8 md: don't call md_allow_write in get_bitmap_file.
There is no really need as GFP_NOIO is very likely sufficient,
and failure is not catastrophic.

Calling md_allow_write here will convert a read-auto array to
read/write which could be confusing when you are just performing
a read operation.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-27 16:05:32 +10:00
Hannes Reinecke
7e782af576 [SCSI] Return ENODATA on medium error
When a medium error is detected the SCSI stack should return
ENODATA to the upper layers.

[jejb: fix whitespace error]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2013-08-23 12:54:53 -04:00
Joe Thornber
f722063ee0 dm space map: optimise sm_ll_dec and sm_ll_inc
Prior to this patch these methods did a lookup followed by an insert.
Instead they now call a common mutate function that adjusts the value
according to a callback function.  This avoids traversing the data
structures twice and hence improves performance.

Also factor out sm_ll_lookup_big_ref_count() for use by both
sm_ll_lookup() and sm_ll_mutate().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-08-23 09:02:14 -04:00
Joe Thornber
04f17c802f dm btree: prefetch child nodes when walking tree for a dm_btree_del
dm-btree now takes advantage of dm-bufio's ability to prefetch data via
dm_bm_prefetch().  Prior to this change many btree node visits were
causing a synchronous read.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-08-23 09:02:14 -04:00
Joe Thornber
cd5acf0b44 dm btree: use pop_frame in dm_btree_del to cleanup code
Remove a visited leaf straight away from the stack, rather than
marking all it's children as visited and letting it get removed on the
next iteration.  May also offer a micro optimisation in dm_btree_del.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-08-23 09:02:14 -04:00
Mike Snitzer
c9ec5d7c7b dm cache: eliminate holes in cache structure
Reorder members in the cache structure to eliminate 6 out of 7 holes
(reclaiming 24 bytes).  Also, the 'worker' and 'waker' members no longer
straddle cachelines.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:14 -04:00
Mike Snitzer
f610937214 dm cache: fix stacking of geometry limits
Do not blindly override the queue limits (specifically io_min and
io_opt).  Allow traditional stacking of these limits if io_opt is a
factor of the cache's data block size.

Without this patch mkfs.xfs does not recognize the cache device's
provided limits as a useful geometry (e.g. raid) so these hints are
ignored.  This was due to setting io_min to a useless value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:14 -04:00
Mike Snitzer
0cc67cd9c5 dm thin: fix stacking of geometry limits
Do not blindly override the queue limits (specifically io_min and
io_opt).  Allow traditional stacking of these limits if io_opt is a
factor of the thin-pool's data block size.

Without this patch mkfs.xfs does not recognize the thin device's
provided limits as a useful geometry (e.g. raid) so these hints are
ignored.  This was due to setting io_min to a useless value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:14 -04:00
Mike Snitzer
0547304463 dm cache: add data block size limits to code and Documentation
Place upper bound on the cache's data block size (1GB).

Inform users that the data block size can't be any arbitrary number,
i.e. its value must be between 32KB and 1GB.  Also, it should be a
multiple of 32KB.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:13 -04:00
Tejun Heo
670368a8dd dm: stop using WQ_NON_REENTRANT
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away.  Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-23 09:02:13 -04:00
Geert Uytterhoeven
b936bf8b78 dm cache: avoid conflicting remove_mapping() in mq policy
On sparc32, which includes <linux/swap.h> from <asm/pgtable_32.h>:

drivers/md/dm-cache-policy-mq.c:962:13: error: conflicting types for 'remove_mapping'
include/linux/swap.h:285:12: note: previous declaration of 'remove_mapping' was here

As mq_remove_mapping() already exists, and the local remove_mapping() is
used only once, inline it manually to avoid the conflict.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-08-16 15:56:51 -04:00
Linus Torvalds
c271f5bc9a 2 more bugfixes for md in 3.11
Both marked for -stable, both since 3.3.  I guess I should spend more
 time testing...
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUfDTjDnsnt1WYoG5AQKMkw//dbNZsOt1M1LvXUcGiDwBrcewltbXMmju
 7rNkDP6YhOng+p4IRNsvjbA3NplBHbKDPnhzabT8MT+fT7039XYaAFkpI++Q5ylm
 9owub1Ym1eZTrSjCYmp5sK+5OXoUX5kaVsQcPqpb9Pk42mSYepnjpZrj2/EMGHlM
 ud6Ih9OicONM967DGRo5MP2bp82eJVe2yJ5Tg4I9gWVPkDoPYLjTAutahPQ0vXr9
 SS3o+L1Ewsb0CicllirnjcroFFfeEO2b7kIXWC8PUzoR1XEa2wRxeNIo1M9vXKuV
 9YMfLXBUTbzGBx/dA9rW709Lr9UG8rOXRbJcv/I1moidpeEWipsSRg3HeEV7TYq0
 5oY+JW4/gzyepDv7sgm1Mc1wMf4C4/B+WUcgDp5e2EurKWYGx2HzHYWaovyLU6s3
 OsaOUwpivIvQglbW+dEhRbUGXXi9J762H8DT2JmPliA0ylfPwqyAvks+nIIC7TPF
 RUpgYHyBYRxUcYad5Df0eX1pd1KVATR3F7UIWzYTB/aG3+lsprY3jB+oWlAFmIg8
 ExVCuaUNZdW8MQUU4xLFgJjUv+gZ2UJAjCv49Wj+nvJD/W/LgbLCZ1Xosk9CwfVy
 mc9SnLdfUsseM3PE6fBWVNcSO1hddIo1w1hL+xZix36kY7E9Pi8TLiJwB8q2/WRl
 4Yo+g5LQV0w=
 =68H7
 -----END PGP SIGNATURE-----

Merge tag 'md/3.11-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Two more bugfixes for md in 3.11

  Both marked for -stable, both since 3.3.  I guess I should spend more
  time testing..."

* tag 'md/3.11-fixes' of git://neil.brown.name/md:
  md/raid5: fix interaction of 'replace' and 'recovery'.
  md/raid10: remove use-after-free bug.
2013-07-26 11:20:10 -07:00
NeilBrown
f94c0b6658 md/raid5: fix interaction of 'replace' and 'recovery'.
If a device in a RAID4/5/6 is being replaced while another is being
recovered, then the writes to the replacement device currently don't
happen, resulting in corruption when the replacement completes and the
new drive takes over.

This is because the replacement writes are only triggered when
's.replacing' is set and not when the similar 's.sync' is set (which
is the case during resync and recovery - it means all devices need to
be read).

So schedule those writes when s.replacing is set as well.

In this case we cannot use "STRIPE_INSYNC" to record that the
replacement has happened as that is needed for recording that any
parity calculation is complete.  So introduce STRIPE_REPLACED to
record if the replacement has happened.

For safety we should also check that STRIPE_COMPUTE_RUN is not set.
This has a similar effect to the "s.locked == 0" test.  The latter
ensure that now IO has been flagged but not started.  The former
checks if any parity calculation has been flagged by not started.
We must wait for both of these to complete before triggering the
'replace'.

Add a similar test to the subsequent check for "are we finished yet".
This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
but it makes it more obvious that the REPLACE will happen before we
think we are finished.

Finally if a NeedReplace device is not UPTODATE then that is an
error.  We really must trigger a warning.

This bug was introduced in commit 9a3e1101b8
(md/raid5:  detect and handle replacements during recovery.)
which introduced replacement for raid5.
That was in 3.3-rc3, so any stable kernel since then would benefit
from this fix.

Cc: stable@vger.kernel.org (3.3+)
Reported-by: qindehua <13691222965@163.com>
Tested-by: qindehua <qindehua@163.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-25 16:46:57 +10:00
NeilBrown
0eb25bb027 md/raid10: remove use-after-free bug.
We always need to be careful when calling generic_make_request, as it
can start a chain of events which might free something that we are
using.

Here is one place I wasn't careful enough.  If the wbio2 is not in
use, then it might get freed at the first generic_make_request call.
So perform all necessary tests first.

This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
oops, so fix is suitable for any -stable since then.

Cc: stable@vger.kernel.org (3.3+)
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-25 16:46:53 +10:00
Linus Torvalds
d4c90b1b9f Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver bits from Jens Axboe:
 "As I mentioned in the core block pull request, due to real life
  circumstances the driver pull request would be late.  Now it looks
  like -rc2 late...  On the plus side, apart form the rsxx update, these
  are all things that I could argue could go in later in the cycle as
  they are fixes and not features.  So even though things are late, it's
  not ALL bad.

  The pull request contains:

   - Updates to bcache, all bug fixes, from Kent.

   - A pile of drbd bug fixes (no big features this time!).

   - xen blk front/back fixes.

   - rsxx driver updates, some of them deferred form 3.10.  So should be
     well cooked by now"

* 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
  bcache: Allocation kthread fixes
  bcache: Fix GC_SECTORS_USED() calculation
  bcache: Journal replay fix
  bcache: Shutdown fix
  bcache: Fix a sysfs splat on shutdown
  bcache: Advertise that flushes are supported
  bcache: check for allocation failures
  bcache: Fix a dumb race
  bcache: Use standard utility code
  bcache: Update email address
  bcache: Delete fuzz tester
  bcache: Document shrinker reserve better
  bcache: FUA fixes
  drbd: Allow online change of al-stripes and al-stripe-size
  drbd: Constants should be UPPERCASE
  drbd: Ignore the exit code of a fence-peer handler if it returns too late
  drbd: Fix rcu_read_lock balance on error path
  drbd: fix error return code in drbd_init()
  drbd: Do not sleep inside rcu
  bcache: Refresh usage docs
  ...
2013-07-22 19:02:52 -07:00
NeilBrown
30bc9b5387 md/raid1: fix bio handling problems in process_checks()
Recent change to use bio_copy_data() in raid1 when repairing
an array is faulty.

The underlying may have changed the bio in various ways using
bio_advance and these need to be undone not just for the 'sbio' which
is being copied to, but also the 'pbio' (primary) which is being
copied from.

So perform the reset on all bios that were read from and do it early.

This also ensure that the sbio->bi_io_vec[j].bv_len passed to
memcmp is correct.

This fixes a crash during a 'check' of a RAID1 array.  The crash was
introduced in 3.10 so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-18 14:18:04 +10:00
NeilBrown
5024c29831 md: Remove recent change which allows devices to skip recovery.
commit 7ceb17e87b
    md: Allow devices to be re-added to a read-only array.

allowed a bit more than just that.  It also allows devices to be added
to a read-write array and to end up skipping recovery.

This patch removes the offending piece of code pending a rewrite for a
subsequent release.

More specifically:
 If the array has a bitmap, then the device will still need a bitmap
 based resync ('saved_raid_disk' is set under different conditions
 is a bitmap is present).
 If the array doesn't have a bitmap, then this is correct as long as
 nothing has been written to the array since the metadata was checked
 by ->validate_super.  However there is no locking to ensure that there
 was no write.

Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-18 14:18:03 +10:00
NeilBrown
7bb23c4934 md/raid10: fix two problems with RAID10 resync.
1/ When an different between blocks is found, data is copied from
   one bio to the other.  However bv_len is used as the length to
   copy and this could be zero.  So use r10_bio->sectors to calculate
   length instead.
   Using bv_len was probably always a bit dubious, but the introduction
   of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
   except on bios that we schedule for a read.  This ensures that
   missing/failed devices don't confuse the loop at the top of
   sync_request write.
   Commit 8be185f2c9 "raid10: Use bio_reset()"
   removed a loop which set BIO_UPTDATE on all appropriate bios.
   So we need to re-add that flag.

These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-18 14:18:01 +10:00
Kent Overstreet
79826c35eb bcache: Allocation kthread fixes
The alloc kthread should've been using try_to_freeze() - and also there
was the potential for the alloc kthread to get woken up after it had
shut down, which would have been bad.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-07-12 00:22:49 -07:00
Kent Overstreet
29ebf465b9 bcache: Fix GC_SECTORS_USED() calculation
Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:48 -07:00
Kent Overstreet
faa5673617 bcache: Journal replay fix
The journal replay code starts by finding something that looks like a
valid journal entry, then it does a binary search over the unchecked
region of the journal for the journal entries with the highest sequence
numbers.

Trouble is, the logic was wrong - journal_read_bucket() returns true if
it found journal entries we need, but if the range of journal entries
we're looking for loops around the end of the journal - in that case
journal_read_bucket() could return true when it hadn't found the highest
sequence number we'd seen yet, and in that case the binary search did
the wrong thing. Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:48 -07:00
Kent Overstreet
5caa52afc5 bcache: Shutdown fix
Stopping a cache set is supposed to make it stop attached backing
devices, but somewhere along the way that code got lost. Fixing this
mainly has the effect of fixing our reboot notifier.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:47 -07:00
Kent Overstreet
c9502ea442 bcache: Fix a sysfs splat on shutdown
If we stopped a bcache device when we were already detaching (or
something like that), bcache_device_unlink() would try to remove a
symlink from sysfs that was already gone because the bcache dev kobject
had already been removed from sysfs.

So keep track of whether we've removed stuff from sysfs.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:47 -07:00
Kent Overstreet
54d12f2b4f bcache: Advertise that flushes are supported
Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
out unless we say we support them...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:46 -07:00
Dan Carpenter
d2a65ce2ac bcache: check for allocation failures
There is a missing NULL check after the kzalloc().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2013-07-12 00:22:46 -07:00
Kent Overstreet
6aa8f1a6ca bcache: Fix a dumb race
In the far-too-complicated closure code - closures can have destructors,
for probably dubious reasons; they get run after the closure is no
longer waiting on anything but before dropping the parent ref, intended
just for freeing whatever memory the closure is embedded in.

Trouble is, when remaining goes to 0 and we've got nothing more to run -
we also have to unlock the closure, setting remaining to -1. If there's
a destructor, that unlock isn't doing anything - nobody could be trying
to lock it if we're about to free it - but if the unlock _is needed...
that check for a destructor was racy. Argh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:33 -07:00
Linus Torvalds
9903883f1d Add a device-mapper target called dm-switch to provide a multipath
framework for storage arrays that dynamically reconfigure their
 preferred paths for different device regions.
 
 Fix a bug in the verity target that prevented its use with some
 specific sizes of devices.
 
 Improve some locking mechanisms in the device-mapper core and bufio.
 
 Add Mike Snitzer as a device-mapper maintainer.
 
 A few more clean-ups and fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJR3ehdAAoJEK2W1qbAHj1nseUP+gPgoX2YTBiKW/fQnbixb11c
 0BExXiHtHgVnxQP4aJo8BJRFW9/DAN740UvKb2XjjbNChIQ47j6vOLCCzJ+97wW+
 FCJ48pltsacgywvm5e3BbnwmcmpQXKk1Wd+1/9beWbcib9IzVB2B06Esv3HRtQZj
 cQbIkeeTGbrSnsiAWSQh2xsNqjv1YObUohs43uG+Pa0WmdE1KebAYfkgEvi0b+E6
 ehSsvAMqYRgkLvYdYTxRNJtC+H3pkucS6r42Q/tZj2YciU3tc0v6rsFW9Ey+l0E7
 c5KaUAKk5e3HAhFvJ4ydlj7r1cu7G49rixIBJ60lX86QBwmZ8js5EEPliw0ZoWI+
 av1P+9gLsxaQTH/Cw8jJW4xK7hYAZAvn//iNVBAATATd65nmQImHNWWMjr205Kw9
 9XOeFUxAdnM7ITKXJkFf3vH2tFrRAKgXiR57im5ZuLMOFYWjR6EYE870+GCWSya8
 Dhzj0Mb8IFHrelEbRWicNbD5IaAxvfQ6/sTvXBiV642jImkQIyIj+PBiIvsq8fTH
 LKNL1l545R5aOHSU4TXnseq3TcIqElx0KsPTJuZq+q/2UfvMe9Lv9g+ld5CywfH1
 1HkEB75yWPvEfOtIac9tzQSt3KnF01fC2QMYZE4rSiYs8KPgln9pxo+UulUaZzId
 8Gch3/C5cBBCHjMJtv/b
 =s5m4
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper changes from Alasdair G Kergon:
 "Add a device-mapper target called dm-switch to provide a multipath
  framework for storage arrays that dynamically reconfigure their
  preferred paths for different device regions.

  Fix a bug in the verity target that prevented its use with some
  specific sizes of devices.

  Improve some locking mechanisms in the device-mapper core and bufio.

  Add Mike Snitzer as a device-mapper maintainer.

  A few more clean-ups and fixes"

* tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: add switch target
  dm: update maintainers
  dm: optimize reorder structure
  dm: optimize use SRCU and RCU
  dm bufio: submit writes outside lock
  dm cache: fix arm link errors with inline
  dm verity: use __ffs and __fls
  dm flakey: correct ctr alloc failure mesg
  dm verity: remove pointless comparison
  dm: use __GFP_HIGHMEM in __vmalloc
  dm verity: fix inability to use a few specific devices sizes
  dm ioctl: set noio flag to avoid __vmalloc deadlock
  dm mpath: fix ioctl deadlock when no paths
2013-07-11 13:05:40 -07:00
Jim Ramsay
9d0eb0ab43 dm: add switch target
dm-switch is a new target that maps IO to underlying block devices
efficiently when there is a large number of fixed-sized address regions
but there is no simple pattern to allow for a compact mapping
representation such as dm-stripe.

Though we have developed this target for a specific storage device, Dell
EqualLogic, we have made an effort to keep it as general purpose as
possible in the hope that others may benefit.

Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.

Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:19 +01:00
Mikulas Patocka
2a7faeb176 dm: optimize reorder structure
This reorder actually improves performance by 20% (from 39.1s to 32.8s)
on x86-64 quad core Opteron.

I have no explanation for this, possibly it makes some other entries are
better cache-aligned.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:18 +01:00
Mikulas Patocka
83d5e5b0af dm: optimize use SRCU and RCU
This patch removes "io_lock" and "map_lock" in struct mapped_device and
"holders" in struct dm_table and replaces these mechanisms with
sleepable-rcu.

Previously, the code would call "dm_get_live_table" and "dm_table_put" to
get and release table. Now, the code is changed to call "dm_get_live_table"
and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and
dm_put_live_table unlocks it.

dm_get_live_table_fast/dm_put_live_table_fast can be used instead of
dm_get_live_table/dm_put_live_table. These *_fast functions use
non-sleepable RCU, so the caller must not block between them.

If the code changes active or inactive dm table, it must call
dm_sync_table before destroying the old table.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:18 +01:00
Mikulas Patocka
2480945cd4 dm bufio: submit writes outside lock
This patch changes dm-bufio so that it submits write I/Os outside of the
lock. If the number of submitted buffers is greater than the number of
requests on the target queue, submit_bio blocks. We want to block outside
of the lock to improve latency of other threads that may need the lock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:18 +01:00
Mikulas Patocka
43aeaa2957 dm cache: fix arm link errors with inline
Use __always_inline to avoid a link failure with gcc 4.6 on ARM.
gcc 4.7 is OK.

It creates a function block_div.part.8, it references __udivdi3 and
__umoddi3 and it is never called. The references to __udivdi3 and
__umoddi3 cause a link failure.

Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Mikulas Patocka
553d8fe029 dm verity: use __ffs and __fls
This patch changes ffs() to __ffs() and fls() to __fls() which don't add
one to the result.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Alasdair G Kergon
75e3a0f55b dm flakey: correct ctr alloc failure mesg
Remove the reference to the "linear" target from the error message
issued when allocation fails in the flakey target.

Cc: Robin Dong <sanbai@taobao.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Mikulas Patocka
5d8be84397 dm verity: remove pointless comparison
Remove num < 0 test in verity_ctr because num is unsigned.
(Found by Coverity.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Mikulas Patocka
220cd058d9 dm: use __GFP_HIGHMEM in __vmalloc
Use __GFP_HIGHMEM in __vmalloc.

Pages allocated with __vmalloc can be allocated in high memory that is not
directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc
does. This patch reduces memory pressure slightly because pages can be
allocated in the high zone.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:16 +01:00
Mikulas Patocka
b1bf2de072 dm verity: fix inability to use a few specific devices sizes
Fix a boundary condition that caused failure for certain device sizes.

The problem is reported at
  http://code.google.com/p/cryptsetup/issues/detail?id=160

For certain device sizes the number of hashes at a specific level was
calculated incorrectly.

It happens for example for a device with data and metadata block size 4096
that has 16385 blocks and algorithm sha256.

The user can test if he is affected by this bug by running the
"veritysetup verify" command and also by activating the dm-verity kernel
driver and reading the whole block device. If it passes without an error,
then the user is not affected.

The condition for the bug is:

Split the total number of data blocks (data_block_bits) into bit strings,
each string has hash_per_block_bits bits. hash_per_block_bits is
rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you
can say that you convert data_blocks_bits to 2^hash_per_block_bits base.

If there some zero bit string below the most significant bit string and at
least one bit below this zero bit string is set, then the bug happens.

The same bug exists in the userspace veritysetup tool, so you must use
fixed veritysetup too if you want to use devices that are affected by
this boundary condition.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 3.4+
Cc: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:16 +01:00
Mikulas Patocka
1c0e883e86 dm ioctl: set noio flag to avoid __vmalloc deadlock
Set noio flag while calling __vmalloc() because it doesn't fully respect
gfp flags to avoid a possible deadlock (see commit
502624bdad).

This should be backported to stable kernels 3.8 and newer. The kernel 3.8
doesn't have memalloc_noio_save(), so we should set and restore process
flag PF_MEMALLOC instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:15 +01:00
Hannes Reinecke
6c182cd88d dm mpath: fix ioctl deadlock when no paths
When multipath needs to retry an ioctl the reference to the
current live table needs to be dropped. Otherwise a deadlock
occurs when all paths are down:
- dm_blk_ioctl takes a reference to the current table
  and spins in multipath_ioctl().
- A new table is being loaded, but upon resume the process
  hangs in dm_table_destroy() waiting for references to
  drop to zero.

With this patch the reference to the old table is dropped
prior to retry, thereby avoiding the deadlock.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:15 +01:00
Linus Torvalds
80cc38b163 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  treewide: relase -> release
  Documentation/cgroups/memory.txt: fix stat file documentation
  sysctl/net.txt: delete reference to obsolete 2.4.x kernel
  spinlock_api_smp.h: fix preprocessor comments
  treewide: Fix typo in printk
  doc: device tree: clarify stuff in usage-model.txt.
  open firmware: "/aliasas" -> "/aliases"
  md: bcache: Fixed a typo with the word 'arithmetic'
  irq/generic-chip: fix a few kernel-doc entries
  frv: Convert use of typedef ctl_table to struct ctl_table
  sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
  doc: clk: Fix incorrect wording
  Documentation/arm/IXP4xx fix a typo
  Documentation/networking/ieee802154 fix a typo
  Documentation/DocBook/media/v4l fix a typo
  Documentation/video4linux/si476x.txt fix a typo
  Documentation/virtual/kvm/api.txt fix a typo
  Documentation/early-userspace/README fix a typo
  Documentation/video4linux/soc-camera.txt fix a typo
  lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
  ...
2013-07-04 11:40:58 -07:00
NeilBrown
1376512065 md/raid10: fix bug which causes all RAID10 reshapes to move no data.
The recent comment:
commit 7e83ccbecd
    md/raid10: Allow skipping recovery when clean arrays are assembled

Causes raid10 to skip a recovery in certain cases where it is safe to
do so.  Unfortunately it also causes a reshape to be skipped which is
never safe.  The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moves so the
array will now contain garbage.
(If nothing is written, you can recovery by simple performing the
reverse reshape which will also complete instantly).

Bug was introduced in 3.10, so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-04 16:42:57 +10:00
NeilBrown
fdcfbbb653 md/raid5: allow 5-device RAID6 to be reshaped to 4-device.
There is a bug in 'check_reshape' for raid5.c  To checks
that the new minimum number of devices is large enough (which is
good), but it does so also after the reshape has started (bad).

This is bad because
 - the calculation is now wrong as mddev->raid_disks has changed
   already, and
 - it is pointless because it is now too late to stop.

So only perform that test when reshape has not been committed to.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-04 16:42:52 +10:00
NeilBrown
78eaa0d4cb md/raid10: fix two bugs affecting RAID10 reshape.
1/ If a RAID10 is being reshaped to a fewer number of devices
 and is stopped while this is ongoing, then when the array is
 reassembled the 'mirrors' array will be allocated too small.
 This will lead to an access error or memory corruption.

2/ A sanity test for a reshaping RAID10 array is restarted
 is slightly incorrect.

Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.

Cc: stable@vger.kernel.org (v3.5+)
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-03 09:43:28 +10:00
Kent Overstreet
8e51e414a3 bcache: Use standard utility code
Some of bcache's utility code has made it into the rest of the kernel,
so drop the bcache versions.

Bcache used to have a workaround for allocating from a bio set under
generic_make_request() (if you allocated more than once, the bios you
already allocated would get stuck on current->bio_list when you
submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
when allocating bios under generic_make_request() so that allocation
could fail and it could retry from workqueue. But bio_alloc_bioset() has
a workaround now, so we can drop this hack and the associated error
handling.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-07-01 14:43:53 -07:00
Kent Overstreet
f3059a5461 bcache: Delete fuzz tester
This code has rotted and it hasn't been used in ages anyways.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-07-01 14:43:48 -07:00
Kent Overstreet
36c9ea9837 bcache: Document shrinker reserve better
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-07-01 14:42:48 -07:00
Kent Overstreet
e49c7c374e bcache: FUA fixes
Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
writes have... weird ordering requirements.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-07-01 14:42:47 -07:00
Gabriel de Perthuis
ab9e14002e bcache: Send label uevents
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 21:58:06 -07:00
Gabriel de Perthuis
a25c32bede bcache: Send a uevent with a cached device's UUID
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
2013-06-26 21:58:05 -07:00
Kent Overstreet
72c270612b bcache: Write out full stripes
Now that we're tracking dirty data per stripe, we can add two
optimizations for raid5/6:

 * If a stripe is already dirty, force writes to that stripe to
   writeback mode - to help build up full stripes of dirty data

 * When flushing dirty data, preferentially write out full stripes first
   if there are any.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 21:58:04 -07:00
Kent Overstreet
279afbad4e bcache: Track dirty data by stripe
To make background writeback aware of raid5/6 stripes, we first need to
track the amount of dirty data within each stripe - we do this by
breaking up the existing sectors_dirty into per stripe atomic_ts

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 21:57:23 -07:00
Kent Overstreet
444fc0b6b1 bcache: Initialize sectors_dirty when attaching
Previously, dirty_data wouldn't get initialized until the first garbage
collection... which was a bit of a problem for background writeback (as
the PD controller keys off of it) and also confusing for users.

This is also prep work for making background writeback aware of raid5/6
stripes.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:16 -07:00
Kent Overstreet
6ded34d1a5 bcache: Improve lazy sorting
The old lazy sorting code was kind of hacky - rewrite in a way that
mathematically makes more sense; the idea is that the size of the sets
of keys in a btree node should increase by a more or less fixed ratio
from smallest to biggest.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:16 -07:00
Kent Overstreet
85b1492ee1 bcache: Rip out pkey()/pbtree()
Old gcc doesnt like the struct hack, and it is kind of ugly. So finish
off the work to convert pr_debug() statements to tracepoints, and delete
pkey()/pbtree().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:15 -07:00
Kent Overstreet
c37511b863 bcache: Fix/revamp tracepoints
The tracepoints were reworked to be more sensible, and fixed a null
pointer deref in one of the tracepoints.

Converted some of the pr_debug()s to tracepoints - this is partly a
performance optimization; it used to be that with DEBUG or
CONFIG_DYNAMIC_DEBUG pr_debug() was an empty macro; but at some point it
was changed to an empty inline function.

Some of the pr_debug() statements had rather expensive function calls as
part of the arguments, so this code was getting run unnecessarily even
on non debug kernels - in some fast paths, too.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:15 -07:00
Kent Overstreet
5794351146 bcache: Refactor btree io
The most significant change is that btree reads are now done
synchronously, instead of asynchronously and doing the post read stuff
from a workqueue.

This was originally done because we can't block on IO under
generic_make_request(). But - we already have a mechanism to punt cache
lookups to workqueue if needed, so if we just use that we don't have to
deal with the complexity of doing things asynchronously.

The main benefit is this makes the locking situation saner; we can hold
our write lock on the btree node until we're finished reading it, and we
don't need that btree_node_read_done() flag anymore.

Also, for writes, btree_write() was broken out into btree_node_write()
and btree_leaf_dirty() - the old code with the boolean argument was dumb
and confusing.

The prio_blocked mechanism was improved a bit too, now the only counter
is in struct btree_write, we don't mess with transfering a count from
struct btree anymore.

This required changing garbage collection to block prios at the start
and unblock when it finishes, which is cleaner than what it was doing
anyways (the old code had mostly the same effect, but was doing it in a
convoluted way)

And the btree iter btree_node_read_done() uses was converted to a real
mempool.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:14 -07:00
Kent Overstreet
119ba0f828 bcache: Convert allocator thread to kthread
Using a workqueue when we just want a single thread is a bit silly.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:13 -07:00
Gabriel de Perthuis
a9dd53adbb bcache: Warn when a device is already registered.
Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:08:52 -07:00
Kent Overstreet
bbc77aa7fb bcache: fix a spurious gcc complaint, use scnprintf
An old version of gcc was complaining about using a const int as the
size of a stack allocated array. Which should be fine - but using
ARRAY_SIZE() is better, anyways.

Also, refactor the code to use scnprintf().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:06:33 -07:00
Kumar Amit Mehta
5c694129c8 md: bcache: io.c: fix a potential NULL pointer dereference
bio_alloc_bioset returns NULL on failure. This fix adds a missing check
for potential NULL pointer dereferencing.

Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:06:19 -07:00
Jonathan Brassow
c4a3955145 MD: Remember the last sync operation that was performed
MD:  Remember the last sync operation that was performed

This patch adds a field to the mddev structure to track the last
sync operation that was performed.  This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs.  If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check.  If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired.  etc.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-26 12:38:24 +10:00
NeilBrown
eea136d69f md: fix buglet in RAID5 -> RAID0 conversion.
RAID5 uses a 'per-array' value for the 'size' of each device.
RAID0 uses a 'per-device' value - it can be different for each device.

When converting a RAID5 to a RAID0 we must ensure that the per-device
size of each device matches the per-array size for the RAID5, else
the array will change size.

If the metadata cannot record a changed per-device size (as is the
case with v0.90 metadata) the array could get bigger on restart.  This
does not cause data corruption, so it not a big issue and is mainly
yet another a reason to not use 0.90.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-26 12:38:19 +10:00
Phil Viana
48a73025cb md: bcache: Fixed a typo with the word 'arithmetic'
The word 'arithmetic' was typed as 'arithmatic'

Signed-off-by: Phil Viana <phillip.l.viana@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-06-18 13:41:16 +02:00
NeilBrown
725d6e579f md/raid10: check In_sync flag in 'enough()'.
It isn't really enough to check that the rdev is present, we need to
also be sure that the device is still In_sync.

Doing this requires using rcu_dereference to access the rdev, and
holding the rcu_read_lock() to ensure the rdev doesn't disappear while
we look at it.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:27 +10:00
NeilBrown
635f6416a2 md/raid10: locking changes for 'enough()'.
As 'enough' accesses conf->prev and conf->geo, which can change
spontanously, it should guard against changes.
This can be done with device_lock as start_reshape holds device_lock
while updating 'geo' and end_reshape holds it while updating 'prev'.

So 'error' needs to hold 'device_lock'.

On the other hand, raid10_end_read_request knows which of the two it
really wants to access, and as it is an active request on that one,
the value cannot change underneath it.

So change _enough to take flag rather than a pointer, pass the
appropriate flag from raid10_end_read_request(), and remove the locking.

All other calls to 'enough' are made with reconfig_mutex held, so
neither 'prev' nor 'geo' can change.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:27 +10:00
Jingoo Han
b29bebd66d md: replace strict_strto*() with kstrto*()
The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:26 +10:00
Hannes Reinecke
90f5f7ad4f md: Wait for md_check_recovery before attempting device removal.
When a device has failed, it needs to be removed from the personality
module before it can be removed from the array as a whole.
The first step is performed by md_check_recovery() which is called
from the raid management thread.

So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
to have run.  This increases the chance that the ioctl will succeed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Neil Brown <nfbrown@suse.de>
2013-06-14 08:10:26 +10:00
NeilBrown
3f6bbd3ffd dm-raid: silence compiler warning on rebuilds_per_group.
This doesn't really need to be initialised, but it doesn't hurt,
silences the compiler, and as it is a counter it makes sense for it to
start at zero.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:26 +10:00
Jonathan Brassow
a4dc163a55 DM RAID: Fix raid_resume not reviving failed devices in all cases
DM RAID:  Fix raid_resume not reviving failed devices in all cases

When a device fails in a RAID array, it is marked as Faulty.  Later,
md_check_recovery is called which (through the call chain) calls
'hot_remove_disk' in order to have the personalities remove the device
from use in the array.

Sometimes, it is possible for the array to be suspended before the
personalities get their chance to perform 'hot_remove_disk'.  This is
normally not an issue.  If the array is deactivated, then the failed
device will be noticed when the array is reinstantiated.  If the
array is resumed and the disk is still missing, md_check_recovery will
be called upon resume and 'hot_remove_disk' will be called at that
time.  However, (for dm-raid) if the device has been restored,
a resume on the array would cause it to attempt to revive the device
by calling 'hot_add_disk'.  If 'hot_remove_disk' had not been called,
a situation is then created where the device is thought to concurrently
be the replacement and the device to be replaced.  Thus, the device
is first sync'ed with the rest of the array (because it is the replacement
device) and then marked Faulty and removed from the array (because
it is also the device being replaced).

The solution is to check and see if the device had properly been removed
before the array was suspended.  This is done by seeing whether the
device's 'raid_disk' field is -1 - a condition that implies that
'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1)
-> hot_remove_disk' has been called.  If 'raid_disk' is not -1, then
'hot_remove_disk' must be called to complete the removal of the previously
faulty device before it can be revived via 'hot_add_disk'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:25 +10:00
Jonathan Brassow
f381e71b04 DM RAID: Break-up untidy function
DM RAID:  Break-up untidy function

Clean-up excessive indentation by moving some code in raid_resume()
into its own function.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:25 +10:00
Jonathan Brassow
9092c02d94 DM RAID: Add ability to restore transiently failed devices on resume
DM RAID: Add ability to restore transiently failed devices on resume

This patch adds code to the resume function to check over the devices
in the RAID array.  If any are found to be marked as failed and their
superblocks can be read, an attempt is made to reintegrate them into
the array.  This allows the user to refresh the array with a simple
suspend and resume of the array - rather than having to load a
completely new table, allocate and initialize all the structures and
throw away the old instantiation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:24 +10:00
Linus Torvalds
82ea4be61f A few bugfixes for md
Some tagged for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69
 +zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh
 g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo
 VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K
 YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV
 IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4
 jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy
 tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ
 72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/
 ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK
 sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X
 aoj69nQ3i/4=
 =8iy3
 -----END PGP SIGNATURE-----

Merge tag 'md-3.10-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "A few bugfixes for md

  Some tagged for -stable"

* tag 'md-3.10-fixes' of git://neil.brown.name/md:
  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
  md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
  md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
  md: md_stop_writes() should always freeze recovery.
2013-06-13 10:13:29 -07:00
H. Peter Anvin
5026d7a9b2 md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME.  This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed.  This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.

However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.

[neilb: added raid5]

This patch is appropriate for any -stable since 3.7 when write_same
support was added.

Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 14:49:54 +10:00
NeilBrown
e2d5992522 md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.

Using raise_barrier will sometimes deadlock.  Using freeze_array
should not.

As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.

The deadlock was made particularly noticeable by commits
050b66152f (raid10) and 6b740b8d79 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.

This patch probably won't apply directly to some early kernels and
will need to be applied by hand.

Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 13:40:48 +10:00
Alex Lyakas
3056e3aec8 md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
Without that fix, the following scenario could happen:

- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io()  calls call_bio_endio()
- As a result, in call_bio_endio():
        if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.

So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.

[neilb - applied identical change to raid10 as well]

This bug can result in lost data, so it is suitable for any
-stable kernel.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 13:20:03 +10:00
NeilBrown
6b6204ee92 md: md_stop_writes() should always freeze recovery.
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.

However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward.  This can particularly cause problems or dm-raid.

So change __md_stop_writes() to always freeze recovery.  This is safe
and more predicatable.

Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 13:18:15 +10:00
Linus Torvalds
b2cc9c19e4 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Outside of bcache (which really isn't super big), these are all
  few-liners.  There are a few important fixes in here:

   - Fix blk pm sleeping when holding the queue lock

   - A small collection of bcache fixes that have been done and tested
     since bcache was included in this merge window.

   - A fix for a raid5 regression introduced with the bio changes.

   - Two important fixes for mtip32xx, fixing an oops and potential data
     corruption (or hang) due to wrong bio iteration on stacked devices."

* 'for-linus' of git://git.kernel.dk/linux-block:
  scatterlist: sg_set_buf() argument must be in linear mapping
  raid5: Initialize bi_vcnt
  pktcdvd: silence static checker warning
  block: remove refs to XD disks from documentation
  blkpm: avoid sleep when holding queue lock
  mtip32xx: Correctly handle bio->bi_idx != 0 conditions
  mtip32xx: Fix NULL pointer dereference during module unload
  bcache: Fix error handling in init code
  bcache: clarify free/available/unused space
  bcache: drop "select CLOSURES"
  bcache: Fix incompatible pointer type warning
2013-06-12 16:42:39 -07:00
Kent Overstreet
4997b72ee6 raid5: Initialize bi_vcnt
The patch that converted raid5 to use bio_reset() forgot to initialize
bi_vcnt.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Tested-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-30 08:44:39 +02:00
Alasdair G Kergon
610bba8b93 dm thin: fix metadata dev resize detection
Fix detection of the need to resize the dm thin metadata device.

The code incorrectly tried to extend the metadata device when it
didn't need to due to a merging error with patch 24347e9 ("dm thin:
detect metadata device resizing").

  device-mapper: transaction manager: couldn't open metadata space map
  device-mapper: thin metadata: tm_open_with_sm failed
  device-mapper: thin: aborting transaction failed
  device-mapper: thin: switching pool to failure mode

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-19 18:57:50 +01:00
Jens Axboe
c0a363f5cf Merge branch 'bcache-for-upstream' of git://evilpiepirate.org/~kent/linux-bcache into for-linus
Kent writes:

Jens - couple more bcache patches. Bug fixes and a doc update.
2013-05-15 10:36:25 +02:00
Kent Overstreet
f59fce847f bcache: Fix error handling in init code
This code appears to have rotted... fix various bugs and do some
refactoring.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-05-15 00:48:14 -07:00
Paul Bolle
bbb1c3b5ae bcache: drop "select CLOSURES"
The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig
symbol CLOSURES. That symbol was used in development versions of bcache,
but was removed when the closures code was no longer provided as a
kernel library. It can safely be dropped.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
2013-05-15 00:42:51 -07:00
Emil Goode
867e116206 bcache: Fix incompatible pointer type warning
The function pointer release in struct block_device_operations
should point to functions declared as void.

Sparse warnings:

drivers/md/bcache/super.c:656:27: warning:
	incorrect type in initializer (different base types)
	drivers/md/bcache/super.c:656:27:
	expected void ( *release )( ... )
	drivers/md/bcache/super.c:656:27:
	got int ( static [toplevel] *<noident> )( ... )

drivers/md/bcache/super.c:656:2: warning:
	initialization from incompatible pointer type [enabled by default]

drivers/md/bcache/super.c:656:2: warning:
	(near initialization for ‘bcache_ops.release’) [enabled by default]

Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-05-15 00:42:50 -07:00
Joe Thornber
2f14f4b51e dm cache: set config value
Share configuration option processing code between the dm cache
ctr and message functions.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Alasdair G Kergon
2c73c471fb dm cache: move config fns
Move process_config_option() in dm-cache-target.c to make the
next patch more readable.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Joe Thornber
ac8c3f3df6 dm thin: generate event when metadata threshold passed
Generate a dm event when the amount of remaining thin pool metadata
space falls below a certain level.

The threshold is taken to be a quarter of the size of the metadata
device with a minimum threshold of 4MB.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Joe Thornber
2fc48021f4 dm persistent metadata: add space map threshold callback
Add a threshold callback to dm persistent data space maps.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:20 +01:00
Joe Thornber
7c3d3f2a87 dm persistent data: add threshold callback to space map
Add a threshold callback function to the persistent data space map
interface for a subsequent patch to use.

dm-thin and dm-cache are interested in knowing when they're getting
low on metadata or data blocks.  This patch introduces a new method
for registering a callback against a threshold.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:20 +01:00
Joe Thornber
24347e9595 dm thin: detect metadata device resizing
Allow the dm thin pool metadata device to be extended.

Whenever a pool is resumed, detect whether the size of the metadata
device has increased, and if so, extend the metadata to use the new
space.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:19 +01:00
Joe Thornber
1921c56d95 dm persistent data: support space map resizing
Support extending a dm persistent data metadata space map.

The extend itself is implemented by switching back to the boostrap
allocator and pointing to the new space.  The extra bitmap indexes are
then allocated from the new space, and finally we switch back to the
proper space map ops and tweak the reference counts.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:19 +01:00
Joe Thornber
5d0db96d13 dm thin: open dev read only when possible
If a thin pool is created in read-only-metadata mode then only open the
metadata device read-only.

Previously it was always opened with FMODE_READ | FMODE_WRITE.

(Note that dm_get_device() still allows read-only dm devices to be used
read-write at the moment: If I create a read-only linear device for the
metadata, via dmsetup load --readonly, then I can still create a rw pool
out of it.)

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:19 +01:00
Joe Thornber
b17446df2e dm thin: refactor data dev resize
Refactor device size functions in preparation for similar metadata
device resizing functions.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:18 +01:00
Joe Thornber
8c5008fac4 dm cache: replace memcpy with struct assignment
Use struct assignment rather than memcpy in dm cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:18 +01:00
Joe Thornber
aeed1420a3 dm cache: fix typos in comments
Fix up some typos in dm-cache comments.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:18 +01:00
Alasdair G Kergon
e12c1fd9d6 dm cache policy: fix description of lookup fn
Correct the documented requirement on the return code from dm cache policy
lookup functions stated in the policy module header file.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:17 +01:00
Joe Thornber
88a488f624 dm persistent data: fix error message typos
Fix some typos in dm-space-map-metadata.c error messages.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:17 +01:00
Joe Thornber
f8350daf7a dm cache: tune migration throttling
Tune the dm cache migration throttling.

i) Issue a tick every second, just in case there's no i/o going through.

ii) Drop the migration threshold right down to something suitable for
background work.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:16 +01:00
Mike Snitzer
042bcef889 dm mpath: enable WRITE SAME support
Enable WRITE SAME support in dm multipath.  As far as multipath is
concerned it is just another write request.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Bharata B Rao <bharata.rao@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:16 +01:00
Mike Snitzer
dc019b21fb dm table: fix write same support
If device_not_write_same_capable() returns true then the iterate_devices
loop in dm_table_supports_write_same() should return false.

Reported-by: Bharata B Rao <bharata.rao@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:16 +01:00
Mikulas Patocka
502624bdad dm bufio: avoid a possible __vmalloc deadlock
This patch uses memalloc_noio_save to avoid a possible deadlock in
dm-bufio.  (it could happen only with large block size, at most
PAGE_SIZE << MAX_ORDER (typically 8MiB).

__vmalloc doesn't fully respect gfp flags. The specified gfp flags are
used for allocation of requested pages, structures vmap_area, vmap_block
and vm_struct and the radix tree nodes.

However, the kernel pagetables are allocated always with GFP_KERNEL.
Thus the allocation of pagetables can recurse back to the I/O layer and
cause a deadlock.

This patch uses the function memalloc_noio_save to set per-process
PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore
it. When this flag is set, all allocations in the process are done with
implied GFP_NOIO flag, thus the deadlock can't happen.

This should be backported to stable kernels, but they don't have the
PF_MEMALLOC_NOIO flag and memalloc_noio_save/memalloc_noio_restore
functions. So, PF_MEMALLOC should be set and restored instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:15 +01:00
Wei Yongjun
09e8b81389 dm snapshot: fix error return code in snapshot_ctr
Return -ENOMEM instead of success if unable to allocate pending
exception mempool in snapshot_ctr.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:15 +01:00
Wei Yongjun
fa4d683af3 dm cache: fix error return code in cache_create
Return -ENOMEM if memory allocation fails in cache_create
instead of 0 (to avoid NULL pointer dereference).

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:14 +01:00
Mike Snitzer
d793e68427 dm stripe: fix regression in stripe_width calculation
Fix a regression in the calculation of the stripe_width in the
dm stripe target which led to incorrect processing of device limits.

The stripe_width is the stripe device length divided by the number of
stripes.  The group of commits in the range f14fa69 ("dm stripe: fix
size test") to eb850de ("dm stripe: support for non power of 2
chunksize") interfered with each other (a merging error) and led to the
stripe_width being set incorrectly to the stripe device length divided by
chunk_size * stripe_count.

For example, a stripe device's table with: 0 33553920 striped 3 512 ...
should result in a stripe_width of 11184640 (33553920 / 3), but due to
the bug it was getting set to 21845 (33553920 / (512 * 3)).

The impact of this bug is that device topologies that previously worked
fine with the stripe target are no longer considered valid.  In
particular, there is a higher risk of seeing this issue if one of the
stripe devices has a 4K logical block size.  Resulting in an error
message like this:
"device-mapper: table: 253:4: len=21845 not aligned to h/w logical block size 4096 of dm-1"

The fix is to swap the order of the divisions and to use a temporary
variable for the second one, so that width retains the intended
value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.6+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:14 +01:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Al Viro
db2a144bed block_device_operations->release() should return void
The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-07 02:16:21 -04:00
Kent Overstreet
ee66850642 bcache: Use bd_link_disk_holder()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-30 19:14:43 -07:00
Kent Overstreet
86b26b824c bcache: Allocator cleanup/fixes
The main fix is that bch_allocator_thread() wasn't waiting on
garbage collection to finish (if invalidate_buckets had set
ca->invalidate_needs_gc); we need that to make sure the allocator
doesn't spin and potentially block gc from finishing.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-30 19:14:40 -07:00
Shaohua Li
32f9f570d0 MD: ignore discard request for hard disks of hybid raid1/raid10 array
In SSD/hard disk hybid storage, discard request should be ignored for hard
disk. We used to be doing this way, but the unplug path forgets it.

This is suitable for stable tree since v3.6.

Cc: stable@vger.kernel.org
Reported-and-tested-by: Markus <M4rkusXXL@web.de>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30 14:49:36 +10:00
NeilBrown
486adf72cc md: bad block list should default to disabled.
Maintenance of a bad-block-list currently defaults to 'enabled'
and is then disabled when it cannot be supported.
This is backwards and causes problem for dm-raid which didn't know
to disable it.

So fix the defaults, and only enabled for v1.x metadata which
explicitly has bad blocks enabled.

The problem with dm-raid has been present since badblock support was
added in v3.1, so this patch is suitable for any -stable from 3.1
onwards.

Cc: stable@vger.kernel.org (3.1+)
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30 14:49:32 +10:00
Hirokazu Takahashi
0fea7ed82b md: raid1/raid10 md devices leak memory when stopping
Hi.

Raid1 and raid10 devices leak memory every time they stop.
This is a patch for linux-3.9.0-rc7 to fix this problem.

Thanks,
Hirokazu Takahashi.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30 14:49:26 +10:00
Kent Overstreet
8abb2a5dba bcache: Make sure blocksize isn't smaller than device blocksize
Sanity check to make sure we don't end up doing IO the device doesn't
support.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-24 13:07:39 -07:00
Jonathan Brassow
be83651f00 DM RAID: Add message/status support for changing sync action
DM RAID:  Add message/status support for changing sync action

This patch adds a message interface to dm-raid to allow the user to more
finely control the sync actions being performed by the MD driver.  This
gives the user the ability to initiate "check" and "repair" (i.e. scrubbing).
Two additional fields have been appended to the status output to provide more
information about the type of sync action occurring and the results of those
actions, specifically: <sync_action> and <mismatch_cnt>.  These new fields
will always be populated.  This is essentially the device-mapper way of doing
what MD controls through the 'sync_action' sysfs file and shows through the
'mismatch_cnt' sysfs file.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:43 +10:00
Jonathan Brassow
a91d5ac048 MD: Export 'md_reap_sync_thread' function
MD: Export 'md_reap_sync_thread' function

Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c.
- rename reap_sync_thread to md_reap_sync_thread
- move the fn after md_check_recovery to match md.h declaration placement
- export md_reap_sync_thread

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:43 +10:00
NeilBrown
b6d428c669 md: don't update metadata when stopping a read-only array.
read-only arrays should stay that way as much as possible.
Updating the metadata - which could be triggered by a re-add
while assembling the array metadata - should be avoided.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
NeilBrown
7ceb17e87b md: Allow devices to be re-added to a read-only array.
When assembling an array incrementally we might want to make
it device available when "enough" devices are present, but maybe
not "all" devices are present.
If the remaining devices appear before the array is actually used,
they should be added transparently.

We do this by using the "read-auto" mode where the array acts like
it is read-only until a write request arrives.

Current an add-device request switches a read-auto array to active.
This means that only one device can be added after the array is first
made read-auto.  This isn't a problem for RAID5, but is not ideal for
RAID6 or RAID10.
Also we don't really want to switch the array to read-auto at all
when re-adding a device as this doesn't really imply any change.

So:
 - remove the "md_update_sb()" call from add_new_disk().  This isn't
   really needed as just adding a disk doesn't require a metadata
   update.  Instead, just set MD_CHANGE_DEVS.  This will effect a
   metadata update soon enough, once the array is not read-only.

 - Allow the ADD_NEW_DISK ioctl to succeed without activating a
   read-auto array, providing the MD_DISK_SYNC flag is set.
   In this case, the device will be rejected if it cannot be added
   with the correct device number, or has an incorrect event count.

 - Teach remove_and_add_spares() to be careful about adding spares
   when the array is read-only (or read-mostly) - only add devices
   that are thought to be in-sync, and only do it if the array is
   in-sync itself.

 - In md_check_recovery, use remove_and_add_spares in the read-only
   case, rather than open coding just the 'remove' part of it.

Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
Martin Wilck
7e83ccbecd md/raid10: Allow skipping recovery when clean arrays are assembled
When an array is assembled incrementally with mdadm -I -R
and the array switches to "active" mode, md starts a recovery.

If the array was clean, the "fullsync" flag will be 0. Skip
the full recovery in this case, as RAID1 does (the code was
actually copied from the sync_request() method of RAID1).

Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
NeilBrown
c0b32972fb md/raid5: avoid an extra write when writing to a known-bad-block.
If we write to a known-bad-block it will be flags as having
a ReadError by analyse_stripe, but the write will proceed anyway
(as it should).  Then the read-error handling will kick in an
write again, then re-read.

We don't need that 'write-again', so set R5_ReWrite so it looks like
it has already been done.  Then we will just get the re-read, which we
want.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
majianpeng
6f608040ce md/raid5: Change or of some order to improve efficiency.
As the function call is the most expensive of these tests it should be
done later in the chain so that it can be avoided in some cases.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
Akinobu Mita
3f810b6c4a md: use set_bit_le and clear_bit_le
The value returned by test_and_set_bit_le() drivers/md/bitmap.c is not used.
So just use set_bit_le(). The same goes for test_and_clear_bit_le().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
NeilBrown
3ea8929da3 md: HOT_DISK_REMOVE shouldn't make a read-auto device active.
If a fail device or a spare is removed from an array, there is
not need to make the array 'active'.  If/when the array does become
active for some other reason the metadata will be update to reflect
the removal.
If that never happens and the array is stopped while still read-auto,
then there is no loss in forgetting the that the device had 'failed'.

A read-only array will leave failed devices attached to
the array personality, so we need to explicitly call
remove_and_add_spares() to free it (clearing Blocked just
like we do in store_slot()).

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
NeilBrown
746d3207ae md: use common code for all calls to ->hot_remove_disk()
slot_store and remove_and_add_spares both call ->hot_remove_disk(),
but with slightly different tests and consequences, which is
at least untidy and might be buggy.

So modify remove_and_add_spaces() so that it can be asked
to remove a specific device, and call it from slot_store().

We also clear the Blocked flag to ensure that doesn't prevent
removal.  The purpose of Blocked is to prevent automatic removal
by the kernel before an error is acknowledged.
If the array is read/write then user-space would have not reason
to remove a device unless it was known to be 'spare' or 'faulty' in
which it would have already cleared the Blocked flag.
If the array is read-only, the flag might still be blocked, but
there is no harm in clearing the flag for read-only arrays.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
NeilBrown
d87f064f58 md: never update metadata when array is read-only.
Normally we don't even try to update the metadata if
the array is read-only.  However future patches
will increase the number of things that can happen on a read-only
array, so it is safest to explicitly disable this.

Every time that mddev->ro is set to 0, either
 - md_update_sb will be called again (at least if MD_CHANGE_DEVS
   is set) or
 - the mddev->thread is scheduled, which will also run
   md_update_sb if needed.

So this is safe: if the array ever become read-write the
metadata will be updated.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:40 +10:00
Kent Overstreet
a09ded8edf bcache: Fix merge_bvec_fn usage for when it modifies the bvm
Stacked md devices reuse the bvm for the subordinate device, causing
problems...

Reported-by: Michael Balser <michael.balser@profitbricks.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-22 14:44:24 -07:00
Kent Overstreet
1545f13730 bcache: Correctly check against BIO_MAX_PAGES
bch_bio_max_sectors() was checking against BIO_MAX_PAGES as if the limit
was for the total bytes in the bio, not the number of segments.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:57:42 -07:00
Kent Overstreet
bca97adaf5 bcache: Hack around stuff that clones up to bi_max_vecs
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:57:41 -07:00
Kent Overstreet
4f0fd955cd bcache: Set ra_pages based on backing device's ra_pages
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:57:26 -07:00
Kent Overstreet
2903381fce bcache: Take data offset from the bdev superblock.
Add a new superblock version, and consolidate related defines.

Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:56:12 -07:00
Linus Torvalds
0a82a8d132 Revert "block: add missing block_bio_complete() tracepoint"
This reverts commit 3a366e614d.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert
  the commit for 3.9 final (which I'm assuming is pretty close).

  The wifi is crap at the LSF hotel, so sending this email instead of
  queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 09:00:26 -07:00
Kent Overstreet
cef5279735 bcache: Disable broken btree fuzz tester
Reported-by: <sasha.levin@oracle.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:49 -07:00
Kent Overstreet
91bbcfc361 bcache: Fix a format string overflow
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:49 -07:00
Kent Overstreet
8ef747909c bcache: Fix a minor memory leak on device teardown
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Kent Overstreet
cc0f4eaa61 bcache: Use WARN_ONCE() instead of __WARN()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Geert Uytterhoeven
cd953ed036 bcache: Add missing #include <linux/prefetch.h>
m68k/allmodconfig:

drivers/md/bcache/bset.c: In function ‘bset_search_tree’:
drivers/md/bcache/bset.c:727: error: implicit declaration of function ‘prefetch’

drivers/md/bcache/btree.c: In function ‘bch_btree_node_get’:
drivers/md/bcache/btree.c:933: error: implicit declaration of function ‘prefetch’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Kent Overstreet
c19ed23a0b bcache: Sparse fixes
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Mike Snitzer
19b0092e26 dm cache: reduce bio front_pad size in writeback mode
A recent patch to fix the dm cache target's writethrough mode extended
the bio's front_pad to include a 1056-byte struct dm_bio_details.
Writeback mode doesn't need this, so this patch reduces the
per_bio_data_size to 16 bytes in this case instead of 1096.

The dm_bio_details structure was added in "dm cache: fix writes to
cache device in writethrough mode" which fixed commit e2e74d617e ("dm
cache: fix race in writethrough implementation").  In writeback mode
we avoid allocating the writethrough-specific members of the
per_bio_data structure (the dm_bio_details structure included).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-04-05 15:36:34 +01:00
Darrick J. Wong
b844fe6918 dm cache: fix writes to cache device in writethrough mode
The dm-cache writethrough strategy introduced by commit e2e74d617e
("dm cache: fix race in writethrough implementation") issues a bio to
the origin device, remaps and then issues the bio to the cache device.
This more conservative in-series approach was selected to favor
correctness over performance (of the previous parallel writethrough).
However, this in-series implementation that reuses the same bio to write
both the origin and cache device didn't take into account that the block
layer's req_bio_endio() modifies a completing bio's bi_sector and
bi_size.  So the new writethrough strategy needs to preserve these bio
fields, and restore them before submission to the cache device,
otherwise nothing gets written to the cache (because bi_size is 0).

This patch adds a struct dm_bio_details field to struct per_bio_data,
and uses dm_bio_record() and dm_bio_restore() to ensure the bio is
restored before reissuing to the cache device.  Adding such a large
structure to the per_bio_data is not ideal but we can improve this
later, for now correctness is the important thing.

This problem initially went unnoticed because the dm-cache test-suite
uses a linear DM device for the dm-cache device's origin device.
Writethrough worked as expected because DM submits a *clone* of the
original bio, so the original bio which was reused for the cache was
never touched.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-04-05 15:36:32 +01:00
Jens Axboe
64f8de4da7 Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core
Tejun writes:

-----

This is the pull request for the earlier patchset[1] with the same
name.  It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.

* Because the conversion needs features from wq/for-3.10,
  block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
  with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
  workqueue conflicts from flaring up in block tree.

* Resolving the issue that Jan and Dave raised about debugging
  requires arch-wide changes.  The patchset is being worked on[2] but
  it'll have to go through -mm after these changes show up in -next,
  and not included in this pull request.

The three commits are located in the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue

Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.

  e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
  2f6db2a707 ("raid5: use bio_reset()")

The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it.  We just need to
remove both.  The merged branch is available at

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge

so that you can use it for verification.  The test merge commit has
proper merge description.

While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.

----

Fixed up the conflict.

Conflicts:
	drivers/md/raid5.c

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-02 10:04:39 +02:00
Kent Overstreet
169ef1cf61 bcache: Don't export utility code, prefix with bch_
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 12:50:55 -06:00
Kent Overstreet
29177b8966 bcache: Fix for the build fixes
Commit 82a84eaf7e51ba3da0c36cbc401034a4e943492d left a return 0 in
closure_debug_init(). Whoops.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 19:36:39 -06:00
Kent Overstreet
b1a67b0f4c bcache: Style/checkpatch fixes
Took out some nested functions, and fixed some more checkpatch
complaints.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 13:06:13 -06:00
Kent Overstreet
07e86ccb54 bcache: Build fixes from test robot
config: make ARCH=i386 allmodconfig

All error/warnings:

   drivers/md/bcache/bset.c: In function 'bch_ptr_bad':
>> drivers/md/bcache/bset.c:164:2: warning: format '%li' expects argument of type 'long int', but argument 4 has type 'size_t' [-Wformat]
--
   drivers/md/bcache/debug.c: In function 'bch_pbtree':
>> drivers/md/bcache/debug.c:86:4: warning: format '%li' expects argument of type 'long int', but argument 4 has type 'size_t' [-Wformat]
--
   drivers/md/bcache/btree.c: In function 'bch_btree_read_done':
>> drivers/md/bcache/btree.c:245:8: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'size_t' [-Wformat]
--
   drivers/md/bcache/closure.o: In function `closure_debug_init':
>> (.init.text+0x0): multiple definition of `init_module'
>> drivers/md/bcache/super.o:super.c:(.init.text+0x0): first defined here

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 13:06:13 -06:00
Kent Overstreet
cafe563591 bcache: A block layer cache
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.

See the wiki at http://bcache.evilpiepirate.org for more.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-03-23 16:11:31 -07:00
Linus Torvalds
22c3f2fff6 A few bugfixes for md
- recent regressions in raid5
  - recent regressions in dmraid
  - a few instances of CONFIG_MULTICORE_RAID456 linger
 
 Several tagged for -stable
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUUzCwDnsnt1WYoG5AQJKMhAAsi2XhqLC4Dx19J8MTF6+cjfynWCxF2SC
 3mMcVZm6yxSowixb1Ht72CyssWdJAi4vgaw0aLNH7b3CbPDZfTSfqLP4tSvyPfod
 aDcFDdd/RhHjDpJqZ52Tyc6QzBfyhwu+s9R+a78TSL47ZMjZpz1QpshG8Sm9JYTs
 z72VlIZeglzhWmzO1FInsL/oT/Hwr9IfpmJpuXBQQObDn3BgvZLuzZyCi35upqrM
 711ei7CKaN0s/jKcWclNRtgUrr10XsgQ6PugOZbli09CC8ushHwvXe/VmxoQFg2+
 Sj14YSfYAY+1QpOiuYc+knrWc7CtPGHgUqBzOoYWMxi9Lqpo5xhD1vkRsFhXxMSg
 GVnAnh/RXl7bGzGWaRv8twG4vU+qYOlEPNgO6/079AxCOrrNrstYrgjBxBSWuxrB
 0UIFQGT69zA5G3cLbIRrXUxO8oIVeEx92YV1TOcgLKP5OXlp/0I8ajnA9b8KoPZa
 He04GdPlZMXTLAqq9MaQRdS0XzX8YQDWbUebqe+w5NW46sLbckkmxaNZs7fOYAfG
 CNHfeRsLp5v0oNbhNyCDSjxqH6uYwKCdCqmDxo6A+fmjmDruHQmZoAK8YISUtPtx
 u4M82jW6Z/xOg4pomxMl4SxzCDhy1pM8PYzyx7Mj82C4XBR8CkrQTP8XD+FQL2Ih
 KhId4tJzx6Q=
 =Rycs
 -----END PGP SIGNATURE-----

Merge tag 'md-3.9-fixes' of git://neil.brown.name/md

Pull md fixes from NeilBrown:
 "A few bugfixes for md

   - recent regressions in raid5
   - recent regressions in dmraid
   - a few instances of CONFIG_MULTICORE_RAID456 linger

  Several tagged for -stable"

* tag 'md-3.9-fixes' of git://neil.brown.name/md:
  md: remove CONFIG_MULTICORE_RAID456 entirely
  md/raid5: ensure sync and DISCARD don't happen at the same time.
  MD: Prevent sysfs operations on uninitialized kobjects
  MD RAID5: Avoid accessing gendisk or queue structs when not available
  md/raid5: schedule_construction should abort if nothing to do.
2013-03-23 15:49:49 -07:00
Kent Overstreet
a07876064a block: Add bio_alloc_pages()
More utility code to replace stuff that's getting open coded.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
2013-03-23 14:26:31 -07:00
Kent Overstreet
cb34e057ad block: Convert some code to bio_for_each_segment_all()
More prep work for immutable bvecs:

A few places in the code were either open coding or using the wrong
version - fix.

After we introduce the bvec iter, it'll no longer be possible to modify
the biovec through bio_for_each_segment_all() - it doesn't increment a
pointer to the current bvec, you pass in a struct bio_vec (not a
pointer) which is updated with what the current biovec would be (taking
into account bi_bvec_done and bi_size).

So because of that it's more worthwhile to be consistent about
bio_for_each_segment()/bio_for_each_segment_all() usage.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Alexander Viro <viro@zeniv.linux.org.uk>
2013-03-23 14:26:30 -07:00
Kent Overstreet
d74c6d514f block: Add bio_for_each_segment_all()
__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bv_idx.  Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.

For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Neil Brown <neilb@suse.de>
CC: Boaz Harrosh <bharrosh@panasas.com>
2013-03-23 14:26:28 -07:00
Kent Overstreet
d3b45c2a05 raid1: use bio_copy_data()
This doesn't really delete any code _yet_, but once immutable bvecs are
done we can just delete the rest of the code in that loop.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
2013-03-23 14:15:38 -07:00
Kent Overstreet
b783863f68 raid1: Refactor narrow_write_error() to not use bi_idx
More bi_idx removal. This code was just open coding bio_clone(). This
could probably be further improved by using bio_advance() instead of
skipping over null pages, but that'd be a larger rework.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
2013-03-23 14:15:36 -07:00
Kent Overstreet
2f6db2a707 raid5: use bio_reset()
Had to shuffle the code around a bit (where bi_rw and bi_end_io were
set), but shouldn't really be anything tricky here

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
2013-03-23 14:15:35 -07:00
Kent Overstreet
2aabaa65ad raid1: use bio_reset()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
2013-03-23 14:15:34 -07:00
Kent Overstreet
8be185f2c9 raid10: Use bio_reset()
More prep work for immutable bio vecs, mainly getting rid of references
to bi_idx.

bio_reset was being open coded in a few places. The one in sync_request
was a bit nontrivial to convert, so could use some extra eyeballs.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
Acked-by: NeilBrown <neilb@suse.de>
2013-03-23 14:15:33 -07:00
Kent Overstreet
9e882242c6 block: Add submit_bio_wait(), remove from md
Random cleanup - this code was duplicated and it's not really specific
to md.

Also added the ability to return the actual error code.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
2013-03-23 14:15:32 -07:00
Kent Overstreet
4f2ac93c17 block: Remove bi_idx references
For immutable bvecs, all bi_idx usage needs to be audited - so here
we're removing all the unnecessary uses.

Most of these are places where it was being initialized on a bio that
was just allocated, a few others are conversions to standard macros.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
2013-03-23 14:15:31 -07:00
Kent Overstreet
5b83636ae3 block: Change bio_split() to respect the current value of bi_idx
In the current code bio_split() won't be seeing partially completed bios
so this doesn't change any behaviour, but this makes the code a bit
clearer as to what bio_split() actually requires.

The immediate purpose of the patch is removing unnecessary bi_idx
references, but the end goal is to allow partial completed bios to be
submitted, which along with immutable biovecs enables effecient bio
splitting.

Some of the callers were (double) checking that bios could be split, so
update their checks too.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Martin K. Petersen <martin.petersen@oracle.com>
2013-03-23 14:15:30 -07:00
Kent Overstreet
aa8b57aa3d block: Use bio_sectors() more consistently
Bunch of places in the code weren't using it where they could be -
this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: "Ed L. Cashin" <ecashin@coraid.com>
CC: Nick Piggin <npiggin@kernel.dk>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Jim Paris <jim@jtan.com>
CC: Geoff Levand <geoff@infradead.org>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ed Cashin <ecashin@coraid.com>
2013-03-23 14:15:30 -07:00
Kent Overstreet
f73a1c7d11 block: Add bio_end_sector()
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2013-03-23 14:15:29 -07:00
Kent Overstreet
fb9e353476 md: Convert md_trim_bio() to use bio_advance()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
Acked-by: NeilBrown <neilb@suse.de>
2013-03-23 14:15:28 -07:00
Mike Snitzer
ea2dd8c1ed dm cache: policy ignore hints if generated by different version
When reading the dm cache metadata from disk, ignore the policy hints
unless they were generated by the same major version number of the same
policy module.

The hints are considered to be private data belonging to the specific
module that generated them and there is no requirement for them to make
sense to different versions of the policy that generated them.
Policy modules are all required to work fine if no previous hints are
supplied (or if existing hints are lost).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:28 +00:00
Mike Snitzer
4e7f506f64 dm cache: policy change version from string to integer set
Separate dm cache policy version string into 3 unsigned numbers
corresponding to major, minor and patchlevel and store them at the end
of the on-disk metadata so we know which version of the policy generated
the hints in case a future version wants to use them differently.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:27 +00:00
Joe Thornber
e2e74d617e dm cache: fix race in writethrough implementation
We have found a race in the optimisation used in the dm cache
writethrough implementation.  Currently, dm core sends the cache target
two bios, one for the origin device and one for the cache device and
these are processed in parallel.  This patch avoids the race by
changing the code back to a simpler (slower) implementation which
processes the two writes in series, one after the other, until we can
develop a complete fix for the problem.

When the cache is in writethrough mode it needs to send WRITE bios to
both the origin and cache devices.

Previously we've been implementing this by having dm core query the
cache target on every write to find out how many copies of the bio it
wants.  The cache will ask for two bios if the block is in the cache,
and one otherwise.

Then main problem with this is it's racey.  At the time this check is
made the bio hasn't yet been submitted and so isn't being taken into
account when quiescing a block for migration (promotion or demotion).
This means a single bio may be submitted when two were needed because
the block has since been promoted to the cache (catastrophic), or two
bios where only one is needed (harmless).

I really don't want to start entering bios into the quiescing system
(deferred_set) in the get_num_write_bios callback.  Instead this patch
simplifies things; only one bio is submitted by the core, this is
first written to the origin and then the cache device in series.
Obviously this will have a latency impact.

deferred_writethrough_bios is introduced to record bios that must be
later issued to the cache device from the worker thread.  This deferred
submission, after the origin bio completes, is required given that we're
in interrupt context (writethrough_endio).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:27 +00:00
Joe Thornber
79ed9caffc dm cache: metadata clear dirty bits on clean shutdown
When writing the dirty bitset to the metadata device on a clean
shutdown, clear the dirty bits.  Previously they were left indicating
the cache was dirty. This led to confusion about whether there really
was dirty data in the cache or not.  (This was a harmless bug.)

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:27 +00:00
Heinz Mauelshagen
b978440b8d dm cache: avoid calling policy destructor twice on error
If the cache policy's config values are not able to be set we must
set the policy to NULL after destroying it in create_cache_policy()
so we don't attempt to destroy it a second time later.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:26 +00:00
Heinz Mauelshagen
617a0b89da dm cache: detect cache_create failure
Return error if cache_create() fails.

A missing return check made cache_ctr continue even after an error in
cache_create() resulting in the cache object being destroyed.  So a
simple failure like an odd number of cache policy config value arguments
would result in an oops.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:26 +00:00
Joe Thornber
414dd67d50 dm cache: avoid 64 bit division on 32 bit
Squash various 32bit link errors.

  >> on i386:
  >> drivers/built-in.o: In function `is_discarded_oblock':
  >> dm-cache-target.c:(.text+0x1ea28e): undefined reference to `__udivdi3'
  ...

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:25 +00:00
Mikulas Patocka
3b6b7813b1 dm verity: avoid deadlock
A deadlock was found in the prefetch code in the dm verity map
function.  This patch fixes this by transferring the prefetch
to a worker thread and skipping it completely if kmalloc fails.

If generic_make_request is called recursively, it queues the I/O
request on the current->bio_list without making the I/O request
and returns. The routine making the recursive call cannot wait
for the I/O to complete.

The deadlock occurs when one thread grabs the bufio_client
mutex and waits for an I/O to complete but the I/O is queued
on another thread's current->bio_list and is waiting to get
the mutex held by the first thread.

The fix recognises that prefetching is not essential.  If memory
can be allocated, it queues the prefetch request to the worker thread,
but if not, it does nothing.

Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2013-03-20 17:21:25 +00:00
Joe Thornber
58051b94e0 dm thin: fix non power of two discard granularity calc
Fix a discard granularity calculation to work for non power of 2 block sizes.

In order for thinp to passdown discard bios to the underlying data
device, the data device must have a discard granularity that is a
factor of the thinp block size.  Originally this check was done by
using bitops since the block_size was known to be a power of two.

Introduced by commit f13945d757
("dm thin: support a non power of 2 discard_granularity").

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:25 +00:00
Joe Thornber
f046f89a99 dm thin: fix discard corruption
Fix a bug in dm_btree_remove that could leave leaf values with incorrect
reference counts.  The effect of this was that removal of a shared block
could result in the space maps thinking the block was no longer used.
More concretely, if you have a thin device and a snapshot of it, sending
a discard to a shared region of the thin could corrupt the snapshot.

Thinp uses a 2-level nested btree to store it's mappings.  This first
level is indexed by thin device, and the second level by logical
block.

Often when we're removing an entry in this mapping tree we need to
rebalance nodes, which can involve shadowing them, possibly creating a
copy if the block is shared.  If we do create a copy then children of
that node need to have their reference counts incremented.  In this
way reference counts percolate down the tree as shared trees diverge.

The rebalance functions were incrementing the children at the
appropriate time, but they were always assuming the children were
internal nodes.  This meant the leaf values (in our case packed
block/flags entries) were not being incremented.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:24 +00:00
Paul Bolle
238f5908bd md: remove CONFIG_MULTICORE_RAID456 entirely
Once instance of this Kconfig macro remained after commit
51acbcec6c ("md: remove
CONFIG_MULTICORE_RAID456"). Remove that one too. And, while we're at it,
also remove it from the defconfig files that carry it.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:21:14 +11:00
NeilBrown
f8dfcffd04 md/raid5: ensure sync and DISCARD don't happen at the same time.
A number of problems can occur due to races between
resync/recovery and discard.

- if sync_request calls handle_stripe() while a discard is
  happening on the stripe, it might call handle_stripe_clean_event
  before all of the individual discard requests have completed
  (so some devices are still locked, but not all).
  Since commit ca64cae960
     md/raid5: Make sure we clear R5_Discard when discard is finished.
  this will cause R5_Discard to be cleared for the parity device,
  so handle_stripe_clean_event() will not be called when the other
  devices do become unlocked, so their ->written will not be cleared.
  This ultimately leads to a WARN_ON in init_stripe and a lock-up.

- If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward
  time for resync, it can lead to s->uptodate being less than disks
  in handle_parity_checks5(), which triggers a BUG (because it is
  one).

So:
 - keep R5_Discard on the parity device until all other devices have
   completed their discard request
 - make sure we don't try to have a 'discard' and a 'sync' action at
   the same time.
   This involves a new stripe flag to we know when a 'discard' is
   happening, and the use of R5_Overlap on the parity disk so when a
   discard is wanted while a sync is active, so we know to wake up
   the discard at the appropriate time.

Discard support for RAID5 was added in 3.7, so this is suitable for
any -stable kernel since 3.7.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:20:59 +11:00
Jonathan Brassow
90584fc93d MD: Prevent sysfs operations on uninitialized kobjects
MD: Prevent sysfs operations on uninitialized kobjects

Device-mapper does not use sysfs; but when device-mapper is leveraging
MD's RAID personalities, MD sometimes attempts to update sysfs.  This
patch adds checks for 'mddev-kobj.sd' in sysfs_[un]link_rdev to ensure
it is about to operate on something valid.  This patch also checks for
'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'.
Although 'sysfs_notify' already makes this check, doing so in
'remove_and_add_spares' prevents an additional mutex operation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:17:57 +11:00
Jonathan Brassow
e3620a3ad5 MD RAID5: Avoid accessing gendisk or queue structs when not available
MD RAID5:  Fix kernel oops when RAID4/5/6 is used via device-mapper

Commit a9add5d (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver.
However, when device-mapper is used to create RAID4/5/6 arrays, the
mddev->gendisk and mddev->queue fields are not setup.  Therefore, calling
things like trace_block_bio_remap will cause a kernel oops.  This patch
conditionalizes those calls on whether the proper fields exist to make
the calls.  (Device-mapper will call trace_block_bio_remap on its own.)

This patch is suitable for the 3.8.y stable kernel.

Cc: stable@vger.kernel.org (v3.8+)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:16:57 +11:00
NeilBrown
ce7d363aaf md/raid5: schedule_construction should abort if nothing to do.
Since commit 1ed850f356
    md/raid5: make sure to_read and to_write never go negative.

It has been possible for handle_stripe_dirtying to be called
when there isn't actually any work to do.
It then calls schedule_reconstruction() which will set R5_LOCKED
on the parity block(s) even when nothing else is happening.
This then causes problems in do_release_stripe().

So add checks to schedule_reconstruction() so that if it doesn't
find anything to do, it just aborts.

This bug was introduced in v3.7, so the patch is suitable
for -stable kernels since then.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 12:16:51 +11:00
Linus Torvalds
a5e0d73163 md updates for 3.9
mostly little bugfixes.
 Only "feature" is a new RAID10 layout which slightly
 improves the number of sets of devices that can concurrently
 fail, without data loss.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUTPm+znsnt1WYoG5AQLLsw/+PMqr8roC4twgxTWV1NRbU8NtOcRi9Rj9
 uvBS63uYAaLdi/D3UBKFYczmNCu9knuXbcp9SgFDxH7LlthQsWN/GYnif06pPo3w
 9Agu5M8c062TJEG1vrnX6FhPO6pNgrWFr3h+CKkTiD3179i9DoQpP8LXQToeyMtI
 YRMQf/zCkxYtDvWAP0iwsEWtw8cf+q9I/uGPhQ1L+DnZapXYdbtnqWBRz9q6mrDt
 orcGrP41aZHvnOHUaTbwmaorCKkf/Ys4SMaGenrSFpnpQMypt7VgNuwHC59LxvJT
 5eiFG/26zIsv7Wk0jv/TvFP5qzUPo0/PFkd5ug0ArvbVRiXS2cMJDwQvMdO1toxD
 i5Bb+P9DptadvoWhOTgIpxnG77yRH45wJvyJOk+ZfS1/IO87nCRa3d0yiNOU5e2/
 o0VdXPZRr72sdKKTK6kQuYfwCPb+Z2Pz6Q8BJdk6GxlmTXyP6sKhIgwUX86534fE
 LrOxfK8qV+GetVu3X02RoX2CyJJRQHXyXmbHuSzXuo/JiOYtDigAydwNZChvf+tf
 OoMY9K8vgNbhnGsUG6la7XPvZ+6dZMjdnxp2HB99Ml5A3PWZd75i5T6IHHxIQFbD
 C3z9PWTWP+hK4k15DEyjlELtsE9WduGTXG4kUcf328xJ/7lj4VIImVugdCz+1B6z
 +HlI6BiLwzY=
 =YdVD
 -----END PGP SIGNATURE-----

Merge tag 'md-3.9' of git://neil.brown.name/md

Pull md updates from NeilBrown:
 "Mostly little bugfixes.

  Only "feature" is a new RAID10 layout which slightly improves the
  number of sets of devices that can concurrently fail, without data
  loss."

* tag 'md-3.9' of git://neil.brown.name/md:
  md: expedite metadata update when switching  read-auto -> active
  md: remove CONFIG_MULTICORE_RAID456
  md/raid1,raid10: fix deadlock with freeze_array()
  md/raid0: improve error message when converting RAID4-with-spares to RAID0
  md: raid0: fix error return from create_stripe_zones.
  md: fix two bugs when attempting to resize RAID0 array.
  DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
  MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)
  MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
  MD RAID10: Minor non-functional code changes
  md: raid1,10: Handle REQ_WRITE_SAME flag in write bios
  md: protect against crash upon fsync on ro array
2013-03-05 17:22:08 -08:00
Heinz Mauelshagen
8735a81347 dm cache: add cleaner policy
A simple cache policy that writes back all data to the origin.

This is used to decommission a dm cache by emptying it.

Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:52 +00:00
Joe Thornber
f283635281 dm cache: add mq policy
A cache policy that uses a multiqueue ordered by recent hit
count to select which blocks should be promoted and demoted.
This is meant to be a general purpose policy.  It prioritises
reads over writes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber
c6b4fcbad0 dm: add cache target
Add a target that allows a fast device such as an SSD to be used as a
cache for a slower device such as a disk.

A plug-in architecture was chosen so that the decisions about which data
to migrate and when are delegated to interchangeable tunable policy
modules.  The first general purpose module we have developed, called
"mq" (multiqueue), follows in the next patch.  Other modules are
under development.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber
7a87edfee7 dm persistent data: add bitset
Add a persistent bitset as a wrapper around dm-array.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber
6513c29f44 dm persistent data: add transactional array
Add a transactional array.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber
025b96853f dm thin: remove cells from stack
This patch takes advantage of the new bio-prison interface where the
memory is now passed in rather than using a mempool in bio-prison.
This allows the map function to avoid performing potentially-blocking
allocations that could lead to deadlocks: We want to avoid the cell
allocation that is done in bio_detain.

(The potential for mempool deadlocks still remains in other functions
that use bio_detain.)

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:50 +00:00
Joe Thornber
6beca5eb6e dm bio prison: pass cell memory in
Change the dm_bio_prison interface so that instead of allocating memory
internally, dm_bio_detain is supplied with a pre-allocated cell each
time it is called.

This enables a subsequent patch to move the allocation of the struct
dm_bio_prison_cell outside the thin target's mapping function so it can
no longer block there.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:50 +00:00
Joe Thornber
4e7f1f9089 dm persistent data: add btree_walk
Add dm_btree_walk to iterate through the contents of a btree.
This will be used by the dm cache target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:50 +00:00
Alasdair G Kergon
b0d8ed4d96 dm: add target num_write_bios fn
Add a num_write_bios function to struct target.

If an instance of a target sets this, it will be queried before the
target's mapping function is called on a write bio, and the response
controls the number of copies of the write bio that the target will
receive.

This provides a convenient way for a target to send the same data to
more than one device.  The new cache target uses this in writethrough
mode, to send the data both to the cache and the backing device.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka
df5d2e9089 dm kcopyd: introduce configurable throttling
This patch allows the administrator to reduce the rate at which kcopyd
issues I/O.

Each module that uses kcopyd acquires a throttle parameter that can be
set in /sys/module/*/parameters.

We maintain a history of kcopyd usage by each module in the variables
io_period and total_period in struct dm_kcopyd_throttle. The actual
kcopyd activity is calculated as a percentage of time equal to
"(100 * io_period / total_period)".  This is compared with the user-defined
throttle percentage threshold and if it is exceeded, we sleep.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka
a26062416e dm ioctl: allow message to return data
This patch introduces enhanced message support that allows the
device-mapper core to recognise messages that are common to all devices,
and for messages to return data to userspace.

Core messages are processed by the function "message_for_md".  If the
device mapper doesn't support the message, it is passed to the target
driver.

If the message returns data, the kernel sets the flag
DM_MESSAGE_OUT_FLAG.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka
02cde50b7e dm ioctl: optimize functions without variable params
Device-mapper ioctls receive and send data in a buffer supplied
by userspace.  The buffer has two parts.  The first part contains
a 'struct dm_ioctl' and has a fixed size.  The second part depends
on the ioctl and has a variable size.

This patch recognises the specific ioctls that do not use the variable
part of the buffer and skips allocating memory for it.

In particular, when a device is suspended and a resume ioctl is sent,
this now avoid memory allocation completely.

The variable "struct dm_ioctl tmp" is moved from the function
copy_params to its caller ctl_ioctl and renamed to param_kernel.
It is used directly when the ioctl function doesn't need any arguments.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka
e2914cc26b dm ioctl: introduce ioctl_flags
This patch introduces flags for each ioctl function.

So far, one flag is defined, IOCTL_FLAGS_NO_PARAMS.  It is set if the
function processing the ioctl doesn't take or produce any parameters in
the section of the data buffer that has a variable size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00
Jun'ichi Nomura
5f01520415 dm: merge io_pool and tio_pool
This patch merges io_pool and tio_pool into io_pool and cleans up
related functions.

Though device-mapper used to have 2 pools of objects for each dm device,
the use of bioset frontbad for per-bio data has shrunk the number of
pools to 1 for both bio-based and request-based device types.
(See c0820cf5 "dm: introduce per_bio_data" and
 94818742 "dm: Use bioset's front_pad for dm_rq_clone_bio_info")

So dm no longer has to maintain 2 different pointers.

No functional changes.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00
Jun'ichi Nomura
23e5083b4d dm: remove unused _rq_bio_info_cache
Remove _rq_bio_info_cache, which is no longer used.
No functional changes.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00
Mike Christie
87eb5b21d9 dm: fix limits initialization when there are no data devices
dm_calculate_queue_limits will first reset the provided limits to
defaults using blk_set_stacking_limits; whereby defeating the purpose of
retaining the original live table's limits -- as was intended via commit
3ae7065616 ("dm: retain table limits when
swapping to new table with no devices").

Fix this improper limits initialization (in the no data devices case) by
avoiding the call to dm_calculate_queue_limits.

[patch header revised by Mike Snitzer]

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.6+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00
Mikulas Patocka
23cb21092e dm snapshot: add missing module aliases
Add module aliases so that autoloading works correctly if the user
tries to activate "snapshot-origin" or "snapshot-merge" targets.

Reference: https://bugzilla.redhat.com/889973

Reported-by: Chao Yang <chyang@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:47 +00:00
Mike Snitzer
018cede93c dm persistent data: set some btree fn parms const
Mark some constant parameters constant in some dm-btree functions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:47 +00:00
Alasdair G Kergon
e4c938111f dm: refactor bio cloning
Refactor part of the bio splitting and cloning code to try to make it
easier to understand.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:47 +00:00
Alasdair G Kergon
14fe594d67 dm: rename bio cloning functions
Rename functions involved in splitting and cloning bios.

The sequence of functions is now:
  (1) __split_and_process* - entry point that selects the processing strategy
  (2) __send* - prepare the details for each bio needed and loop through them
  (3) __clone_and_map* - creates a clone and maps it

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:47 +00:00
Alasdair G Kergon
55a62eef8d dm: rename request variables to bios
Use 'bio' in the name of variables and functions that deal with
bios rather than 'request' to avoid confusion with the normal
block layer use of 'request'.

No functional changes.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:47 +00:00
Alasdair G Kergon
bd2a49b86d dm: clean up clone_bio
Remove the no-longer-used struct bio_set argument from clone_bio and split_bvec.
Use tio->ti in __map_bio() instead of passing in ti.
Factor out some code for setting up cloned bios.
Take target_request_nr as a parameter to alloc_tio().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:46 +00:00
Kees Cook
88ae4c5294 dm persistent data: remove CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:46 +00:00
Alasdair G Kergon
d57916a00f dm: remove CONFIG_EXPERIMENTAL
Remove EXPERIMENTAL from all existing device-mapper targets.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:46 +00:00
Mike Snitzer
58f77a2196 dm thin: use block_size_is_power_of_two
Use block_size_is_power_of_two() rather than checking
sectors_per_block_shift directly.  Also introduce local pool variable in
get_bio_block() to eliminate redundant tc->pool dereferences.

No functional change.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:45 +00:00
Mikulas Patocka
3daec3b447 dm bufio: use WRITE_FLUSH instead of REQ_FLUSH
Use WRITE_FLUSH instead of REQ_FLUSH for submitted requests to make it
consistent with the rest of the kernel. There is no functional change.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:45 +00:00
Wang Sheng-Hui
d2ce70a119 dm table: remove superfluous variable reset
If allocation fails, the local var *t is not used any more after kfree.
Don't need to reset it to NULL. Remove the unnecesary NULL set here.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:45 +00:00
Mike Snitzer
f13945d757 dm thin: support a non power of 2 discard_granularity
Support a non-power-of-2 discard granularity in dm-thin, now that the block
layer supports this(via 8dd2cb7e88 "block:
discard granularity might not be power of 2" and
59771079c1 "blk: avoid divide-by-zero with zero
discard granularity").

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:44 +00:00
Mikulas Patocka
fd7c092e71 dm: fix truncated status strings
Avoid returning a truncated table or status string instead of setting
the DM_BUFFER_FULL_FLAG when the last target of a table fills the
buffer.

When processing a table or status request, the function retrieve_status
calls ti->type->status. If ti->type->status returns non-zero,
retrieve_status assumes that the buffer overflowed and sets
DM_BUFFER_FULL_FLAG.

However, targets don't return non-zero values from their status method
on overflow. Most targets returns always zero.

If a buffer overflow happens in a target that is not the last in the
table, it gets noticed during the next iteration of the loop in
retrieve_status; but if a buffer overflow happens in the last target, it
goes unnoticed and erroneously truncated data is returned.

In the current code, the targets behave in the following way:
* dm-crypt returns -ENOMEM if there is not enough space to store the
  key, but it returns 0 on all other overflows.
* dm-thin returns errors from the status method if a disk error happened.
  This is incorrect because retrieve_status doesn't check the error
  code, it assumes that all non-zero values mean buffer overflow.
* all the other targets always return 0.

This patch changes the ti->type->status function to return void (because
most targets don't use the return code). Overflow is detected in
retrieve_status: if the status method fills up the remaining space
completely, it is assumed that buffer overflow happened.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:44 +00:00
Jun'ichi Nomura
16245bdc9d dm: do not replace bioset for request based dm
This patch fixes a regression introduced in v3.8, which causes oops
like this when dm-multipath is used:

general protection fault: 0000 [#1] SMP
RIP: 0010:[<ffffffff810fe754>]  [<ffffffff810fe754>] mempool_free+0x24/0xb0
Call Trace:
  <IRQ>
  [<ffffffff81187417>] bio_put+0x97/0xc0
  [<ffffffffa02247a5>] end_clone_bio+0x35/0x90 [dm_mod]
  [<ffffffff81185efd>] bio_endio+0x1d/0x30
  [<ffffffff811f03a3>] req_bio_endio.isra.51+0xa3/0xe0
  [<ffffffff811f2f68>] blk_update_request+0x118/0x520
  [<ffffffff811f3397>] blk_update_bidi_request+0x27/0xa0
  [<ffffffff811f343c>] blk_end_bidi_request+0x2c/0x80
  [<ffffffff811f34d0>] blk_end_request+0x10/0x20
  [<ffffffffa000b32b>] scsi_io_completion+0xfb/0x6c0 [scsi_mod]
  [<ffffffffa000107d>] scsi_finish_command+0xbd/0x120 [scsi_mod]
  [<ffffffffa000b12f>] scsi_softirq_done+0x13f/0x160 [scsi_mod]
  [<ffffffff811f9fd0>] blk_done_softirq+0x80/0xa0
  [<ffffffff81044551>] __do_softirq+0xf1/0x250
  [<ffffffff8142ee8c>] call_softirq+0x1c/0x30
  [<ffffffff8100420d>] do_softirq+0x8d/0xc0
  [<ffffffff81044885>] irq_exit+0xd5/0xe0
  [<ffffffff8142f3e3>] do_IRQ+0x63/0xe0
  [<ffffffff814257af>] common_interrupt+0x6f/0x6f
  <EOI>
  [<ffffffffa021737c>] srp_queuecommand+0x8c/0xcb0 [ib_srp]
  [<ffffffffa0002f18>] scsi_dispatch_cmd+0x148/0x310 [scsi_mod]
  [<ffffffffa000a38e>] scsi_request_fn+0x31e/0x520 [scsi_mod]
  [<ffffffff811f1e57>] __blk_run_queue+0x37/0x50
  [<ffffffff811f1f69>] blk_delay_work+0x29/0x40
  [<ffffffff81059003>] process_one_work+0x1c3/0x5c0
  [<ffffffff8105b22e>] worker_thread+0x15e/0x440
  [<ffffffff8106164b>] kthread+0xdb/0xe0
  [<ffffffff8142db9c>] ret_from_fork+0x7c/0xb0

The regression was introduced by the change
c0820cf5 "dm: introduce per_bio_data", where dm started to replace
bioset during table replacement.
For bio-based dm, it is good because clone bios do not exist during the
table replacement.
For request-based dm, however, (not-yet-mapped) clone bios may stay in
request queue and survive during the table replacement.
So freeing the old bioset could cause the oops in bio_put().

Since the size of front_pad may change only with bio-based dm,
it is not necessary to replace bioset for request-based dm.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:44 +00:00
Linus Torvalds
ee89f81252 Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block
Pull block IO core bits from Jens Axboe:
 "Below are the core block IO bits for 3.9.  It was delayed a few days
  since my workstation kept crashing every 2-8h after pulling it into
  current -git, but turns out it is a bug in the new pstate code (divide
  by zero, will report separately).  In any case, it contains:

   - The big cfq/blkcg update from Tejun and and Vivek.

   - Additional block and writeback tracepoints from Tejun.

   - Improvement of the should sort (based on queues) logic in the plug
     flushing.

   - _io() variants of the wait_for_completion() interface, using
     io_schedule() instead of schedule() to contribute to io wait
     properly.

   - Various little fixes.

  You'll get two trivial merge conflicts, which should be easy enough to
  fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
  block: remove redundant check to bd_openers()
  block: use i_size_write() in bd_set_size()
  cfq: fix lock imbalance with failed allocations
  drivers/block/swim3.c: fix null pointer dereference
  block: don't select PERCPU_RWSEM
  block: account iowait time when waiting for completion of IO request
  sched: add wait_for_completion_io[_timeout]
  writeback: add more tracepoints
  block: add block_{touch|dirty}_buffer tracepoint
  buffer: make touch_buffer() an exported function
  block: add @req to bio_{front|back}_merge tracepoints
  block: add missing block_bio_complete() tracepoint
  block: Remove should_sort judgement when flush blk_plug
  block,elevator: use new hashtable implementation
  cfq-iosched: add hierarchical cfq_group statistics
  cfq-iosched: collect stats from dead cfqgs
  cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
  block: RCU free request_queue
  blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  ...
2013-02-28 12:52:24 -08:00
Sasha Levin
b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Tejun Heo
c9d76be696 dm: convert to idr_alloc()
Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:17 -08:00
Tejun Heo
adaedbd9fe dm: don't use idr_remove_all()
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:13 -08:00
NeilBrown
f3378b4870 md: expedite metadata update when switching read-auto -> active
If something has failed while the array was read-auto,
then when we switch to 'active' we need to update the metadata.
This will happen anyway but it is good to expedite it, and
also to ensure any failed device has been released by the
underlying device before we try to action the ioctl which
caused us to switch to 'active' mode.

Reported-by: Joe Lawrence <Joe.Lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-28 11:59:03 +11:00
NeilBrown
51acbcec6c md: remove CONFIG_MULTICORE_RAID456
This doesn't seem to actually help and we have an alternate
multi-threading approach waiting in the wings, so just get
rid of this config option and associated code.

As a bonus, we remove one use of CONFIG_EXPERIMENTAL

Cc: Dan Williams <djbw@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-28 09:08:34 +11:00
Linus Torvalds
d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
NeilBrown
ee0b024403 md/raid1,raid10: fix deadlock with freeze_array()
When raid1/raid10 needs to fix a read error, it first drains
all pending requests by calling freeze_array().
This calls flush_pending_writes() if it needs to sleep,
but some writes may be pending in a per-process plug rather
than in the per-array request queue.

When raid1{,0}_unplug() moves the request from the per-process
plug to the per-array request queue (from which
flush_pending_writes() can flush them), it needs to wake up
freeze_array(), or freeze_array() will never flush them and so
it will block forever.

So add the requires wake_up() calls.

This bug was introduced by commit
   f54a9d0e59
for raid1 and a similar commit for RAID10, and so has been present
since linux-3.6.  As the bug causes a deadlock I believe this fix is
suitable for -stable.

Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y)
Reported-by: Tregaron Bayly <tbayly@bluehost.com>
Tested-by: Tregaron Bayly <tbayly@bluehost.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:58:50 +11:00
NeilBrown
f96c9f305c md/raid0: improve error message when converting RAID4-with-spares to RAID0
Mentioning "bad disk number -1" exposes irrelevant internal detail.
Just say they are inactive and must be removed.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:58:44 +11:00
NeilBrown
58ebb34c49 md: raid0: fix error return from create_stripe_zones.
Create_stripe_zones returns an error slightly differently to
raid0_run and to raid0_takeover_*.

The error returned used by the second was wrong and an error would
result in mddev->private being set to NULL and sooner or later a
crash.

So never return NULL, return ERR_PTR(err), not NULL from
create_stripe_zones.

This bug has been present since 2.6.35 so the fix is suitable
for any kernel since then.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:57:04 +11:00
NeilBrown
a646853991 md: fix two bugs when attempting to resize RAID0 array.
You cannot resize a RAID0 array (in terms of making the devices
bigger), but the code doesn't entirely stop you.
So:

 disable setting of the available size on each device for
 RAID0 and Linear devices.  This must not change as doing so
 can change the effective layout of data.

 Make sure that the size that raid0_size() reports is accurate,
 but rounding devices sizes to chunk sizes.  As the device sizes
 cannot change now, this isn't so important, but it is best to be
 safe.

Without this change:
  mdadm --grow /dev/md0 -z max
  mdadm --grow /dev/md0 -Z max
  then read to the end of the array

can cause a BUG in a RAID0 array.

These bugs have been present ever since it became possible
to resize any device, which is a long time.  So the fix is
suitable for any -stable kerenl.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:55:40 +11:00
Jonathan Brassow
fe5d2f4a15 DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
DM RAID:  Add support for MD's RAID10 "far" and "offset" algorithms

Until now, dm-raid.c only supported the "near" algorthm of MD's RAID10
implementation.  This patch adds support for the "far" and "offset"
algorithms, but only with the improved redundancy that is brought with
the introduction of the 'use_far_sets' bit, which shifts copied stripes
according to smaller sets vs the entire array.  That is, the 17th bit
of the 'layout' variable that defines the RAID10 implementation will
always be set.   (More information on how the 'layout' variable selects
the RAID10 algorithm can be found in the opening comments of
drivers/md/raid10.c.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:55:36 +11:00
Jonathan Brassow
9a3152ab02 MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)
MD RAID10:  Improve redundancy for 'far' and 'offset' algorithms (part 2)

This patch addresses raid arrays that have a number of devices that cannot
be evenly divided by 'far_copies'.  (E.g. 5 devices, far_copies = 2)  This
case must be handled differently because it causes that last set to be of
a different size than the rest of the sets.  We must compute a new modulo
for this last set so that copied chunks are properly wrapped around.

Example use_far_sets=1, far_copies=2, near_copies=1, devices=5:
                "far" algorithm
        dev1 dev2 dev3 dev4 dev5
	==== ==== ==== ==== ====
	[ A   B ] [ C    D   E ]
        [ G   H ] [ I    J   K ]
                    ...
        [ B   A ] [ E    C   D ] --> nominal set of 2 and last set of 3
        [ H   G ] [ K    I   J ]     []'s show far/offset sets

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:55:33 +11:00
Jonathan Brassow
475901aff1 MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
widths - copying them to a different location on the same devices after
shifting the stripe.  An example layout of each follows below:

	        "far" algorithm
	dev1 dev2 dev3 dev4 dev5 dev6
	==== ==== ==== ==== ==== ====
	 A    B    C    D    E    F
	 G    H    I    J    K    L
	            ...
	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
	 L    G    H    I    J    K
	            ...

		"offset" algorithm
	dev1 dev2 dev3 dev4 dev5 dev6
	==== ==== ==== ==== ==== ====
	 A    B    C    D    E    F
	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
	 G    H    I    J    K    L
	 L    G    H    I    J    K
	            ...

Redundancy for these algorithms is gained by shifting the copied stripes
one device to the right.  This patch proposes that array be divided into
sets of adjacent devices and when the stripe copies are shifted, they wrap
on set boundaries rather than the array size boundary.  That is, for the
purposes of shifting, the copies are confined to their sets within the
array.  The sets are 'near_copies * far_copies' in size.

The above "far" algorithm example would change to:
	        "far" algorithm
	dev1 dev2 dev3 dev4 dev5 dev6
	==== ==== ==== ==== ==== ====
	 A    B    C    D    E    F
	 G    H    I    J    K    L
	            ...
	 B    A    D    C    F    E  --> Copy of stripe0, shifted 1, 2-dev sets
	 H    G    J    I    L    K      Dev sets are 1-2, 3-4, 5-6
	            ...

This has the affect of improving the redundancy of the array.  We can
always sustain at least one failure, but sometimes more than one can
be handled.  In the first examples, the pairs of devices that CANNOT fail
together are:
	(1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
In the example where the copies are confined to sets, the pairs of
devices that cannot fail together are:
	(1,2) (3,4) (5,6)                    [20% of possible pairs]

We cannot simply replace the old algorithms, so the 17th bit of the 'layout'
variable is used to indicate whether we use the old or new method of computing
the shift.  (This is similar to the way the 16th bit indicates whether the
"far" algorithm or the "offset" algorithm is being used.)

This patch only handles the cases where the number of total raid disks is
a multiple of 'far_copies'.  A follow-on patch addresses the condition where
this is not true.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:55:30 +11:00
Jonathan Brassow
4c0ca26bd2 MD RAID10: Minor non-functional code changes
Changes include assigning 'addr' from 's' instead of 'sector' to be
consistent with the way the code does it just a few lines later and
using '%=' vs a conditional and subtraction.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:55:27 +11:00
Joe Lawrence
c8dc9c6547 md: raid1,10: Handle REQ_WRITE_SAME flag in write bios
Set mddev queue's max_write_same_sectors to its chunk_sector value (before
disk_stack_limits merges the underlying disk limits.)  With that in place,
be sure to handle writes coming down from the block layer that have the
REQ_WRITE_SAME flag set.  That flag needs to be copied into any newly cloned
write bio.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26 11:55:21 +11:00
Andrew Morton
df8557982f drivers/md/persistent-data/dm-transaction-manager.c: rename HASH_SIZE
Fix the warning:

  drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
  In file included from include/linux/elevator.h:5,
                   from include/linux/blkdev.h:216,
                   from drivers/md/persistent-data/dm-block-manager.h:11,
                   from drivers/md/persistent-data/dm-transaction-manager.h:10,
                   from drivers/md/persistent-data/dm-transaction-manager.c:6:
  include/linux/hashtable.h:22:1: warning: this is the location of the previous definition

Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:08 -08:00
Al Viro
496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Sebastian Riemer
bbfa57c0f2 md: protect against crash upon fsync on ro array
If an fsync occurs on a read-only array, we need to send a
completion for the IO and may not increment the active IO count.
Otherwise, we hit a bug trace and can't stop the MD array anymore.

By advice of Christoph Hellwig we return success upon a flush
request but we return -EROFS for other writes.
We detect flush requests by checking if the bio has zero sectors.

This patch is suitable to any -stable kernel to which it applies.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Paul Menzel <paulepanter@users.sourceforge.net>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-21 13:28:09 +11:00
Linus Torvalds
cc6c954a07 A fix for stacked dm thin devices and a fix for the new dm WRITE SAME
support.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRCn+IAAoJEK2W1qbAHj1naQEP/2eXMOslRyws7M6CcsEgpEK9
 N2L2hf6bD3xF/04ZSLbHFI6hPe9wDXSL9Vxd+DRLYTSnc0E9WYBXHmE6Eb0L0xK8
 m0Iubk/hi7mk6mnMJtpTFT5pazBTPhVz0nXOijguh5U6PW0xL+4ypXe9nrH2jtW0
 DvEHFDIPbKcqwplm8nvo/QJ5O3YNQaMifKUtpXF/JWGlCYP4vPk0dJVg9ATbscEV
 Fg3kefoJzZM09q3Uvo01wigbj+wRkpBK9+CiyW6XcE0lkOAnFmpvyYerIoAHAK37
 Rsw5J4aMPA9U8mggBEtlHBWa0q5utZafHM11lT2ZeFGCXkdn+TSWni6O7ov54xPP
 Cd7jx+uNpe/OuLT5YjbCg2IMXgJs+zIZMSeqSj3SrywE0a0EQHECWiXaFmMmrCCJ
 TgZtmp/HS1UsdoiHA3v3ZX3AaX4W+mggYp/5md9P1vHyYS9uTlgSVplhwtVgsL23
 EsDxNNxODSIFMAMnrXxAV+NBPiQRY42K22hK/RrnWew9roAQHxroIvQDmzzm5ZRL
 BqCFW3w/x2loJOZZ6NH/J8IUEoF9RhCK1tOGVjFuAVn30srt3zXb4pOzYeydykT7
 m04HaGO7rCBJI75XdVDhm6ozOvV/GhXF2fJOt4qyoX/X6M8YN5i0jfwC3rhKeuOe
 U9fyyYoQV37EWIRI9K5Q
 =qtaF
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull more device-mapper fixes from Alasdair G Kergon:
 "A fix for stacked dm thin devices and a fix for the new dm WRITE SAME
  support."

* tag 'dm-3.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: fix write same requests counting
  dm thin: fix queue limits stacking
2013-02-01 12:04:22 +11:00
Alasdair G Kergon
fe7af2d3ba dm: fix write same requests counting
When processing write same requests, fix dm to send the configured
number of WRITE SAME requests to the target rather than the number of
discards, which is not always the same.

Device-mapper WRITE SAME support was introduced by commit
23508a96cd ("dm: add WRITE SAME support").

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
2013-01-31 14:23:36 +00:00
Mike Snitzer
0f640dca08 dm thin: fix queue limits stacking
thin_io_hints() is blindly copying the queue limits from the thin-pool
which can lead to incorrect limits being set.  The fix here simply
deletes the thin_io_hints() hook which leaves the existing stacking
infrastructure to set the limits correctly.

When a thin-pool uses an MD device for the data device a thin device
from the thin-pool must respect MD's constraints about disallowing a bio
from spanning multiple chunks.  Otherwise we can see problems.  If the raid0
chunksize is 1152K and thin-pool chunksize is 256K I see the following
md/raid0 error (with extra debug tracing added to thin_endio) when
mkfs.xfs is executed against the thin device:

md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0

This extra DM debugging shows that the failing bio is spanning across
the first and second logical 1152K chunk (sector 2080 + 255 takes the
bio beyond the first chunk's boundary of sector 2304).  So the bio
splitting that DM is doing clearly isn't respecting the MD limits.

max_hw_sectors_kb is 127 for both the thin-pool and thin device
(queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
precision).  So this explains why bi_size is 130560.

But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
that it doesn't have a .merge function (for bio_add_page to consult
indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
device that has a compulsory merge_bvec_fn.  This scenario is exactly
why DM must resort to sending single PAGE_SIZE bios to the underlying
layer. Some additional context for this is available in the header for
commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries").

Long story short, the reason a thin device doesn't properly get
configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
thin_io_hints() is blindly copying the queue limits from the thin-pool
device directly to the thin device's queue limits.

Fix this by eliminating thin_io_hints.  Doing so is safe because the
block layer's queue limits stacking already enables the upper level thin
device to inherit the thin-pool device's discard and minimum_io_size and
optimal_io_size limits that get set in pool_io_hints.  But avoiding the
queue limits copy allows the thin and thin-pool limits to be different
where it is important, namely max_hw_sectors_kb.

Reported-by: Daniel Browning <db@kavod.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-01-31 14:11:14 +00:00
Jonathan Brassow
55ebbb59c1 DM-RAID: Fix RAID10's check for sufficient redundancy
Before attempting to activate a RAID array, it is checked for sufficient
redundancy.  That is, we make sure that there are not too many failed
devices - or devices specified for rebuild - to undermine our ability to
activate the array.  The current code performs this check twice - once to
ensure there were not too many devices specified for rebuild by the user
('validate_rebuild_devices') and again after possibly experiencing a failure
to read the superblock ('analyse_superblocks').  Neither of these checks are
sufficient.  The first check is done properly but with insufficient
information about the possible failure state of the devices to make a good
determination if the array can be activated.  The second check is simply
done wrong in the case of RAID10 because it doesn't account for the
independence of the stripes (i.e. mirror sets).  The solution is to use the
properly written check ('validate_rebuild_devices'), but perform the check
after the superblocks have been read and we know which devices have failed.
This gives us one check instead of two and performs it in a location where
it can be done right.

Only RAID10 was affected and it was affected in the following ways:
- the code did not properly catch the condition where a user specified
  a device for rebuild that already had a failed device in the same mirror
  set.  (This condition would, however, be caught at a deeper level in MD.)
- the code triggers a false positive and denies activation when devices in
  independent mirror sets have failed - counting the failures as though they
  were all in the same set.

The most likely place this error was introduced (or this patch should have
been included) is in commit 4ec1e369 - first introduced in v3.7-rc1.
Consequently this fix should also go in v3.7.y, however there is a
small conflict on the .version in raid_target, so I'll submit a
separate patch to -stable.

Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-01-24 12:02:36 +11:00
Tejun Heo
3a366e614d block: add missing block_bio_complete() tracepoint
bio completion didn't kick block_bio_complete TP.  Only dm was
explicitly triggering the TP on IO completion.  This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.

This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.

* Explicit trace_block_bio_complete() invocation removed from dm and
  the trace point is unexported.

* @rq dropped from trace_block_bio_complete().  bios may fly around
  w/o queue associated.  Verifying and accessing the assocaited queue
  belongs to TP probes.

* blktrace now gets both request and bio completions.  Make it ignore
  bio completions if request completion path is happening.

This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.

v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev.  Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14 15:00:36 +01:00
Linus Torvalds
b49249d103 Miscellaneous device-mapper fixes, cleanups and performance improvements.
Of particular note:
 - Disable broken WRITE SAME support in all targets except linear and striped.
   Use it when kcopyd is zeroing blocks.
 - Remove several mempools from targets by moving the data into the bio's new
   front_pad area(which dm calls 'per_bio_data').
 - Fix a race in thin provisioning if discards are misused.
 - Prevent userspace from interfering with the ioctl parameters and
   use kmalloc for the data buffer if it's small instead of vmalloc.
 - Throttle some annoying error messages when I/O fails.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ1M5OAAoJEK2W1qbAHj1nrQcP/itnnAw8RNsSHBrFMrL9wVnB
 5dmZ1BXPZmEbG+ViU4wzVmRUPHuSHTwhIqH7UFPjyCgbWaz1jaXfpyIxBsxlJi4E
 zuGjv46akANMwH0o/aJDRuEIrCnMtjLrMiY2Oq00lJFvATurwYAKSIgmnwRVdAYy
 gDehJhaymNtHVjhymu33xEn/hqqkQtUbMDj9o+IZppmAw1aQyNuYnwQu3HvcETuz
 /JBcs8isXKIQMJdMLFdGg7lZjLO241UvSwCAeGycKkupHLaYfycumPywgdiNFVUg
 L6pQP9RtAQ+H2VBQ1OIVMJxqiXxQ0xHhyxUYIe3reTar+RXoMA0yK+FiJTwSY1cE
 Xk0s8x2DXwUyu3Vx7UmvgUXnMgd4TIPITYBYiOAanEF/8Xt0voZn8mzNyyzsyFXy
 0u1vMRK+ZK7+QPio9LRh7bgHNK1g5ZyShvwqTMDmtlp+uskaP4iHDDGtVUjFA+Wf
 r9Ms0CXPbXIN6laUIT/4L3LJZtyRWB6e8wuCrUWIWWRbjrMPaPnB+/NlckGJ0CHa
 P/5r1rmLdneTEZ8Vx/2g3fBJ+H2uNQKhYujjnE0HqtHP+tvjt7ernibyU2QhNBeE
 Zy0PXRatY0Xn7UFpn44uJ2qxkWaO5Dloaa4HkWdlWFdR3f/u5MzVjy5mDXLUxkGq
 wj2Z3YkjYjy948MViBhD
 =yzhS
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull dm update from Alasdair G Kergon:
 "Miscellaneous device-mapper fixes, cleanups and performance
  improvements.

  Of particular note:
   - Disable broken WRITE SAME support in all targets except linear and
     striped.  Use it when kcopyd is zeroing blocks.
   - Remove several mempools from targets by moving the data into the
     bio's new front_pad area(which dm calls 'per_bio_data').
   - Fix a race in thin provisioning if discards are misused.
   - Prevent userspace from interfering with the ioctl parameters and
     use kmalloc for the data buffer if it's small instead of vmalloc.
   - Throttle some annoying error messages when I/O fails."

* tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
  dm stripe: add WRITE SAME support
  dm: remove map_info
  dm snapshot: do not use map_context
  dm thin: dont use map_context
  dm raid1: dont use map_context
  dm flakey: dont use map_context
  dm raid1: rename read_record to bio_record
  dm: move target request nr to dm_target_io
  dm snapshot: use per_bio_data
  dm verity: use per_bio_data
  dm raid1: use per_bio_data
  dm: introduce per_bio_data
  dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
  dm linear: add WRITE SAME support
  dm: add WRITE SAME support
  dm: prepare to support WRITE SAME
  dm ioctl: use kmalloc if possible
  dm ioctl: remove PF_MEMALLOC
  dm persistent data: improve improve space map block alloc failure message
  dm thin: use DMERR_LIMIT for errors
  ...
2012-12-21 17:08:06 -08:00
Mike Snitzer
45e621d45e dm stripe: add WRITE SAME support
Rename stripe_map_discard to stripe_map_range and reuse it for WRITE
SAME bio processing.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:41 +00:00
Mikulas Patocka
7de3ee57da dm: remove map_info
This patch removes map_info from bio-based device mapper targets.
map_info is still used for request-based targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:41 +00:00
Mikulas Patocka
ee18026ac6 dm snapshot: do not use map_context
Eliminate struct map_info from dm-snap.

map_info->ptr was used in dm-snap to indicate if the bio was tracked.
If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash.

This patch removes the use of map_info->ptr. We determine if the bio was
tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true,
the bio is not tracked, if it is false, the bio is tracked.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:41 +00:00
Mikulas Patocka
59c3d2c6a1 dm thin: dont use map_context
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.

This patch removes any use of map_info in preparation for the next patch
that removes map_info from bio-based device mapper.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:40 +00:00
Mikulas Patocka
0045d61b5b dm raid1: dont use map_context
Don't use map_info any more in dm-raid1.

map_info was used for writes to hold the region number. For this purpose
we add a new field dm_bio_details to dm_raid1_bio_record.

map_info was used for reads to hold a pointer to dm_raid1_bio_record (if
the pointer was non-NULL, bio details were saved; if the pointer was
NULL, bio details were not saved). We use
dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is
NULL, details were not saved, if bi_bdev is non-NULL, details were
saved.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:40 +00:00
Mikulas Patocka
c7cfdf5973 dm flakey: dont use map_context
Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:39 +00:00
Mikulas Patocka
89c7cd8974 dm raid1: rename read_record to bio_record
Rename struct read_record to bio_record in dm-raid1.

In the following patch, the structure will be used for both read and
write bios, so rename it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:39 +00:00
Mikulas Patocka
ddbd658f64 dm: move target request nr to dm_target_io
This patch moves target_request_nr from map_info to dm_target_io and
makes it accessible with dm_bio_get_target_request_nr.

This patch is a preparation for the next patch that removes map_info.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:39 +00:00
Mikulas Patocka
42bc954f2a dm snapshot: use per_bio_data
Replace tracked_chunk_pool with per_bio_data in dm-snap.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:38 +00:00
Mikulas Patocka
e42c3f914d dm verity: use per_bio_data
Replace io_mempool with per_bio_data in dm-verity.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:38 +00:00
Mikulas Patocka
39cf0ed27e dm raid1: use per_bio_data
Replace read_record_pool with per_bio_data in dm-raid1.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:38 +00:00
Mikulas Patocka
c0820cf5ad dm: introduce per_bio_data
Introduce a field per_bio_data_size in struct dm_target.

Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.

Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.

If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:38 +00:00
Mike Snitzer
70d6c400ac dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
Add WRITE SAME support to dm-io and make it accessible to
dm_kcopyd_zero().  dm_kcopyd_zero() provides an asynchronous interface
whereas the blkdev_issue_write_same() interface is synchronous.

WRITE SAME is a SCSI command that can be leveraged for more efficient
zeroing of a specified logical extent of a device which supports it.
Only a single zeroed logical block is transfered to the target for each
WRITE SAME and the target then writes that same block across the
specified extent.

The dm thin target uses this.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:37 +00:00
Mike Snitzer
4f0b70b047 dm linear: add WRITE SAME support
The linear target can already support WRITE SAME requests so signal
this by setting num_write_same_requests to 1.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:37 +00:00
Mike Snitzer
23508a96cd dm: add WRITE SAME support
WRITE SAME bios have a payload that contain a single page.  When
cloning WRITE SAME bios DM has no need to modify the bi_io_vec
attributes (and doing so would be detrimental).  DM need only alter the
start and end of the WRITE SAME bio accordingly.

Rather than duplicate __clone_and_map_discard, factor out a common
function that is also used by __clone_and_map_write_same.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:37 +00:00
Mike Snitzer
d54eaa5a0f dm: prepare to support WRITE SAME
Allow targets to opt in to WRITE SAME support by setting
'num_write_same_requests' in the dm_target structure.

A dm device will only advertise WRITE SAME support if all its
targets and all its underlying devices support it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Mikulas Patocka
9c5091f2ee dm ioctl: use kmalloc if possible
If the parameter buffer is small enough, try to allocate it with kmalloc()
rather than vmalloc().

vmalloc is noticeably slower than kmalloc because it has to manipulate
page tables.

In my tests, on PA-RISC this patch speeds up activation 13 times.
On Opteron this patch speeds up activation by 5%.

This patch introduces a new function free_params() to free the
parameters and this uses new flags that record whether or not vmalloc()
was used and whether or not the input buffer must be wiped after use.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Mikulas Patocka
5023e5cf58 dm ioctl: remove PF_MEMALLOC
When allocating memory for the userspace ioctl data, set some
appropriate GPF flags directly instead of using PF_MEMALLOC.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Joe Thornber
7960123f2d dm persistent data: improve improve space map block alloc failure message
Improve space map error message when unable to allocate a new
metadata block.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Mike Snitzer
c397741c76 dm thin: use DMERR_LIMIT for errors
Throttle all errors logged from the IO path by dm thin.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:34 +00:00
Mike Snitzer
89ddeb8cb1 dm persistent data: use DMERR_LIMIT for errors
Nearly all of persistent-data is in the IO path so throttle error
messages with DMERR_LIMIT to limit the amount logged when
something has gone wrong.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:34 +00:00
Mike Snitzer
a5bd968aeb dm block manager: reinstate message when validator fails
Reinstate a useful error message when the block manager buffer validator fails.
This was mistakenly eliminated when the block manager was converted to use
dm-bufio.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:34 +00:00
Jonathan Brassow
3a0f9aaee0 dm raid: round region_size to power of two
If the user does not supply a bitmap region_size to the dm raid target,
a reasonable size is computed automatically.  If this is not a power of 2,
the md code will report an error later.

This patch catches the problem early and rounds the region_size to the
next power of two.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Joe Thornber
2aab38502d dm thin: cleanup dead code
Remove unused @data_block parameter from cell_defer.
Change thin_bio_map to use many returns rather than setting a variable.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Joe Thornber
f286ba0eed dm thin: rename cell_defer_except to cell_defer_no_holder
Rename cell_defer_except() to cell_defer_no_holder() which describes
its function more clearly.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Mikulas Patocka
9aa0c0e60f dm snapshot: optimize track_chunk
track_chunk is always called with interrupts enabled. Consequently, we
do not need to save and restore interrupt state in "flags" variable.
This patch changes spin_lock_irqsave to spin_lock_irq and
spin_unlock_irqrestore to spin_unlock_irq.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Mikulas Patocka
19cbbc60c6 dm raid: use DM_ENDIO_INCOMPLETE
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Mikulas Patocka
7c27213b20 dm raid1: remove impossible mempool_alloc error test
mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
that tests if read_record is non-NULL is always true.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Mike Snitzer
018debea8d dm thin: emit ignore_discard in status when discards disabled
If "ignore_discard" is specified when creating the thin pool device then
discard support is disabled for that device.  The pool device's status
should reflect this fact rather than stating "no_discard_passdown"
(which implies discards are enabled but passdown is disabled).

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Joe Thornber
e3cbf94513 dm persistent data: fix nested btree deletion
When deleting nested btrees, the code forgets to delete the innermost
btree.  The thin-metadata code serendipitously compensates for this by
claiming there is one extra layer in the tree.

This patch corrects both problems.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Joe Thornber
563af186df dm thin: wake worker when discard is prepared
When discards are prepared it is best to directly wake the worker that
will process them.  The worker will be woken anyway, via periodic
commit, but there is no reason to not wake_worker here.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:31 +00:00
Joe Thornber
e8088073c9 dm thin: fix race between simultaneous io and discards to same block
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.

Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding.  DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.

The race manifests as follows:

1. A non-discard bio is mapped in thin_bio_map.
   This doesn't lock out parallel activity to the same block.

2. A discard bio is issued to the same block as the non-discard bio.

3. The discard bio is locked in a dm_bio_prison_cell in process_discard
   to lock out parallel activity against the same block.

4. The non-discard bio's mapping continues and its all_io_entry is
   incremented so the bio is accounted for in the thin pool's all_io_ds
   which is a dm_deferred_set used to track time locality of non-discard IO.

5. The non-discard bio is finally locked in a dm_bio_prison_cell in
   process_bio.

The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:

INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
 ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
 ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
 ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
 [<ffffffff814e5a19>] schedule+0x29/0x70
 [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
 [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
 [<ffffffff814e589e>] wait_for_common+0x11e/0x190
 [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
 [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
 [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
 [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
 [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
 [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
 [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
 [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
 [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
 [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b

The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.

The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks.  That cell locking wasn't
occurring early enough in thin_bio_map.  This patch fixes this.

Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.

Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so.  Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.

This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org
2012-12-21 20:23:31 +00:00
Joe Thornber
b7ca9c9273 dm thin: replace dm_cell_release_singleton with cell_defer_except
Change existing users of the function dm_cell_release_singleton to share
cell_defer_except instead, and then remove the now-unused function.

Everywhere that calls dm_cell_release_singleton, the bio in question
is the holder of the cell.

If there are no non-holder entries in the cell then cell_defer_except
behaves exactly like dm_cell_release_singleton.  Conversely, if there
*are* non-holder entries then dm_cell_release_singleton must not be used
because those entries would need to be deferred.

Consequently, it is safe to replace use of dm_cell_release_singleton
with cell_defer_except.

This patch is a pre-requisite for "dm thin: fix race between
simultaneous io and discards to same block".

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:31 +00:00
Mike Snitzer
c1a94672a8 dm: disable WRITE SAME
WRITE SAME bios are not yet handled correctly by device-mapper so
disable their use on device-mapper devices by setting
max_write_same_sectors to zero.

As an example, a ciphertext device is incompatible because the data
gets changed according to the location at which it written and so the
dm crypt target cannot support it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:30 +00:00
Alasdair G Kergon
e910d7ebec dm ioctl: prevent unsafe change to dm_ioctl data_size
Abort dm ioctl processing if userspace changes the data_size parameter
after we validated it but before we finished copying the data buffer
from userspace.

The dm ioctl parameters are processed in the following sequence:
 1. ctl_ioctl() calls copy_params();
 2. copy_params() makes a first copy of the fixed-sized portion of the
    userspace parameters into the local variable "tmp";
 3. copy_params() then validates tmp.data_size and allocates a new
    structure big enough to hold the complete data and copies the whole
    userspace buffer there;
 4. ctl_ioctl() reads userspace data the second time and copies the whole
    buffer into the pointer "param";
 5. ctl_ioctl() reads param->data_size without any validation and stores it
    in the variable "input_param_size";
 6. "input_param_size" is further used as the authoritative size of the
    kernel buffer.

The problem is that userspace code could change the contents of user
memory between steps 2 and 4.  In particular, the data_size parameter
can be changed to an invalid value after the kernel has validated it.
This lets userspace force the kernel to access invalid kernel memory.

The fix is to ensure that the size has not changed at step 4.

This patch shouldn't have a security impact because CAP_SYS_ADMIN is
required to run this code, but it should be fixed anyway.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2012-12-21 20:23:30 +00:00
Mikulas Patocka
550929faf8 dm persistent data: rename node to btree_node
This patch fixes a compilation failure on sparc32 by renaming struct node.

struct node is already defined in include/linux/node.h. On sparc32, it
happens to be included through other dependencies and persistent-data
doesn't compile because of conflicting declarations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:30 +00:00
Linus Torvalds
ea88eeac0c md update for 3.8
Mostly just little fixes.  Probably biggest part is
 AVX accelerated RAID6 calculations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUM/w2Dnsnt1WYoG5AQKXlg/9F5juv4CjRkRRFLqZgOPBLmn/s/2Vspgh
 2Kv8Jcyixd8jUQNbobZv0ahlJH/iSU61kpOE8QjLbKi5Y42vAbM0ZU2aHJ6nqGZy
 HiTI8K+7kTvCK3ZXLcUQ+4oPPBNTcoTZbLWaEOmIqB1ruLddoIR7M9fG3PspVeG0
 jijnXR8IfL6mr4YDXnJkEhFrneTysVik05RkKYZKyM/9r3stAoMJ9o0/EFy3OFxb
 lO6mLEtvjVArXcnuf1RMCw2YKgki9Y4r73HCplgQsVFvcxcpsya4gFF+lRR5j7cO
 /eMYbSQ89iWEYKh1dJ9u1nofc8fX5ia71QQyO1fkO4GXRHXPVIyBgKSbe7SaL6iG
 JUMm7idUV2rZGeq3ln3k8Yor4QqHvN1n7pRKKUF+ZdsPoQ1B/TABu+qpsAdo5ZhP
 fxDsULsHrzEaxgetd4V8F2Uptca9ni43sMI8mwsvVlA0p6SOzMIyoJLC9xAZpx11
 b3H3+7Oje/fasmszBoq5B9uAlSt9XXVN4DDn2q6cX+S96JSX6jcsN1c6cJBO+ZxB
 OU6a6P5mnU6HuxU02rspe7G8BeU+ybaonErOW+GdyC4r7M/cImC0dSp0NGHK2211
 oqu0xBx/Q/ddTFwKQqa4HzR2ws09+LhKbjdqYIhCEKttIbLIAjf73ARZ19XPSRRX
 pDR/ey2CB6E=
 =uK52
 -----END PGP SIGNATURE-----

Merge tag 'md-3.8' of git://neil.brown.name/md

Pull md update from Neil Brown:
 "Mostly just little fixes.  Probably biggest part is AVX accelerated
  RAID6 calculations."

* tag 'md-3.8' of git://neil.brown.name/md:
  md/raid5: add blktrace calls
  md/raid5: use async_tx_quiesce() instead of open-coding it.
  md: Use ->curr_resync as last completed request when cleanly aborting resync.
  lib/raid6: build proper files on corresponding arch
  lib/raid6: Add AVX2 optimized gen_syndrome functions
  lib/raid6: Add AVX2 optimized recovery functions
  md: Update checkpoint of resync/recovery based on time.
  md:Add place to update ->recovery_cp.
  md.c: re-indent various 'switch' statements.
  md: close race between removing and adding a device.
  md: removed unused variable in calc_sb_1_csm.
2012-12-18 09:32:44 -08:00
NeilBrown
a9add5d92b md/raid5: add blktrace calls
This makes it easier to trace what raid5 is doing.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-18 10:22:21 +11:00
Linus Torvalds
9228ff9038 Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "Now that the core bits are in, here are the driver bits for 3.8.  The
  branch contains:

   - A huge pile of drbd bits that were dumped from the 3.7 merge
     window.  Following that, it was both made perfectly clear that
     there is going to be no more over-the-wall pulls and how the
     situation on individual pulls can be improved.

   - A few cleanups from Akinobu Mita for drbd and cciss.

   - Queue improvement for loop from Lukas.  This grew into adding a
     generic interface for waiting/checking an even with a specific
     lock, allowing this to be pulled out of md and now loop and drbd is
     also using it.

   - A few fixes for xen back/front block driver from Roger Pau Monne.

   - Partition improvements from Stephen Warren, allowing partiion UUID
     to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
  drbd: update Kconfig to match current dependencies
  drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
  drbd: close race between drbd_set_role and drbd_connect
  drbd: respect no-md-barriers setting also when changed online via disk-options
  drbd: Remove obsolete check
  drbd: fixup after wait_even_lock_irq() addition to generic code
  loop: Limit the number of requests in the bio list
  wait: add wait_event_lock_irq() interface
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  block: partition: msdos: provide UUIDs for partitions
  init: reduce PARTUUID min length to 1 from 36
  block: store partition_meta_info.uuid as a string
  cciss: use check_signature()
  cciss: cleanup bitops usage
  drbd: use copy_highpage
  drbd: if the replication link breaks during handshake, keep retrying
  drbd: check return of kmalloc in receive_uuids
  drbd: Broadcast sync progress no more often than once per second
  drbd: don't try to clear bits once the disk has failed
  ...
2012-12-17 13:39:11 -08:00
Linus Torvalds
a2013a13e6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial branch from Jiri Kosina:
 "Usual stuff -- comment/printk typo fixes, documentation updates, dead
  code elimination."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  HOWTO: fix double words typo
  x86 mtrr: fix comment typo in mtrr_bp_init
  propagate name change to comments in kernel source
  doc: Update the name of profiling based on sysfs
  treewide: Fix typos in various drivers
  treewide: Fix typos in various Kconfig
  wireless: mwifiex: Fix typo in wireless/mwifiex driver
  messages: i2o: Fix typo in messages/i2o
  scripts/kernel-doc: check that non-void fcts describe their return value
  Kernel-doc: Convention: Use a "Return" section to describe return values
  radeon: Fix typo and copy/paste error in comments
  doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
  various: Fix spelling of "asynchronous" in comments.
  Fix misspellings of "whether" in comments.
  eisa: Fix spelling of "asynchronous".
  various: Fix spelling of "registered" in comments.
  doc: fix quite a few typos within Documentation
  target: iscsi: fix comment typos in target/iscsi drivers
  treewide: fix typo of "suport" in various comments and Kconfig
  treewide: fix typo of "suppport" in various comments
  ...
2012-12-13 12:00:02 -08:00
NeilBrown
749586b7d3 md/raid5: use async_tx_quiesce() instead of open-coding it.
handle_stripe_expansion contains:

        if (tx) {
                async_tx_ack(tx);
                dma_wait_for_async_tx(tx);
        }

which is very similar to the body of async_tx_quiesce(),
except that the later handles an error from dma_wait_for_async_tx()
(admittedly by panicing, but that decision belongs in the dma
code, not the md code).

So just us async_tx_quiesce().

Acked-by: Dan Williams <djbw@fb.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-13 19:52:32 +11:00
majianpeng
0a19caabf0 md: Use ->curr_resync as last completed request when cleanly aborting resync.
If a resync is aborted cleanly, ->curr_resync is a reliable
record of where we got up to.
If there was an error it is less reliable but we always know that
->curr_resync_completed is safe.

So add a flag MD_RECOVERY_ERROR to differentiate between these cases
and set recovery_cp accordingly.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-13 19:52:11 +11:00
majianpeng
54f89341e8 md: Update checkpoint of resync/recovery based on time.
md will current only only checkpoint recovery or resync ever 1/16th
of the device size.  As devices get larger this can become a long time
an so a lot of work that might need to be duplicated after a shutdown.

So add a time-based checkpoint.  Every 5 minutes limits the amount of
duplicated effort to at most 5 minutes, and has almost zero impact on
performance.

[changelog entry re-written by NeilBrown]

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-13 16:41:40 +11:00
kernelmail
35d78c6696 md:Add place to update ->recovery_cp.
In resyncing, recovery_cp only updated when resync aborted or completed.
But in md drives,many place used it to judge.So add a place to update.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-13 16:41:01 +11:00
NeilBrown
c02c0aeb6c md.c: re-indent various 'switch' statements.
Intent was unnecessarily deep.

Also change one 'switch' which has a single case element, into an
'if'.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-11 13:39:21 +11:00
NeilBrown
a7a3f08dc2 md: close race between removing and adding a device.
When we remove a device from an md array, the final removal of
the "dev-XX" sys entry is run asynchronously.
If we then re-add that device immediately before the worker thread
gets to run, we can end up trying to add the "dev-XX" sysfs entry back
before it has been removed.

So in both places where we add a device, call
  flush_workqueue(md_misc_wq);
before taking the md lock (as holding the md lock can prevent removal
to complete).

Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-11 13:35:54 +11:00
NeilBrown
1f3c9907b8 md: removed unused variable in calc_sb_1_csm.
'i' is unused.

NeilBrown <neilb@suse.de>
2012-12-11 13:09:00 +11:00
Linus Torvalds
4ccc804586 Single bugfix for raid1/raid10.
Fixes a recently introduced deadlock.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAULVcJjnsnt1WYoG5AQKJjw//WXClUwmYi7FD9u8OfqLKsmvJCB8rTQFL
 RQTkm/AGwXMTrmV2iifR7Hpm14wP7pbwwiFNbLv4cw8W8ldt+4PfjCRTyoSwH5HT
 +evDAxmtuTfhznGn9fWCJClmrng0+W0ir3Bmkju+u35orCx97+98Cgv4rVAeYZ0R
 TR59g10y0c1QQuLPXoe3J5iVKuXjrlW4USIGRzkaqKmSZa9LGLETLE9V/RQtcdVD
 HB+uVMEMIGfibWSa918yWRbje8tGYeFyWOrxLs6eS/MdMEWPOdYgesxMErN+8/9q
 ZPYc+wIbGdYfzn8RcuAlE10ZMkS/6eNCn/O5ztBrM9Iyztecv3TKxNzb1S9RHppZ
 ze9d7qfX5kDhwc7YPikPZlnP4CDElDjaPzb0jSyy6FwNzWV45YuC9D5n4xGPOgcC
 83ORlSzMcv6NOFZc8HjrV4NFYE4Dezm0sThFPMEkY2FfLzIztg1H5Q0k0bvfxtqa
 yzCaQtuGjMhsbcLELqHCXFNHFhBVaetuFKAPRnynnkgSDMiZVjmV5/rsapy+qBON
 4BSI7Shwq5jn1xrqVd6ylLic5nkFIGuU7jZ15VftzP3ggQxmSLuq8Q7ZOZ+cXdZ9
 bQTkZyrwsIp8qxBE9DCi33VDDcUNiSvCnTdr18XPAzJ0DQIQ4hhlmLxGGj8BV5f4
 KKOqr3RBP6I=
 =fV2f
 -----END PGP SIGNATURE-----

Merge tag 'md-3.7-fixes' of git://neil.brown.name/md

Pull md bugfix from NeilBrown:
 "Single bugfix for raid1/raid10.

  Fixes a recently introduced deadlock."

* tag 'md-3.7-fixes' of git://neil.brown.name/md:
  md/raid1{,0}: fix deadlock in bitmap_unplug.
2012-12-02 16:24:31 -08:00
Lukas Czerner
eed8c02e68 wait: add wait_event_lock_irq() interface
New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to regular wait
includes, introduces new macro wait_event_lock_irq_cmd() instead of using
the old method with omitting cmd parameter which is ugly and makes a use
of new macros in the MD. It also introduces the _interruptible_ variant.

The use of new interface is when one have a special lock to protect data
structures used in the condition, or one also needs to invoke "cmd"
before putting it to sleep.

All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.

Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-30 11:47:57 +01:00
NeilBrown
874807a831 md/raid1{,0}: fix deadlock in bitmap_unplug.
If the raid1 or raid10 unplug function gets called
from a make_request function (which is very possible) when
there are bios on the current->bio_list list, then it will not
be able to successfully call bitmap_unplug() and it could
need to submit more bios and wait for them to complete.
But they won't complete while current->bio_list is non-empty.

So detect that case and handle the unplugging off to another thread
just like we already do when called from within the scheduler.

RAID1 version of bug was introduced in 3.6, so that part of fix is
suitable for 3.6.y.  RAID10 part won't apply.

Cc: stable@vger.kernel.org
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-27 12:14:40 +11:00
Linus Torvalds
1d838d70fb Several bug fixes for md in 3.7
- raid5 discard has problems
  - raid10 replacement devices have problems
  - bad block lock seqlock usage has problems
  - dm-raid doesn't free everything
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUK/PfTnsnt1WYoG5AQJlFBAAry6TrfIEed7Sz1BwY0w1Ofd5ZFt6DCN3
 CXc6yi7LQhaMAUYsMcF07BFfuphal0St68vwckFkd1jPShUgruetzsUPLdS1+cql
 AKOQZmJegN+yvpf+N6PxER8z0Ju8M0RNVCvgRZB166ujmoEHGf7A564Hby+FINpZ
 zk1d5eVtcRL05oV0NbeLaX8bNp42nNx2wwvFtM6NEVF4vwbzGzXkC9ePQ6oERJvQ
 Oqsu6F+TzqztIPYk/fbl1Yr/FPVAWXi4dR7KNxs/jHFcnWPi9vKcjjh1jrq46rNy
 xQY+y0xW6FlN0uApIKT6NC3UWutgwOGUqRdCRc4LJ1nT6aHVIn5OCIsipgRrlV0O
 da5pM+rgIMJK3kyT6NjhtuWuQZE4P4OSOmnq5q81VT9XOKADVsFOfibtrIr8cxYS
 c/8mNJVfd+cU58XNKGIEt886DsN+uzWiY8U8HZVckfeVxrBTIPas4ERXlurx+G1D
 jhXqK8TuEfi6ILNdBlWPphAr2ytFqWWpQIGXgYGHEIJp5WaUHoEoEblznl1MiRlZ
 +tYIYy0SRkcZuxs6nUNF8Or5vFidjvaIFJPjIJwSIhwgzkaV+YFad4GfI7/WgWaq
 7VU12MG7UlXLlaGN1Yadvh3jAk7L45DPzWUa/Zgvvtrvvdp3JU7VQhD8d6oc/kxD
 3IOrUdAXWxU=
 =fznK
 -----END PGP SIGNATURE-----

Merge tag 'md-3.7-fixes' of git://neil.brown.name/md

Pull md fixes from NeilBrown:
 "Several bug fixes for md in 3.7:

   - raid5 discard has problems
   - raid10 replacement devices have problems
   - bad block lock seqlock usage has problems
   - dm-raid doesn't free everything"

* tag 'md-3.7-fixes' of git://neil.brown.name/md:
  md/raid10: decrement correct pending counter when writing to replacement.
  md/raid10: close race that lose writes lost when replacement completes.
  md/raid5: Make sure we clear R5_Discard when discard is finished.
  md/raid5: move resolving of reconstruct_state earlier in stripe_handle.
  md/raid5: round discard alignment up to power of 2.
  md: make sure everything is freed when dm-raid stops an array.
  md: Avoid write invalid address if read_seqretry returned true.
  md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.
2012-11-23 12:11:13 -10:00
Jens Axboe
a8c32a5c98 dm: fix deadlock with request based dm and queue request_fn recursion
Request based dm attempts to re-run the request queue off the
request completion path. If used with a driver that potentially does
end_io from its request_fn, we could deadlock trying to recurse
back into request dispatch. Fix this by punting the request queue
run to kblockd.

Tested to fix a quickly reproducible deadlock in such a scenario.

Cc: stable@kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:54 +01:00
NeilBrown
884162df2a md/raid10: decrement correct pending counter when writing to replacement.
When a write to a replacement device completes, we carefully
and correctly found the rdev that the write actually went to
and the blithely called rdev_dec_pending on the primary rdev,
even if this write was to the replacement.

This means that any writes to an array while a replacement
was ongoing would cause the nr_pending count for the primary
device to go negative, so it could never be removed.

This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.

Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 15:12:42 +11:00
NeilBrown
e7c0c3fa29 md/raid10: close race that lose writes lost when replacement completes.
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement.  The faulty should be removed and
the replacement moved in place very quickly, bit it isn't instant.

So the code write out to the array must handle the possibility that
the only working device for some slot in the replacement - but it
doesn't.  If the primary device is faulty it just gives up.  This
can lead to corruption.

So make the code more robust: if either  the primary or the
replacement is present and working, write to them.  Only when
neither are present do we give up.

This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.

Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 15:12:36 +11:00
NeilBrown
ca64cae960 md/raid5: Make sure we clear R5_Discard when discard is finished.
commit 9e44476851
    MD: raid5 avoid unnecessary zero page for trim

change raid5 to clear R5_Discard when the complete request is
handled rather than when submitting the per-device discard request.
However it did not clear R5_Discard for the parity device.

This means that if the stripe_head was reused before it expired from
the cache, the setting would be wrong and a hang would result.

Also if the R5_Uptodate bit happens to be set, R5_Discard again
won't be cleared.  But R5_Uptodate really should be clear at this point.

So make sure R5_Discard is cleared in all cases, and clear
R5_Uptodate when a 'discard' completes.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 09:14:13 +11:00
NeilBrown
ef5b7c69b7 md/raid5: move resolving of reconstruct_state earlier in
stripe_handle.

The chunk of code in stripe_handle which responds to a
*_result value in reconstruct_state is really the completion
of some processing that happened outside of handle_stripe
(possibly asynchronously) and so should be one of the first
things done in handle_stripe().

After the next patch it will be important that it happens before
handle_stripe_clean_event(), as that will clear some dev->flags
bit that this code tests.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 09:14:09 +11:00
NeilBrown
4ac6875eeb md/raid5: round discard alignment up to power of 2.
blkdev_issue_discard currently assumes that the granularity
is a power of 2.  So in raid5, round the chosen number up to
avoid embarrassment.

Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 19:42:56 +11:00
NeilBrown
5eff3c439d md: make sure everything is freed when dm-raid stops an array.
md_stop() would stop an array, but not free various attached
data structures.
For internal arrays, these are freed later in do_md_stop() or
mddev_put(), but they don't apply for dm-raid arrays.
So get md_stop() to free them, and only all it from dm-raid.
For internal arrays we now call __md_stop.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 10:27:37 +11:00
majianpeng
35f9ac2dce md: Avoid write invalid address if read_seqretry returned true.
If read_seqretry returned true and bbp was changed, it will write
invalid address which can cause some serious problem.

This bug was introduced by commit v3.0-rc7-130-g2699b67.
So fix is suitable for 3.0.y thru 3.6.y.

Reported-by: zhuwenfeng@kedacom.com
Tested-by: zhuwenfeng@kedacom.com
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 10:27:17 +11:00
majianpeng
ab05613a06 md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.
This bug was introduced by commit(v3.0-rc7-126-g2230dfe).
So fix is suitable for 3.0.y thru 3.6.y.

Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 10:27:05 +11:00
Jonathan Brassow
ed30be077e MD RAID10: Fix oops when creating RAID10 arrays via dm-raid.c
Commit 2863b9eb didn't take into account the changes to add TRIM support to
RAID10 (commit 532a2a3fb).  That is, when using dm-raid.c to create the
RAID10 arrays, there is no mddev->gendisk or mddev->queue.  The code added
to support TRIM simply assumes that mddev->queue is available without
checking.  The result is an oops any time dm-raid.c attempts to create a
RAID10 device.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31 11:42:30 +11:00
NeilBrown
02b898f2f0 md/raid1: Fix assembling of arrays containing Replacements.
setup_conf in raid1.c uses conf->raid_disks before assigning
a value.  It is used when including 'Replacement' devices.

The consequence is that assembling an array which contains a
replacement will misbehave and either not include the replacement, or
not include the device being replaced.

Though this doesn't lead directly to data corruption, it could lead to
reduced data safety.

So use mddev->raid_disks, which is initialised, instead.

Bug was introduced by commit c19d57980b
      md/raid1: recognise replacements when assembling arrays.

in 3.3, so fix is suitable for 3.3.y thru 3.6.y.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31 11:42:03 +11:00
Masanari Iida
83f0d77a7f md: Fix typo in drivers/md
Correct spelling typo in drivers/md.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-10-29 22:57:50 +01:00
Eric Sandeen
0be1fecd7e md faulty: use disk_stack_limits()
in:
fe86cdce block: do not artificially constrain max_sectors for stacking drivers

max_sectors defaults to UINT_MAX.  md faulty wasn't using
disk_stack_limits(), so inherited this large value as well.
This triggered a bug in XFS when stressed over md_faulty, when
a very large bio_alloc() failed.

That was on an older kernel, and I can't reproduce exactly the
same thing upstream, but I think the fix is appropriate in any
case.

Thanks to Mike Snitzer for pointing out the problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-22 10:44:55 +11:00
Linus Torvalds
9db908806b md updates for 3.7
"discard" support, some dm-raid improvements and other assorted
 bits and pieces.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+
 +kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX
 qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz
 aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b
 /f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t
 /DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W
 tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN
 47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY
 34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef
 MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1
 DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+
 /ULNzkRU1w4=
 =XRmL
 -----END PGP SIGNATURE-----

Merge tag 'md-3.7' of git://neil.brown.name/md

Pull md updates from NeilBrown:
 - "discard" support, some dm-raid improvements and other assorted bits
   and pieces.

* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
  md: refine reporting of resync/reshape delays.
  md/raid5: be careful not to resize_stripes too big.
  md: make sure manual changes to recovery checkpoint are saved.
  md/raid10: use correct limit variable
  md: writing to sync_action should clear the read-auto state.
  Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
  md/raid5: make sure to_read and to_write never go negative.
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
  md/raid5: protect debug message against NULL derefernce.
  md/raid5: add some missing locking in handle_failed_stripe.
  MD: raid5 avoid unnecessary zero page for trim
  MD: raid5 trim support
  md/bitmap:Don't use IS_ERR to judge alloc_page().
  md/raid1: Don't release reference to device while handling read error.
  raid: replace list_for_each_continue_rcu with new interface
  add further __init annotations to crypto/xor.c
  DM RAID: Fix for "sync" directive ineffectiveness
  DM RAID: Fix comparison of index and quantity for "rebuild" parameter
  DM RAID: Add rebuild capability for RAID10
  DM RAID: Move 'rebuild' checking code to its own function
  ...
2012-10-13 13:22:01 -07:00
Mikulas Patocka
dba141601d dm: store dm_target_io in bio front_pad
Use the recently-added bio front_pad field to allocate struct dm_target_io.

Prior to this patch, dm_target_io was allocated from a mempool. For each
dm_target_io, there is exactly one bio allocated from a bioset.

This patch merges these two allocations into one allocation: we create a
bioset with front_pad equal to the size of dm_target_io so that every
bio allocated from the bioset has sizeof(struct dm_target_io) bytes
before it. We allocate a bio and use the bytes before the bio as
dm_target_io.

_tio_cache is removed and the tio_pool mempool is now only used for
request-based devices.

This idea was introduced by Kent Overstreet.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: tj@kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:15 +01:00
Mike Snitzer
4f81a41762 dm thin: move bio_prison code to separate module
The bio prison code will be useful to other future DM targets so
move it to a separate module.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:13 +01:00
Mike Snitzer
44feb387f6 dm thin: prepare to separate bio_prison code
The bio prison code will be useful to share with future DM targets.

Prepare to move this code into a separate module, adding a dm prefix
to structures and functions that will be exported.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:10 +01:00
Mike Snitzer
28eed34e76 dm thin: support discard with non power of two block size
Support discards when the pool's block size is not a power of 2.
The block layer assumes discard_granularity is a power of 2 (in
blkdev_issue_discard), so we set this to the largest power of 2 that is
a divides into the number of sectors in each block, but never less than
DATA_DEV_BLOCK_SIZE_MIN_SECTORS.

This patch eliminates the "Discard support must be disabled when the
block size is not a power of 2" constraint that was imposed in commit
55f2b8b ("dm thin: support for non power of 2 pool blocksize").  That
commit was incomplete: using a block size that is not a power of 2
shouldn't mean disabling discard support on the device completely.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:07 +01:00
Wei Yongjun
0bcf08798e dm persistent data: convert to use le32_add_cpu
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:47 +01:00
Mikulas Patocka
fe5fe90639 dm: use ACCESS_ONCE for sysfs values
Use the ACCESS_ONCE macro in dm-bufio and dm-verity where a variable
can be modified asynchronously (through sysfs) and we want to prevent
compiler optimizations that assume that the variable hasn't changed.
(See Documentation/atomic_ops.txt.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:46 +01:00
Wei Yongjun
54499afbb8 dm bufio: use list_move
Use list_move() instead of list_del() + list_add().

spatch with a semantic match was used to find this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:44 +01:00
Wei Yongjun
a71a261f5c dm mpath: fix check for null mpio in end_io fn
The mpio dereference should be moved below the BUG_ON NULL test
in multipath_end_io().

spatch with a semantic match was used to found this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:42 +01:00
NeilBrown
72f36d5972 md: refine reporting of resync/reshape delays.
If 'resync_max' is set to 0 (as is often done when starting a
reshape, so the mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same array is resyncing or reshaping etc,
when user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.

So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:25:57 +11:00
NeilBrown
e56108d65f md/raid5: be careful not to resize_stripes too big.
When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large.  This particularly happens
when md_check_recovery calls ->check_reshape().

If we use ->previous_raid_disks, we don't risk this.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:24:13 +11:00
NeilBrown
db07d85ef6 md: make sure manual changes to recovery checkpoint are saved.
If you make an array bigger but suppress resync of the new region with
  mdadm --grow /dev/mdX --size=max --assume-clean

then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.

Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:22:17 +11:00
Dan Carpenter
91502f099d md/raid10: use correct limit variable
Clang complains that we are assigning a variable to itself.  This should
be using bad_sectors like the similar earlier check does.

Bug has been present since 3.1-rc1.  It is minor but could
conceivably cause corruption or other bad behaviour.

Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:20:58 +11:00
NeilBrown
48c26ddc9f md: writing to sync_action should clear the read-auto state.
In some cases array are started in 'read-auto' state where in
nothing gets written to any device until the array is written
to.  The purpose of this is to make accidental auto-assembly
of the wrong arrays less of a risk, and to allow arrays to be
started to read suspend-to-disk images without actually changing
anything (as might happen if the array were dirty and a
resync seemed necessary).

Explicitly writing the 'sync_action' for a read-auto array currently
doesn't clear the read-auto state, so the sync action doesn't
happen, which can be confusing.

So allow any successful write to sync_action to clear any read-auto
state.

Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:19:39 +11:00
Jianpeng Ma
7f7583d420 Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:17:59 +11:00
NeilBrown
1ed850f356 md/raid5: make sure to_read and to_write never go negative.
to_read and to_write are part of the result of analysing
a stripe before handling it.
Their use is to avoid some loops and tests if the values are
known to be zero.  Thus it is not a problem if they are a
little bit larger than they should be.

So decrementing them in handle_failed_stripe serves little value, and
due to races it could cause some loops to be skipped incorrectly.

So remove those decrements.

Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:13 +11:00
Alexander Lyakas
a7854487cd md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Suggested-by: Yair Hershko <yair@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:12 +11:00
NeilBrown
b97390aec4 md/raid5: protect debug message against NULL derefernce.
The pr_debug in add_stripe_bio could race with something
changing *bip, so it is best to hold the lock until
after the pr_debug.

Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:12 +11:00
NeilBrown
143c4d0573 md/raid5: add some missing locking in handle_failed_stripe.
We really should hold the stripe_lock while accessing
'toread' else we could race with add_stripe_bio and corrupt
a list.

Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:12 +11:00
Shaohua Li
9e44476851 MD: raid5 avoid unnecessary zero page for trim
We want to avoid zero discarded dev page, because it's useless for discard.
But if we don't zero it, another read/write hit such page in the cache and will
get inconsistent data.

To avoid zero the page, we don't set R5_UPTODATE flag after construction is
done. In this way, discard write request is still issued and finished, but read
will not hit the page. If the stripe gets accessed soon, we need reread the
stripe, but since the chance is low, the reread isn't a big deal.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:49:49 +11:00
Shaohua Li
620125f2bf MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk.  To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.

Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.

The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.

For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard.  Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.

If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.

If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:49:05 +11:00
Jianpeng Ma
582e2e056a md/bitmap:Don't use IS_ERR to judge alloc_page().
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:45:36 +11:00
NeilBrown
7ad4d4a68a md/raid1: Don't release reference to device while handling read error.
When we get a read error, we arrange for raid1d to handle it.
Currently we release the reference on the device.  This can result
in
   conf->mirrors[read_disk].rdev
being NULL in fix_read_error, if the device happens to get removed
before the read error is handled.

So instead keep the reference until the read error has been fully
handled.

Reported-by: hank <pyu@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:44:30 +11:00
Michael Wang
fd177481b4 raid: replace list_for_each_continue_rcu with new interface
This patch replaces list_for_each_continue_rcu() with
list_for_each_entry_continue_rcu() to save a few lines
of code and allow removing list_for_each_continue_rcu().

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:43:21 +11:00
Jonathan Brassow
761becff01 DM RAID: Fix for "sync" directive ineffectiveness
There are two table arguments that can be given to a DM RAID target
that control whether the array is forced to (re)synchronize or skip
initialization: "sync" and "nosync".  When "sync" is given, we set
mddev->recovery_cp to 0 in order to cause the device to resynchronize.
This is insufficient if there is a bitmap in use, because the array
will simply look at the bitmap and see that there is no recovery
necessary.

The fix is to skip over the loading of the superblocks when "sync" is
given, causing new superblocks to be written that will force the array
to go through initialization (i.e. synchronization).

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:42:19 +11:00
Jonathan Brassow
7386199c47 DM RAID: Fix comparison of index and quantity for "rebuild" parameter
DM RAID: Fix comparison of index and quantity for "rebuild" parameter

The "rebuild" parameter takes an index argument that starts counting from
zero.  The conditional used to validate the index was using '>' rather than
'>=', leaving the door open for an index value that would be 1 too large.

Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:40:36 +11:00
Jonathan Brassow
4ec1e369af DM RAID: Add rebuild capability for RAID10
DM RAID:  Add code to validate replacement slots for RAID10 arrays

RAID10 can handle 'copies - 1' failures for each mirror group.  This code
ensures the user has provided a valid array - one whose devices specified for
rebuild do not exceed the amount of redundancy available.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:40:24 +11:00
Jonathan Brassow
eb6491236f DM RAID: Move 'rebuild' checking code to its own function
DM RAID:  Move chunk of code to it's own function

The code that checks whether device replacements/rebuilds are possible given
a specific RAID type is moved to it's own function.  It will further expand
when the code to check RAID10 is added.  A separate function makes it easier
to read.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:40:09 +11:00
Jonathan Brassow
2863b9eb44 MD RAID10: Prep for DM RAID10 device replacement capability
MD RAID10:  Fix a couple potential kernel panics if RAID10 is used by dm-raid

When device-mapper uses the RAID10 personality through dm-raid.c, there is no
'gendisk' structure in mddev and some sysfs information is also not populated.

This patch avoids touching those non-existent structures.

Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:38:58 +11:00
NeilBrown
1ca69c4bc4 md: avoid taking the mutex on some ioctls.
Some ioctls don't need to take the mutex and doing so can cause
a delay as it is held during super-block update.
So move those ioctls out of the mutex and rely on rcu locking
to ensure we don't access stale data.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:37:33 +11:00
Shaohua Li
4ed8731d8e MD: change the parameter of md thread
Change the thread parameter, so the thread can carry extra info. Next patch
will use it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:34:00 +11:00
NeilBrown
57c67df488 md/raid10: submit IO from originating thread instead of md thread.
queuing writes to the md thread means that all requests go through the
one processor which may not be able to keep up with very high request
rates.

So use the plugging infrastructure to submit all requests on unplug.
If a 'schedule' is needed, we fall back on the old approach of handing
the requests to the thread for it to handle.

This is nearly identical to a recent patch which provided similar
functionality to RAID1.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:32:13 +11:00
Shaohua Li
532a2a3fba md: raid 10 supports TRIM
This makes md raid 10 support TRIM.

If one disk supports discard and another not, or one has
discard_zero_data and another not, there could be inconsistent between
data from such disks. But this should not matter, discarded data is
useless. This will add extra copy in rebuild though.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:30:52 +11:00
Shaohua Li
2ff8cc2c6d md: raid 1 supports TRIM
This makes md raid 1 support TRIM.
If one disk supports discard and another not, or one has discard_zero_data and
another not, there could be inconsistent between data from such disks. But this
should not matter, discarded data is useless. This will add extra copy in rebuild
though.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:28:54 +11:00
Shaohua Li
c83057a1f4 md: raid 0 supports TRIM
This makes md raid 0 support TRIM.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:25:44 +11:00
Shaohua Li
f1cad2b68e md: linear supports TRIM
This makes md linear support TRIM.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:08:44 +11:00
Denis Efremov
bc78c57388 md/linear: rcu_dereference outside read-lock section
According to the comment in linear_stop function
rcu_dereference in linear_start and linear_stop functions
occurs under reconfig_mutex. The patch represents this
agreement in code and prevents lockdep complaint.

Found by Linux Driver Verification project (linuxtesting.org)

Signed-off-by: Denis Efremov <yefremov.denis@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:08:02 +11:00
Linus Torvalds
ce40be7a82 Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block
Pull block IO update from Jens Axboe:
 "Core block IO bits for 3.7.  Not a huge round this time, it contains:

   - First series from Kent cleaning up and generalizing bio allocation
     and freeing.

   - WRITE_SAME support from Martin.

   - Mikulas patches to prevent O_DIRECT crashes when someone changes
     the block size of a device.

   - Make bio_split() work on data-less bio's (like trim/discards).

   - A few other minor fixups."

Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton.  It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6: "mm: replace vma prio_tree with an interval tree").

So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.

* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
  block: makes bio_split support bio without data
  scatterlist: refactor the sg_nents
  scatterlist: add sg_nents
  fs: fix include/percpu-rwsem.h export error
  percpu-rw-semaphore: fix documentation typos
  fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
  blockdev: turn a rw semaphore into a percpu rw semaphore
  Fix a crash when block device is read and block size is changed at the same time
  block: fix request_queue->flags initialization
  block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
  block: ioctl to zero block ranges
  block: Make blkdev_issue_zeroout use WRITE SAME
  block: Implement support for WRITE SAME
  block: Consolidate command flag and queue limit checks for merges
  block: Clean up special command handling logic
  block/blk-tag.c: Remove useless kfree
  block: remove the duplicated setting for congestion_threshold
  block: reject invalid queue attribute values
  block: Add bio_clone_bioset(), bio_clone_kmalloc()
  block: Consolidate bio_alloc_bioset(), bio_kmalloc()
  ...
2012-10-11 09:04:23 +09:00
Linus Torvalds
033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
Linus Torvalds
c3a086e638 A few fixes for problems discovered during the 3.6 cycle.
Of particular note, are fixes to the thin target's discard support,
 which I hope is finally working correctly; and fixes for multipath
 ioctls and device limits when there are no paths.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQY5sSAAoJEK2W1qbAHj1nUKwP/RDEYw6o4WWvnMvMtyhapPaM
 yInBlep/TSr4mA7QPoe0zV9G8guZoZZaeGEgP/T4Je//ZBqw1xhEG6RIVEAgEUh/
 GrJvfYGXYoLhUiCg99vR1oWT5hQvQp8SYG9lJ1+AsjqEgVwBj7uKgl2wmmvars9X
 gPzXqqzv2IlQjhu6eLvShUixk4HFTQfPMDGaPGWWV8nNcWc0Pnb5TVZiuMeNJGrf
 Srem0ScRNF6P9stUqMA93kHp2KRKHwP6kelnuok9CW/RfcNnux1+8015DXcdbOr9
 X1+mi6VIL0Hjp5R/io0FE1YdJDyR6U/Rwjo3jHkblnegRMOMnK3bOTHmhepW/HUe
 Mav9gcXvEXNpqEvQJsaRmhR36ZgJan5mpxaSTeK1HcPuP0wePEN9Lh/ZJDY7oaB1
 33ntNV8LFIj4jXOcIJZkyAf9l/RdI7mAZ4HwNxPiNncG7LSNataguKYA1sZw9/E8
 njBbn9PyDl/arXQVCJa5ARa2hOHqtNViNGqqNVjQ6ySJuz1HgzslXqzPVG/geZQd
 yPs3ylkMNl+vbCZaEDwkuuEpOeiMgNo1BxVuhGuJMIe5Fs1lsjWbUnvwT9a0XsCQ
 fDPFAFZOfb3Xn6AV0za1SyIVgvsHoX8COBViPh8m+PaXgyTB2wf+vkgRMgwAhBRR
 IV5v+oWZSL8ayoe5okEv
 =J1f2
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull dm fixes from Alasdair G Kergon:
 "A few fixes for problems discovered during the 3.6 cycle.

  Of particular note, are fixes to the thin target's discard support,
  which I hope is finally working correctly; and fixes for multipath
  ioctls and device limits when there are no paths."

* tag 'dm-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm verity: fix overflow check
  dm thin: fix discard support for data devices
  dm thin: tidy discard support
  dm: retain table limits when swapping to new table with no devices
  dm table: clear add_random unless all devices have it set
  dm: handle requests beyond end of device instead of using BUG_ON
  dm mpath: only retry ioctl when no paths if queue_if_no_path set
  dm thin: do not set discard_zeroes_data
2012-09-28 10:00:01 -07:00
NeilBrown
80b4812407 md/raid10: fix "enough" function for detecting if array is failed.
The 'enough' function is written to work with 'near' arrays only
in that is implicitly assumes that the offset from one 'group' of
devices to the next is the same as the number of copies.
In reality it is the number of 'near' copies.

So change it to make this number explicit.

This bug makes it possible to run arrays without enough drives
present, which is dangerous.
It is appropriate for an -stable kernel, but will almost certainly
need to be modified for some of them.

Cc: stable@vger.kernel.org
Reported-by: Jakub Husák <jakub@gooseman.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-09-27 12:35:21 +10:00
Mikulas Patocka
1d55f6bcc0 dm verity: fix overflow check
This patch fixes sector_t overflow checking in dm-verity.

Without this patch, the code checks for overflow only if sector_t is
smaller than long long, not if sector_t and long long have the same size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26 23:45:48 +01:00
Mike Snitzer
0424caa145 dm thin: fix discard support for data devices
The discard limits that get established for a thin-pool or thin device
may be incompatible with the pool's data device.  Avoid this by checking
the discard limits of the pool's data device.  If an incompatibility is
found then the pool's 'discard passdown' feature is disabled.

Change thin_io_hints to ensure that a thin device always uses the same
queue limits as its pool device.

Introduce requested_pf to track whether or not the table line originally
contained the no_discard_passdown flag and use this directly for table
output.  We prepare the correct setting for discard_passdown directly in
bind_control_target (called from pool_io_hints) and store it in
adjusted_pf rather than waiting until we have access to pool->pf in
pool_preresume.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26 23:45:47 +01:00
Mike Snitzer
9bc142dd75 dm thin: tidy discard support
A little thin discard code refactoring to make the next patch (dm thin:
fix discard support for data devices) more readable.
Pull out a couple of functions (and uses bools instead of unsigned for
features).

No functional changes.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26 23:45:46 +01:00
Mike Snitzer
3ae7065616 dm: retain table limits when swapping to new table with no devices
Add a safety net that will re-use the DM device's existing limits in the
event that DM device has a temporary table that doesn't have any
component devices.  This is to reduce the chance that requests not
respecting the hardware limits will reach the device.

DM recalculates queue limits based only on devices which currently exist
in the table.  This creates a problem in the event all devices are
temporarily removed such as all paths being lost in multipath.  DM will
reset the limits to the maximum permissible, which can then assemble
requests which exceed the limits of the paths when the paths are
restored.  The request will fail the blk_rq_check_limits() test when
sent to a path with lower limits, and will be retried without end by
multipath.  This became a much bigger issue after v3.6 commit fe86cdcef
("block: do not artificially constrain max_sectors for stacking
drivers").

Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26 23:45:45 +01:00
Milan Broz
c3c4555edd dm table: clear add_random unless all devices have it set
Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
have it set. Otherwise devices with predictable characteristics may
contribute entropy.

QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings
contribute to the random pool.

For bio-based targets this flag is always 0 because such devices have no
real queue.

For request-based devices this flag was always set to 1 by default.

Now set it according to the flags on underlying devices. If there is at
least one device which should not contribute, set the flag to zero: If a
device, such as fast SSD storage, is not suitable for supplying entropy,
a request-based queue stacked over it will not be either.

Because the checking logic is exactly same as for the rotational flag,
share the iteration function with device_is_nonrot().

Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26 23:45:43 +01:00
Mike Snitzer
ba1cbad93d dm: handle requests beyond end of device instead of using BUG_ON
The access beyond the end of device BUG_ON that was introduced to
dm_request_fn via commit 29e4013de7 ("dm: implement
REQ_FLUSH/FUA support for request-based dm") was an overly
drastic (but simple) response to this situation.

I have received a report that this BUG_ON was hit and now think
it would be better to use dm_kill_unmapped_request() to fail the clone
and original request with -EIO.

map_request() will assign the valid target returned by
dm_table_find_target to tio->ti.  But when the target
isn't valid tio->ti is never assigned (because map_request isn't
called); so add a check for tio->ti != NULL to dm_done().

Reported-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26 23:45:42 +01:00
Mike Snitzer
7ba10aa6fb dm mpath: only retry ioctl when no paths if queue_if_no_path set
When there are no paths and multipath receives an ioctl, it waits until
a path becomes available.  This behaviour is incorrect if the
"queue_if_no_path" setting was not specified, as then the ioctl should
be rejected immediately, which this patch now does.

commit 35991652b ("dm mpath: allow ioctls to trigger pg init") should
have checked if queue_if_no_path was configured before queueing IO.

Checking for the queue_if_no_path feature, like is done in map_io(),
allows the following table load to work without blocking in the
multipath_ioctl retry loop:

  echo "0 1024 multipath 0 0 0 0" | dmsetup create mpath_nodevs

Without this fix the multipath_ioctl will block with the following stack
trace:

  blkid           D 0000000000000002     0 23936      1 0x00000000
   ffff8802b89e5cd8 0000000000000082 ffff8802b89e5fd8 0000000000012440
   ffff8802b89e4010 0000000000012440 0000000000012440 0000000000012440
   ffff8802b89e5fd8 0000000000012440 ffff88030c2aab30 ffff880325794040
  Call Trace:
   [<ffffffff814ce099>] schedule+0x29/0x70
   [<ffffffff814cc312>] schedule_timeout+0x182/0x2e0
   [<ffffffff8104dee0>] ? lock_timer_base+0x70/0x70
   [<ffffffff814cc48e>] schedule_timeout_uninterruptible+0x1e/0x20
   [<ffffffff8104f840>] msleep+0x20/0x30
   [<ffffffffa0000839>] multipath_ioctl+0x109/0x170 [dm_multipath]
   [<ffffffffa06bfb9c>] dm_blk_ioctl+0xbc/0xd0 [dm_mod]
   [<ffffffff8122a408>] __blkdev_driver_ioctl+0x28/0x30
   [<ffffffff8122a79e>] blkdev_ioctl+0xce/0x730
   [<ffffffff811970ac>] block_ioctl+0x3c/0x40
   [<ffffffff8117321c>] do_vfs_ioctl+0x8c/0x340
   [<ffffffff81166293>] ? sys_newfstat+0x33/0x40
   [<ffffffff81173571>] sys_ioctl+0xa1/0xb0
   [<ffffffff814d70a9>] system_call_fastpath+0x16/0x1b

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.5+
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26 23:45:41 +01:00