I hit another BUG_ON with e240c1839d. In __get_priority_stripe(),
stripe count equals to 0 initially. Between atomic_inc and BUG_ON,
get_active_stripe() finds the stripe. So the stripe count isn't 1 any more.
V2: keeps the BUG_ON suggested by Neil.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This was used in the olden days, back when onions were proper
yellow. Basically it mapped to the current buffer to be
transferred. With highmem being added more than a decade ago,
most drivers map pages out of a bio, and rq->buffer isn't
pointing at anything valid.
Convert old style drivers to just use bio_data().
For the discard payload use case, just reference the page
in the bio.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 003b5c5719 ("block: Convert drivers
to immutable biovecs") incorrectly converted biovec iteration in
dm-verity to always calculate the hash from a full biovec, but the
function only needs to calculate the hash from part of the biovec (up to
the calculated "todo" value).
Fix this issue by limiting hash input to only the requested data size.
This problem was identified using the cryptsetup regression test for
veritysetup (verity-compat-test).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAU0h5fjnsnt1WYoG5AQKmkw/8DUn0vV4q5UbLp0m2Yy6o6EOxwiSJUH/p
6EEUmSwyXou75w9OWOJKDX2lI7z1yAtzqiuCQ19ekD5lA6gAosja+D0jKjv0SA01
rQm2rMjnwOIxZUIRx/7Z+w/H1ZxjIAc8uxOcCBP6DOynWt/YAyVz1SzLLtwCxELL
N9vjgb+4lVt0E9bBYvVRNiJmtVXpDcmn1YySd6Dqyj9t+Mmysnv8QuIStrT3CE7k
apss3ew6bBbtbiJHuCno/Q4FDWVAhUH+9GMvksdajw8QW52oHV+RRBB5IpCU6hOx
OKCT4MVdzmTgi6GRhSr86Dt+KMOLWZmbx7pK7aRQPiL6uFNhqAlJDb/u0xfaHohG
DiRclZBbsHkEpejHaZcJCkyKFHQTEiia3JVk426FAhtiK1qIBuyxEc65RmKf6dsA
1KlZVeclD3wYWKG4hWk/0W3qIPOWBMll+Ely5Zg6s2X3gGy9u5TU+tUsfJaL1aDU
NOY+5D0+hg7o21kK7WgTaP2upexC/iaBrVrdlasM2KYXJVDrsfCAQr1/BwTl4qLq
Lm/OkIg+WtrQ95RvsI85Hm4PJVxBd1HeyDlKNCcz47kc3Xxqabeq8KnwERyOh4hU
U4EmAeCZmSGOOETIWQQxlIn8XdM1+dF4olUH9viEAXrQfGgUyrg6Vcc7BOdTTZCY
8ek3CWG3TwE=
=kHl3
-----END PGP SIGNATURE-----
Merge tag 'md/3.15' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"Just a few md patches for the 3.15 merge window.
Not much happening in md/raid at the moment. Just a few bug fixes
(one for -stable) and a couple of performance tweaks"
* tag 'md/3.15' of git://neil.brown.name/md:
raid5: get_active_stripe avoids device_lock
raid5: make_request does less prepare wait
md: avoid oops on unload if some process is in poll or select.
md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails.
md/bitmap: don't abuse i_writecount for bitmap files.
For sequential workload (or request size big workload), get_active_stripe can
find cached stripe. In this case, we always hold device_lock, which exposes a
lot of lock contention for such workload. If stripe count isn't 0, we don't
need hold the lock actually, since we just increase its count. And this is the
hot code path for such workload. Unfortunately we must delete the BUG_ON.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
lot of contention for sequential workload (or big request size
workload). For such workload, each bio includes several stripes. So we
can just do prepare_to_wait/finish_wait once for the whold bio instead
of every stripe. This reduces the lock contention completely for such
workload. Random workload might have the similar lock contention too,
but I didn't see it yet, maybe because my stroage is still not fast
enough.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If md-mod is unloaded while some process is in poll() or select(),
then that process maintains a pointer to md_event_waiters, and when
the try to unlink from that list, they will oops.
The procfs infrastructure ensures that ->poll won't be called after
remove_proc_entry, but doesn't provide a wait_queue_head for us to
use, and the waitqueue code doesn't provide a way to remove all
listeners from a waitqueue.
So we need to:
1/ make sure no further references to md_event_waiters are taken (by
setting md_unloading)
2/ wake up all processes currently waiting, and
3/ wait until all those processes have disconnected from our
wait_queue_head.
Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When performing a user-request check/repair (MD_RECOVERY_REQUEST is set)
on a raid1, we allocate multiple bios each with their own set of pages.
If the page allocations for one bio fails, we currently do *not* free
the pages allocated for the previous bios, nor do we free the bio itself.
This patch frees all the already-allocate pages, and makes sure that
all the bios are freed as well.
This bug can cause a memory leak which can ultimately OOM a machine.
It was introduced in 3.10-rc1.
Fixes: a07876064a
Cc: Kent Overstreet <koverstreet@google.com>
Cc: stable@vger.kernel.org (3.10+)
Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
md bitmap code currently tries to use i_writecount to stop any other
process from writing to out bitmap file. But that is really an abuse
and has bit-rotted so locking is all wrong.
So discard that - root should be allowed to shoot self in foot.
Still use it in a much less intrusive way to stop the same file being
used as bitmap on two different array, and apply other checks to
ensure the file is at least vaguely usable for bitmap storage
(is regular, is open for write. Support for ->bmap is already checked
elsewhere).
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit c140e1c4e2 ("dm thin: use per thin device deferred bio lists")
introduced the use of an rculist for all active thin devices. The use
of rcu_read_lock() in process_deferred_bios() can result in a BUG if a
dm_bio_prison_cell must be allocated as a side-effect of bio_detain():
BUG: sleeping function called from invalid context at mm/mempool.c:203
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0
3 locks held by kworker/u8:0/6:
#0: ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
#1: ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
#2: (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0
We can't process deferred bios with the rcu lock held, since
dm_bio_prison_cell allocation may block if the bio-prison's cell mempool
is exhausted.
To fix:
- Introduce a refcount and completion field to each thin_c
- Add thin_get/put methods for adjusting the refcount. If the refcount
hits zero then the completion is triggered.
- Initialise refcount to 1 when creating thin_c
- When iterating the active_thins list we thin_get() whilst the rcu
lock is held.
- After the rcu lock is dropped we process the deferred bios for that
thin.
- When destroying a thin_c we thin_put() and then wait for the
completion -- to avoid a race between the worker thread iterating
from that thin_c and destroying the thin_c.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit c140e1c4e2 ("dm thin: use per thin device deferred bio lists")
incorrectly stopped disabling irqs when taking the pool's spinlock.
Irqs must be disabled when taking the pool's spinlock otherwise a thread
could spin_lock(), then get interrupted to service thin_endio() in
interrupt context, which would then deadlock in spin_lock_irqsave().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
cache_block_size
. Fix a lock-inversion detected by LOCKDEP in dm-cache
. Fix a dangling bio bug in the dm-thinp target's process_deferred_bios
error path
. Fix corruption due to non-atomic transaction commit which allowed a
metadata superblock to be written before all other metadata was
successfully written -- this is common to all targets that use the
persistent-data library's transaction manager (dm-thinp, dm-cache and
dm-era).
. Various small cleanups in the DM core
. Add the dm-era target which is useful for keeping track of which
blocks were written within a user defined period of time called an
'era'. Use cases include tracking changed blocks for backup software,
and partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.
. Improve the on-disk layout of multithreaded writes to the dm-thin-pool
by splitting the pool's deferred bio list to be a per-thin device list
and then sorting that list using an rb_tree. The subsequent read
throughput of the data written via multiple threads improved by ~70%.
. Simplify the multipath target's handling of queuing IO by pushing
requests back to the request queue rather than queueing the IO
internally.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTPv/6AAoJEMUj8QotnQNagQYH/3EkB2f66TRfjRQpVAZuchw/
U0IbVWcMJKMdhj3uaSNzIkAbTgF+QsZUOLHP/7Q6zLq0M2J3WGrJn2ELqq53MenF
E0+rJ8duKnJ5oLhhVT62ukBDh3XHWT0JyijXPWNa2gUoYwJqM9BAlXbC/OTfUNaZ
mBCxvUWGME8k3ht310GhwvzBQjYuxIXhw8XlbGvakb9S83SZwNpCh231iumOEzPe
Vzfx/xTto0fH3R5/knNV/H9xt0Dv4vt4Aqbqqys9UbQvPzj9qN/mxUZIFg+LZh/w
WuvHHw6HcAiNNrQGFcm6i1AK2jJ+F61K3afMlYsiamTxMNM+0q/B9HemkX/0ieU=
=lY8m
-----END PGP SIGNATURE-----
Merge tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
- Fix dm-cache corruption caused by discard_block_size > cache_block_size
- Fix a lock-inversion detected by LOCKDEP in dm-cache
- Fix a dangling bio bug in the dm-thinp target's process_deferred_bios
error path
- Fix corruption due to non-atomic transaction commit which allowed a
metadata superblock to be written before all other metadata was
successfully written -- this is common to all targets that use the
persistent-data library's transaction manager (dm-thinp, dm-cache and
dm-era).
- Various small cleanups in the DM core
- Add the dm-era target which is useful for keeping track of which
blocks were written within a user defined period of time called an
'era'. Use cases include tracking changed blocks for backup
software, and partially invalidating the contents of a cache to
restore cache coherency after rolling back a vendor snapshot.
- Improve the on-disk layout of multithreaded writes to the
dm-thin-pool by splitting the pool's deferred bio list to be a
per-thin device list and then sorting that list using an rb_tree.
The subsequent read throughput of the data written via multiple
threads improved by ~70%.
- Simplify the multipath target's handling of queuing IO by pushing
requests back to the request queue rather than queueing the IO
internally.
* tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
dm cache: fix a lock-inversion
dm thin: sort the per thin deferred bios using an rb_tree
dm thin: use per thin device deferred bio lists
dm thin: simplify pool_is_congested
dm thin: fix dangling bio in process_deferred_bios error path
dm mpath: print more useful warnings in multipath_message()
dm-mpath: do not activate failed paths
dm mpath: remove extra nesting in map function
dm mpath: remove map_io()
dm mpath: reduce memory pressure when requeuing
dm mpath: remove process_queued_ios()
dm mpath: push back requests instead of queueing
dm table: add dm_table_run_md_queue_async
dm mpath: do not call pg_init when it is already running
dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
dm: stop using bi_private
dm: remove dm_get_mapinfo
dm: make dm_table_alloc_md_mempools static
dm: take care to copy the space map roots before locking the superblock
dm transaction manager: fix corruption due to non-atomic transaction commit
...
When suspending a cache the policy is walked and the individual policy
hints written to the metadata via sync_metadata(). This led to this
lock order:
policy->lock
cache_metadata->root_lock
When loading the cache target the policy is populated while the metadata
lock is held:
cache_metadata->root_lock
policy->lock
Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
ensuring the cache_metadata root_lock is held whilst all the hints are
written, rather than being repeatedly locked while policy->lock is held
(as was the case with each callout that policy_walk_mappings() made to
the old save_hint() method).
Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
locking correctness") build option. However, it is not clear how the
LOCKDEP reported paths can lead to a deadlock since the two paths,
suspending a target and loading a target, never occur at the same time.
But that doesn't mean the same lock-inversion couldn't have occurred
elsewhere.
Reported-by: Marian Csontos <mcsontos@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
A thin-pool will allocate blocks using FIFO order for all thin devices
which share the thin-pool. Because of this simplistic allocation the
thin-pool's space can become fragmented quite easily; especially when
multiple threads are requesting blocks in parallel.
Sort each thin device's deferred_bio_list based on logical sector to
help reduce fragmentation of the thin-pool's ondisk layout.
The following tables illustrate the realized gains/potential offered by
sorting each thin device's deferred_bio_list. An "io size"-sized random
read of the device would result in "seeks/io" fragments being read, with
an average "distance/seek" between each fragment.
Data was written to a single thin device using multiple threads via
iozone (8 threads, 64K for both the block_size and io_size).
unsorted:
io size seeks/io distance/seek
--------------------------------------
4k 0.000 0b
16k 0.013 11m
64k 0.065 11m
256k 0.274 10m
1m 1.109 10m
4m 4.411 10m
16m 17.097 11m
64m 60.055 13m
256m 148.798 25m
1g 809.929 21m
sorted:
io size seeks/io distance/seek
--------------------------------------
4k 0.000 0b
16k 0.000 1g
64k 0.001 1g
256k 0.003 1g
1m 0.011 1g
4m 0.045 1g
16m 0.181 1g
64m 0.747 1011m
256m 3.299 1g
1g 14.373 1g
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Pull block driver update from Jens Axboe:
"On top of the core pull request, here's the pull request for the
driver related changes for 3.15. It contains:
- Improvements for msi-x registration for block drivers (mtip32xx,
skd, cciss, nvme) from Alexander Gordeev.
- A round of cleanups and improvements for drbd from Andreas
Gruenbacher and Rashika Kheria.
- A round of clanups and improvements for bcache from Kent.
- Removal of sleep_on() and friends in DAC960, ataflop, swim3 from
Arnd Bergmann.
- Bug fix for a bug in the mtip32xx async completion code from Sam
Bradshaw.
- Bug fix for accidentally bouncing IO on 32-bit platforms with
mtip32xx from Felipe Franciosi"
* 'for-3.15/drivers' of git://git.kernel.dk/linux-block: (103 commits)
bcache: remove nested function usage
bcache: Kill bucket->gc_gen
bcache: Kill unused freelist
bcache: Rework btree cache reserve handling
bcache: Kill btree_io_wq
bcache: btree locking rework
bcache: Fix a race when freeing btree nodes
bcache: Add a real GC_MARK_RECLAIMABLE
bcache: Add bch_keylist_init_single()
bcache: Improve priority_stats
bcache: Better alloc tracepoints
bcache: Kill dead cgroup code
bcache: stop moving_gc marking buckets that can't be moved.
bcache: Fix moving_pred()
bcache: Fix moving_gc deadlocking with a foreground write
bcache: Fix discard granularity
bcache: Fix another bug recovering from unclean shutdown
bcache: Fix a bug recovering from unclean shutdown
bcache: Fix a journalling reclaim after recovery bug
bcache: Fix a null ptr deref in journal replay
...
Here's the big char/misc driver updates for 3.15-rc1.
Lots of various things here, including the new mcb driver subsystem.
All of these have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlM7ArIACgkQMUfUDdst+ylS+gCfcJr0Zo2v5aWnqD7rFtFETmFI
LhcAoNTQ4cvlVdxnI0driWCWFYxLj6at
=aj+L
-----END PGP SIGNATURE-----
Merge tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver patches from Greg KH:
"Here's the big char/misc driver updates for 3.15-rc1.
Lots of various things here, including the new mcb driver subsystem.
All of these have been in linux-next for a while"
* tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (118 commits)
extcon: Move OF helper function to extcon core and change function name
extcon: of: Remove unnecessary function call by using the name of device_node
extcon: gpio: Use SIMPLE_DEV_PM_OPS macro
extcon: palmas: Use SIMPLE_DEV_PM_OPS macro
mei: don't use deprecated DEFINE_PCI_DEVICE_TABLE macro
mei: amthif: fix checkpatch error
mei: client.h fix checkpatch errors
mei: use cl_dbg where appropriate
mei: fix Unnecessary space after function pointer name
mei: report consistently copy_from/to_user failures
mei: drop pr_fmt macros
mei: make me hw headers private to me hw.
mei: fix memory leak of pending write cb objects
mei: me: do not reset when less than expected data is received
drivers: mcb: Fix build error discovered by 0-day bot
cs5535-mfgpt: Simplify dependencies
spmi: pm: drop bus-level PM suspend/resume routines
spmi: pmic_arb: make selectable on ARCH_QCOM
Drivers: hv: vmbus: Increase the limit on the number of pfns we can handle
pch_phub: Report error writing MAC back to user
...
The thin-pool previously only had a single deferred_bios list that would
collect bios for all thin devices in the pool. Split this per-pool
deferred_bios list out to per-thin deferred_bios_list -- doing so
enables increased parallelism when processing deferred bios. And now
that each thin device has it's own deferred_bios_list we can sort all
bios in the list using logical sector. The requeue code in error
handling path is also cleaner as a side-effect.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The pool is congested if the pool is in PM_OUT_OF_DATA_SPACE mode. This
is more explicit/clear/efficient than inferring whether or not the pool
is congested by checking if retry_on_resume_list is empty.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
If unable to ensure_next_mapping() we must add the current bio, which
was removed from the @bios list via bio_list_pop, back to the
deferred_bios list before all the remaining @bios.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
The warning message "Unrecognised multipath message received" is
displayed in two different situations in multipath_message(): when the
number of arguments passed is invalid and when the string passed in
argv[0] is not recognized.
Make it easier to identify where the problem is by making these warnings
more specific with additional context for each case.
Signed-off-by: Jose Castillo <jcastillo@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
activate_path() is run without a lock, so the path might be
set to failed before activate_path() had a chance to run.
This patch add a check for ->active in activate_path() to
avoid unnecessary overhead by calling functions which are known
to be failing.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Return early for case when no path exists, and when the
pathgroup isn't ready. This eliminates the need for
extra nesting for the the common case.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
multipath_map() is now just a wrapper around map_io(), so we
can rename map_io() to multipath_map().
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
When multipath needs to requeue I/O in the block layer the per-request
context shouldn't be allocated, as it will be freed immediately
afterwards anyway. Avoiding this memory allocation will reduce memory
pressure during requeuing.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
process_queued_ios() has served 3 functions:
1) select pg and pgpath if none is selected
2) start pg_init if requested
3) dispatch queued IOs when pg is ready
Basically, a call to queue_work(process_queued_ios) can be replaced by
dm_table_run_md_queue_async(), which runs request queue and ends up
calling map_io(), which does 1), 2) and 3).
Exception is when !pg_ready() (which means either pg_init is running or
requested), then multipath_busy() prevents map_io() being called from
request_fn.
If pg_init is running, it should be ok as long as pg_init_done() does
the right thing when pg_init is completed, I.e.: restart pg_init if
!pg_ready() or call dm_table_run_md_queue_async() to kick map_io().
If pg_init is requested, we have to make sure the request is detected
and pg_init will be started. pg_init is requested in 3 places:
a) __choose_pgpath() in map_io()
b) __choose_pgpath() in multipath_ioctl()
c) pg_init retry in pg_init_done()
a) is ok because map_io() calls __pg_init_all_paths(), which does 2).
b) needs a call to __pg_init_all_paths(), which does 2).
c) needs a call to __pg_init_all_paths(), which does 2).
So this patch removes process_queued_ios() and ensures that
__pg_init_all_paths() is called at the appropriate locations.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
There is no reason why multipath needs to queue requests internally for
queue_if_no_path or pg_init; we should rather push them back onto the
request queue.
And while we're at it we can simplify the conditional statement in
map_io() to make it easier to read.
Since mpath no longer does internal queuing of I/O the table info no
longer emits the internal queue_size. Instead it displays 1 if queuing
is being used or 0 if it is not.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Introduce dm_table_run_md_queue_async() to run the request_queue of the
mapped_device associated with a request-based DM table.
Also add dm_md_get_queue() wrapper to extract the request_queue from a
mapped_device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
This patch moves condition checks as a preparation of following
patches and has no effect on behaviour.
process_queued_ios() is the only caller of __pg_init_all_paths()
and 2 condition checks are moved from outside to inside without
side effects.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Replace rcu_assign_pointer(p, NULL) with RCU_INIT_POINTER(p, NULL).
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure. And in the
case of the NULL pointer, there is no structure to initialize. So,
rcu_assign_pointer(p, NULL) can be safely converted to
RCU_INIT_POINTER(p, NULL).
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Device mapper uses the bio structure's bi_private field as a pointer
to dm_target_io or dm_rq_clone_bio_info. But a bio structure is
embedded in the dm_target_io and dm_rq_clone_bio_info structures, so the
pointer to the structure that contains the bio can be found with the
container_of() macro.
Remove the use of bi_private and use container_of() instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove dm_get_mapinfo() because no target uses it. Targets can allocate
per-bio data using ti->per_bio_data_size, this is much more flexible
than union map_info.
Leave union map_info only for the request-based multipath target's use.
Also delete the unused "unsigned long long ll" field of union map_info.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make the function dm_table_alloc_md_mempools static because it is not
called from another file.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In theory copying the space map root can fail, but in practice it never
does because we're careful to check what size buffer is needed.
But make certain we're able to copy the space map roots before
locking the superblock.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # drop dm-era and dm-cache changes as needed
The persistent-data library used by dm-thin, dm-cache, etc is
transactional. If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.
Atomicity when committing a transaction is achieved by:
a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
disk.
This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.
Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.
As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush(). Now
the unlocking must be done separately.
This issue was discovered by forcing io errors at the crucial time
using dm-flakey.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Discard block size not being equal to cache block size causes data
corruption by erroneously avoiding migrations in issue_copy() because
the discard state is being cleared for a group of cache blocks when it
should not.
Completely remove all code that enabled a distinction between the
cache block size and discard block size.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the discard block size is larger than the cache block size we will
not properly quiesce IO to a region that is about to be discarded. This
results in a race between a cache migration where no copy is needed, and
a write to an adjacent cache block that's within the same large discard
block.
Workaround this by limiting the discard_block_size to cache_block_size.
Also limit the max_discard_sectors to cache_block_size.
A more comprehensive fix that introduces range locking support in the
bio_prison and proper quiescing of a discard range that spans multiple
cache blocks is already in development.
Reported-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org
dm-era is a target that behaves similar to the linear target. In
addition it keeps track of which blocks were written within a user
defined period of time called an 'era'. Each era target instance
maintains the current era as a monotonically increasing 32-bit
counter.
Use cases include tracking changed blocks for backup software, and
partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.
dm-era is primarily expected to be paired with the dm-cache target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Uninlined nested functions can cause crashes when using ftrace, as they don't
follow the normal calling convention and confuse the ftrace function graph
tracer as it examines the stack.
Also, nested functions are supported as a gcc extension, but may fail on other
compilers (e.g. llvm).
Signed-off-by: John Sheu <john.sheu@gmail.com>
gc_gen was a temporary used to recalculate last_gc, but since we only need
bucket->last_gc when gc isn't running (gc_mark_valid = 1), we can just update
last_gc directly.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This was originally added as at optimization that for various reasons isn't
needed anymore, but it does add a lot of nasty corner cases (and it was
responsible for some recently fixed bugs). Just get rid of it now.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This changes the bucket allocation reserves to use _real_ reserves - separate
freelists - instead of watermarks, which if nothing else makes the current code
saner to reason about and is going to be important in the future when we add
support for multiple btrees.
It also adds btree_check_reserve(), which checks (and locks) the reserves for
both bucket allocation and memory allocation for btree nodes; the old code just
kinda sorta assumed that since (e.g. for btree node splits) it had the root
locked and that meant no other threads could try to make use of the same
reserve; this technically should have been ok for memory allocation (we should
always have a reserve for memory allocation (the btree node cache is used as a
reserve and we preallocate it)), but multiple btrees will mean that locking the
root won't be sufficient anymore, and for the bucket allocation reserve it was
technically possible for the old code to deadlock.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
With the locking rework in the last patch, this shouldn't be needed anymore -
btree_node_write_work() only takes b->write_lock which is never held for very
long.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Add a new lock, b->write_lock, which is required to actually modify - or write -
a btree node; this lock is only held for short durations.
This means we can write out a btree node without taking b->lock, which _is_ held
for long durations - solving a deadlock when btree_flush_write() (from the
journalling code) is called with a btree node locked.
Right now just occurs in bch_btree_set_root(), but with an upcoming journalling
rework is going to happen a lot more.
This also turns b->lock is now more of a read/intent lock instead of a
read/write lock - but not completely, since it still blocks readers. May turn it
into a real intent lock at some point in the future.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This isn't a bulletproof fix; btree_node_free() -> bch_bucket_free() puts the
bucket on the unused freelist, where it can be reused right away without any
ordering requirements. It would be better to wait on at least a journal write to
go down before reusing the bucket. bch_btree_set_root() does this, and inserting
into non leaf nodes is completely synchronous so we should be ok, but future
patches are just going to get rid of the unused freelist - it was needed in the
past for various reasons but shouldn't be anymore.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This means the garbage collection code can better check for data and metadata
pointers to the same buckets.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This will potentially save us an allocation when we've got inode/dirent bkeys
that don't fit in the keylist's inline keys.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Change the invalidate tracepoint to indicate how much data we're invalidating,
and change the alloc tracepoints to indicate what offset they're for.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Deadlock happened because a foreground write slept, waiting for a bucket
to be allocated. Normally the gc would mark buckets available for invalidation.
But the moving_gc was stuck waiting for outstanding writes to complete.
These writes used the bcache_wq, the same queue foreground writes used.
This fix gives moving_gc its own work queue, so it was still finish moving
even if foreground writes are stuck waiting for allocation. It also makes
work queue a parameter to the data_insert path, so moving_gc can use its
workqueue for writes.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The on disk bucket gens are allowed to be out of date, when we reuse buckets
that didn't have any live data in them. To deal with this, the initial gc has to
update the bucket gen when we find a pointer gen newer than the bucket's gen.
Unfortunately we weren't doing this for pointers in the journal that we're about
to replay.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
On recovery we weren't correctly keeping track of what journal buckets had open
journal entries, thus it was possible for them to be overwritten until we'd
written all new journal entries.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
In order to avoid wasting cache space a partial block at the end of the
origin device is not cached. Unfortunately, the check for such a
partial block at the end of the origin device was flawed.
Fix accesses beyond the end of the origin device that occured due to
attempted promotion of an undetected partial block by:
- initializing the per bio data struct to allow cache_end_io to work properly
- recognizing access to the partial block at the end of the origin device
- avoiding out of bounds access to the discard bitset
Otherwise, users can experience errors like the following:
attempt to access beyond end of device
dm-5: rw=0, want=20971520, limit=20971456
...
device-mapper: cache: promotion failed; couldn't copy block
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
During demotion or promotion to a cache's >2TB fast device we must not
truncate the cache block's associated sector to 32bits. The 32bit
temporary result of from_cblock() caused a 32bit multiplication when
calculating the sector of the fast device in issue_copy_real().
Use an intermediate 64bit type to store the 32bit from_cblock() to allow
for proper 64bit multiplication.
Here is an example of how this bug manifests on an ext4 filesystem:
EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
This has been a relatively long-standing issue that wasn't nailed down
until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014,
see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html
From that report:
"When decreasing the reference count of a metadata block with its
reference count equals 3, we will call dm_btree_remove() to remove
this enrty from the B+tree which keeps the reference count info in
metadata device.
The B+tree will try to rebalance the entry of the child nodes in each
node it traversed, and the rebalance process contains the following
steps.
(1) Finding the corresponding children in current node (shadow_current(s))
(2) Shadow the children block (issue BOP_INC)
(3) redistribute keys among children, and free children if necessary (issue BOP_DEC)
Since the update of a metadata block's reference count could be
recursive, we will stash these reference count update operations in
smm->uncommitted and then process them in a FILO fashion.
The problem is that step(3) could free the children which is created
in step(2), so the BOP_DEC issued in step(3) will be carried out
before the BOP_INC issued in step(2) since these BOPs will be
processed in FILO fashion. Once the BOP_DEC from step(3) tries to
decrease the reference count of newly shadow block, it will report
failure for its reference equals 0 before decreasing. It looks like we
can solve this issue by processing these BOPs in a FIFO fashion
instead of FILO."
Commit 5b564d80 ("dm space map: disallow decrementing a reference count
below zero") changed the code to report an error for this temporary
refcount decrement below zero. So what was previously a harmless
invalid refcount became a hard failure due to the new error path:
device-mapper: space map common: unable to decrement a reference count below 0
device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22
device-mapper: thin: 253:6: switching pool to read-only mode
This bug is in dm persistent-data code that is common to the DM thin and
cache targets. So any users of those targets should apply this fix.
Fix this by applying recursive space map operations in FIFO order rather
than FILO.
Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801
Reported-by: Apollon Oikonomopoulos <apoikos@debian.org>
Reported-by: edwillam1007@gmail.com
Reported-by: Teng-Feng Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
i) by the time DM core calls the postsuspend hook the dm_noflush flag
has been cleared. So the old thin_postsuspend did nothing. We need to
use the presuspend hook instead.
ii) There was a race between bios leaving DM core and arriving in the
deferred queue.
thin_presuspend now sets a 'requeue' flag causing all bios destined for
that thin to be requeued back to DM core. Then it requeues all held IO,
and all IO on the deferred queue (destined for that thin). Finally
postsuspend clears the 'requeue' flag.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The spin lock in requeue_io() was held for too long, allowing deadlock.
Don't worry, due to other issues addressed in the following "dm thin:
fix noflush suspend IO queueing" commit, this code was never called.
Fix this by taking the spin lock for a much shorter period of time.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ideally a thin pool would never run out of data space; the low water
mark would trigger userland to extend the pool before we completely run
out of space. However, many small random IOs to unprovisioned space can
consume data space at an alarming rate. Adjust your low water mark if
you're frequently seeing "out-of-data-space" mode.
Before this fix, if data space ran out the pool would be put in
PM_READ_ONLY mode which also aborted the pool's current metadata
transaction (data loss for any changes in the transaction). This had a
side-effect of needlessly compromising data consistency. And retry of
queued unserviceable bios, once the data pool was resized, could
initiate changes to potentially inconsistent pool metadata.
Now when the pool's data space is exhausted transition to a new pool
mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data
may not be allocated. This allows users to remove thin volumes or
discard data to recover data space.
The pool is no longer put in PM_READ_ONLY mode in response to the pool
running out of data space. And PM_READ_ONLY mode no longer aborts the
pool's current metadata transaction. Also, set_pool_mode() will now
notify userspace when the pool mode is changed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a thin metadata operation fails the current transaction will abort,
whereby causing potential for IO layers up the stack (e.g. filesystems)
to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
thin metadata's superblock which:
1) requires the user verify the thin metadata is consistent (e.g. use
thin_check, etc)
2) suggests the user verify the thin data is consistent (e.g. use fsck)
The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
to run thin_repair.
On metadata operation failure: abort current metadata transaction, set
pool in read-only mode, and now set the needs_check flag.
As part of this change, constraints are introduced or relaxed:
* don't allow a pool to transition to write mode if needs_check is set
* don't allow data or metadata space to be resized if needs_check is set
* if a thin pool's metadata space is exhausted: the kernel will now
force the user to take the pool offline for repair before the kernel
will allow the metadata space to be extended.
Also, update Documentation to include information about when the thin
provisioning target commits metadata, how it handles metadata failures
and running out of space.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Commit b5330655 ("dm thin: handle metadata failures more consistently")
increased potential for the pool's mode to be changed in response to
metadata operation failures.
When the pool mode is changed it isn't synchronized with the mode in
pool_features stored in the target's context (ti->private) that is used
as the basis for (re)establishing the pool mode during resume via
bind_control_target.
It is important that we synchronize the pool mode when it is changed
otherwise the pool may experience and unexpected mode transition on the
next resume (especially if there was no new table load).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Commit 55494bf294 ("dm snapshot: use dm-bufio") broke snapshots.
Before that 3.14-rc1 commit, loading a snapshot's list of exceptions
involved reading exception areas one by one into ps->area and inserting
those exceptions into the hash table. Commit 55494bf294 changed
it so that dm-bufio with prefetch is used to load exceptions in batchs.
Exceptions are loaded correctly, but ps->area is left uninitialized.
When a new exception is allocated, it is stored in this uninitialized
ps->area which will be written to the disk. This causes metadata
corruption.
Fix this corruption by copying the last area that was read via dm-bufio
into ps->area.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option
move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig.
Doing so fixes indentation for other DM config options.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When remapping a block to the cache's fast device that is larger than
2TB we must not truncate the destination sector to 32bits. The 32bit
temporary result of from_cblock() was being overflowed in
remap_to_cache() due to the logical left shift.
Use an intermediate 64bit type to store the 32bit from_cblock() result
to fix the overflow.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
It was always intended that a user could provide a thin metadata device
that is larger than the max supported by the on-disk format. The extra
space would just go unused.
Unfortunately that never worked. If the user attempted to use a larger
metadata device on creation they would get an error like the following:
device-mapper: space map common: space map too large
device-mapper: transaction manager: couldn't create metadata space map
device-mapper: thin metadata: tm_create_with_sm failed
device-mapper: table: 252:17: thin-pool: Error creating metadata object
device-mapper: ioctl: error adding target to table
Fix this by allowing the initial metadata space map creation to cap its
size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).
Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
the sizeof the disk_bitmap_header. So the supported maximum metadata
size is a bit smaller (reduced from 33423360 to 33292800 sectors).
Lastly, remove the "excess space will not be used" warning message from
get_metadata_dev_size(); it resulted in printing the warning multiple
times. Factor out warn_if_metadata_device_too_big(), call it from
pool_ctr() and maybe_resize_metadata_dev().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
An invalid ioctl will never be valid, irrespective of whether multipath
has active paths or not. So for invalid ioctls we do not have to wait
for multipath to activate any paths, but can rather return an error
code immediately. This fix resolves numerous instances of:
udevd[]: worker [] unexpectedly returned with status 0x0100
that have been seen during testing.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The code was using sectors to count the number of sectors it was zeroing... but
then it passed it to bio_advance()... after it had been set to 0. Amusing...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
fails in thin_ctr(). Otherwise __pool_destroy() will fail because the
pool will still have an open thin device:
device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.
Also, must establish error code if failing thin_ctr() because the pool
is in fail_io mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
When restoring bi_end_io, increase bi_remaining before retrying the bio
to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 003b5c5719 ("block: Convert drivers
to immutable biovecs") broke dm-mirror due to dm-io breakage.
dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC,
DM_IO_VMA) that iterate over pages where the I/O should be performed.
The switch to immutable biovecs changed the DM_IO_BVEC iterator to
DM_IO_BIO. Before this change the iterator stored the pointer to a bio
vector in the dpages structure. The iterator incremented the pointer in
the dpages structure as it advanced over the pages. After the immutable
biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in
the dpages structure and uses bio_advance to change the bio as it
advances.
The problem is that the function dispatch_io stores the content of the
dpages structure into the variable old_pages and restores it before
issuing I/O to each of the devices. Before the change, the statement
"*dp = old_pages;" restored the iterator to its starting position.
After the change, struct dpages holds a pointer to the bio, thus the
statement "*dp = old_pages;" doesn't restore the iterator.
Consequently, in the context of dm-mirror: only the first mirror leg is
written correctly, the kernel locks up when trying to write the other
mirror legs because the number of sectors to write in the where->count
variable doesn't match the number of sectors returned by the iterator.
This patch fixes the bug by partially reverting the original patch - it
changes the code so that struct dpages holds a pointer to the bio vector,
so that the statement "*dp = old_pages;" restores the iterator correctly.
The field "context_u" holds the offset from the beginning of the current
bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 905e51b ("dm thin: commit outstanding data every second")
introduced a periodic commit. This commit occurs regardless of whether
any thin devices have made changes.
Fix the periodic commit to check if any of a pool's thin devices have
changed using dm_pool_changed_this_transaction().
Reported-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
When completing an overwrite bio, in overwrite_endio(), the associated
migration should not be added to the 'completed_migrations' until the
bio's fields are restored with dm_unhook_bio().
Otherwise, do_worker() can race to process 'completed_migrations' before
dm_unhook_bio() -- so the bio's bi_end_io is incorrect. This is
unlikely to cause any problems given the current code but should be
fixed on the basis of correctness.
Also, the cache's spinlock only needs to be held when manipulating the
'completed_migrations' list -- other changes don't need protection.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Commit c9d28d5d ("dm cache: promotion optimisation for writes")
incorrectly placed the 'hook_info' member in the writethrough-only
portion of the per_bio_data structure.
Given that the overwrite optimization may be used for writeback the
'hook_info' member must be placed above the 'cache' member of the
per_bio_data structure. Any members above 'cache' are available from
both writeback and writethrough modes' per_bio_data structure.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
both tagged for -stable
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAUvxBNjnsnt1WYoG5AQLymhAAnKznI2YhFVqK21mpo1l2JDSkwxqIBvBZ
hcW24zF6dNU4cJFmRQqOeL2AkzHWSqX4/J/DGXvI9wFll1CkdNs+UVQJ12Pod3gK
gTDmqRCe/x+bQxrOR5VfyKv0slia12vn9mqfDd2mX41wcr7ceHsdHbemPhgIcUCC
WLERQi9Yn/Eb2+rltTzZ3XaHwIlIozqZ0yRZ6wH45iyuk+uiholEJjJp8LOWpzTe
rKE4s5qd1NAAJsrMHZ11mZWq/4VtgYJ3AcWVXVWqBPxmlI0FnBPU/KVpJkAcrVjB
N6tqmR1/nHcrGlaOgWSS6UfNGVMe3L2HJpaIdjTM65Tdb+WFpEPevTy9qYsLC3Ic
zV/KmErUtSFMJKYBr9YyRnSpXtnSDo8BeRsWJm9ZaA5UV9yUVBNwWDFNFP/Bkqze
v4wLMRj54U5fjRZBq/PaFbk/A2nDCkGHC4uZCgJ+Mwhoo6rxpho/oKBjBBlmpw3q
4Q0yWgZ8F/ZWFUrGzi1TY3tdYrl3yCOpZ3l5aRTtTqlU3aVShIIiKCKDvs2v8l6h
C5igUbnW5BtsMMCOwdULc/lHgN3vMbJEA+7YdmeouDEY5QAk0O6nxan3y+cbtC5u
F+++tkWzSQZJRGhdAxdAXsABYfHiR7Wnft96+iMpnQYbm35CdYYwlOhhl0iI/+Ec
FcpDXOz9faA=
=J3I5
-----END PGP SIGNATURE-----
Merge tag 'md/3.14-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Two bugfixes for md
both tagged for -stable"
* tag 'md/3.14-fixes' of git://neil.brown.name/md:
md/raid5: Fix CPU hotplug callback registration
md/raid1: restore ability for check and repair to fix read errors.
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
register_cpu_notifier(&foobar_cpu_notifier);
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
Cc: linux-raid@vger.kernel.org
Cc: stable@vger.kernel.org (v2.6.32+)
Fixes: 36d1c6476b
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This allows replying only to the requestor portid while still
supporting broadcasting. Pass 0 to portid for the previous behavior.
Signed-off-by: David Fries <David@Fries.net>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 30bc9b5387
md/raid1: fix bio handling problems in process_checks()
Move the bio_reset() to a point before where BIO_UPTODATE is checked,
so that check now always report that the bio is uptodate, even if it is not.
This causes process_check() to sometimes treat read-errors as
successful matches so the good data isn't written out.
This patch preserves the flag until it is needed.
Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
an even worse bug). So suitable for any -stable since 3.10.
Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable@vger.kernel.org (3.10+)
Fixed: 30bc9b5387
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block IO driver changes from Jens Axboe:
- bcache update from Kent Overstreet.
- two bcache fixes from Nicholas Swenson.
- cciss pci init error fix from Andrew.
- underflow fix in the parallel IDE pg_write code from Dan Carpenter.
I'm sure the 1 (or 0) users of that are now happy.
- two PCI related fixes for sx8 from Jingoo Han.
- floppy init fix for first block read from Jiri Kosina.
- pktcdvd error return miss fix from Julia Lawall.
- removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
Michael Opdenacker.
- comment typo fix for the loop driver from Olaf Hering.
- potential oops fix for null_blk from Raghavendra K T.
- two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
an OOM problem and a problem with handling security locked conditions
* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
mg_disk: Spelling s/finised/finished/
null_blk: Null pointer deference problem in alloc_page_buffers
mtip32xx: Correctly handle security locked condition
mtip32xx: Make SGL container per-command to eliminate high order dma allocation
drivers/block/loop.c: fix comment typo in loop_config_discard
drivers/block/cciss.c:cciss_init_one(): use proper errnos
drivers/block/paride/pg.c: underflow bug in pg_write()
drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
drivers/block/sx8.c: use module_pci_driver()
floppy: bail out in open() if drive is not responding to block0 read
bcache: Fix auxiliary search trees for key size > cacheline size
bcache: Don't return -EINTR when insert finished
bcache: Improve bucket_prio() calculation
bcache: Add bch_bkey_equal_header()
bcache: update bch_bkey_try_merge
bcache: Move insert_fixup() to btree_keys_ops
bcache: Convert sorting to btree_keys
bcache: Convert debug code to btree_keys
bcache: Convert btree_iter to struct btree_keys
bcache: Refactor bset_tree sysfs stats
...
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
The BUG_ON at the end of __bch_btree_mark_key can be triggered due to
an integer overflow error:
BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
...
SET_GC_SECTORS_USED(g, min_t(unsigned,
GC_SECTORS_USED(g) + KEY_SIZE(k),
(1 << 14) - 1));
BUG_ON(!GC_SECTORS_USED(g));
In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide.
While the SET_ code tries to ensure that the field doesn't overflow by
clamping it to (1<<14)-1 == 16383, this is incorrect because 16383
requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() =
8192, the SET_ statement tries to store 8192 into a 13-bit field. In
a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON.
Therefore, create a field width constant and a max value constant, and
use those to create the bitfield and check the inputs to
SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have
BUG_ON checks for too-large values, but that's a separate patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All bug fixes, two tagged for -stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAUuG8WTnsnt1WYoG5AQKW7A//V93TYUiUAG6zNUFrNZjuoXP0ym3jlpkH
eIIFdcV7rr0Irgtd8+s9cW8Cjsbq3d/vMbFlwP1Co32mnCnFFojeKtCvM9GqkYrH
o4Zr1nVAVYzKO4awByK3wBT9WbzEc/XlDgYQIpZExIYeZdzLOm6HyvlRbcE86Ug5
QoGYOUlLu4LUZmFgB9zQ7JM0GACV5pS1afSObtACj2t2x5GVHNU84u+M+D8urPXO
wnf+AIAzquh5F+8MX+DxmMEUaSzUHf8fXOM3jYVbzPI71SpaHssL4SwBn+j4I/8/
SCSqeIh7qMSuqUy63/iHKCy5qAgNuRdL9fYlOTkpxzHm81Ddj8u7fySsApggVOa2
yeKTkSRlsMFeu+LiGKNi/fINVxboaoYJVZ2DTNtKxSuW2VL2aPNz1Qjq4QnR3nSI
LpaB3VeVKdMsH8Em1a8cgZWcjo5YFAcNtUnJq2fvj9VZ3SJNw4ZoKDL+l718iGao
xIwAXMSafAHQVAAaNVFkwrea13TeOyxikY5Ra4vWfm+Fw8TzmYq5DqO0zaILwdAJ
2FnNj2/2y3hk2K7qBcEvjjEakxPlTwzrzxZMfJDRMuQLqvrjbXiMGOnWzgl1D/9x
4/uPjeFZLG7byxmIyg4Y83NkPgkWnRPpGK98r26pUH1UgnRF0a5aUFXQk7rsQrU6
noRkZ9EPD8s=
=d81E
-----END PGP SIGNATURE-----
Merge tag 'md/3.14' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"All bug fixes, two tagged for -stable"
* tag 'md/3.14' of git://neil.brown.name/md:
md/raid5: close recently introduced race in stripe_head management.
md/raid5: fix long-standing problem with bitmap handling on write failure.
md: check command validity early in md_ioctl().
md: ensure metadata is writen after raid level change.
md/raid10: avoid fullsync when not necessary.
md: allow a partially recovered device to be hot-added to an array.
md: Change handling of save_raid_disk and metadata update during recovery.
A lot of attention was paid to improving the thin-provisioning target's
handling of metadata operation failures and running out of space. A new
'error_if_no_space' feature was added to allow users to error IOs rather
than queue them when either the data or metadata space is exhausted.
Additional fixes/features include:
- a few fixes to properly support thin metadata device resizing
- a solution for reliably waiting for a DM device's embedded kobject to
be released before destroying the device
- old dm-snapshot is updated to use the dm-bufio interface to take
advantage of readahead capabilities that improve snapshot activation
- new dm-cache target tunables to control how quickly data is promoted
to the cache (fast) device
- improved write efficiency of cluster mirror target by combining
userspace flush and mark requests
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJS4GClAAoJEMUj8QotnQNacdEH/2ES5k5itUQRY9jeI+u2zYNP
vdsRTYf+97+B3jpRvpWbMt4kxT2tjaQbkxJ+iKRHy2MBLFUgq8ruH1RS/Q5VbDeg
6i6ol8mpNxhlvo/KTMxXqRcWDSxShiMfhz2lXC2bJ7M4sP/iiH85s4Pm4YQ59jpd
OIX7qj36m/cV/le9YQbexJEEsaj+3genbzL26wyyvtG/rT9fWnXa7clj2gqTdToG
YCEBCRf5FH9X6W/Oc50nMw5n2dt/MRmPre/MAlOjemeaosB0WJiKaswM25rnvHp0
JnhxQ2K2C5KIKAWIfwPOImdb9zWW7p1dIRLsS8nHBUQr0BF5VRkmvpnYH4qBtcc=
=e7e0
-----END PGP SIGNATURE-----
Merge tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device-mapper changes from Mike Snitzer:
"A lot of attention was paid to improving the thin-provisioning
target's handling of metadata operation failures and running out of
space. A new 'error_if_no_space' feature was added to allow users to
error IOs rather than queue them when either the data or metadata
space is exhausted.
Additional fixes/features include:
- a few fixes to properly support thin metadata device resizing
- a solution for reliably waiting for a DM device's embedded kobject
to be released before destroying the device
- old dm-snapshot is updated to use the dm-bufio interface to take
advantage of readahead capabilities that improve snapshot
activation
- new dm-cache target tunables to control how quickly data is
promoted to the cache (fast) device
- improved write efficiency of cluster mirror target by combining
userspace flush and mark requests"
* tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
dm log userspace: allow mark requests to piggyback on flush requests
dm space map metadata: fix bug in resizing of thin metadata
dm cache: add policy name to status output
dm thin: fix pool feature parsing
dm sysfs: fix a module unload race
dm snapshot: use dm-bufio prefetch
dm snapshot: use dm-bufio
dm snapshot: prepare for switch to using dm-bufio
dm snapshot: use GFP_KERNEL when initializing exceptions
dm cache: add block sizes and total cache blocks to status output
dm btree: add dm_btree_find_lowest_key
dm space map metadata: fix extending the space map
dm space map common: make sure new space is used during extend
dm: wait until embedded kobject is released before destroying a device
dm: remove pointless kobject comparison in dm_get_from_kobject
dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
dm cache policy mq: introduce three promotion threshold tunables
dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD
dm thin: fix set_pool_mode exposed pool operation races
dm thin: eliminate the no_free_space flag
...
In the cluster evironment, cluster write has poor performance because
userspace_flush() has to contact a userspace program (cmirrord) for
clear/mark/flush requests. But both mark and flush requests require
cmirrord to communicate the message to all the cluster nodes for each
flush call. This behaviour is really slow.
To address this we now merge mark and flush requests together to reduce
the kernel-userspace-kernel time. We allow a new directive,
"integrated_flush" that can be used to instruct the kernel log code to
combine flush and mark requests when directed by userspace. If not
directed by userspace (due to an older version of the userspace code
perhaps), the kernel will function as it did previously - preserving
backwards compatibility. Additionally, flush requests are performed
lazily when only clear requests exist.
Signed-off-by: Dongmao Zhang <dmzhang@suse.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull cgroup updates from Tejun Heo:
"The bulk of changes are cleanups and preparations for the upcoming
kernfs conversion.
- cgroup_event mechanism which is and will be used only by memcg is
moved to memcg.
- pidlist handling is updated so that it can be served by seq_file.
Also, the list is not sorted if sane_behavior. cgroup
documentation explicitly states that the file is not sorted but it
has been for quite some time.
- All cgroup file handling now happens on top of seq_file. This is
to prepare for kernfs conversion. In addition, all operations are
restructured so that they map 1-1 to kernfs operations.
- Other cleanups and low-pri fixes"
* 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
cgroup: trivial style updates
cgroup: remove stray references to css_id
doc: cgroups: Fix typo in doc/cgroups
cgroup: fix fail path in cgroup_load_subsys()
cgroup: fix missing unlock on error in cgroup_load_subsys()
cgroup: remove for_each_root_subsys()
cgroup: implement for_each_css()
cgroup: factor out cgroup_subsys_state creation into create_css()
cgroup: combine css handling loops in cgroup_create()
cgroup: reorder operations in cgroup_create()
cgroup: make for_each_subsys() useable under cgroup_root_mutex
cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
cgroup: unify pidlist and other file handling
cgroup: replace cftype->read_seq_string() with cftype->seq_show()
cgroup: attach cgroup_open_file to all cgroup files
cgroup: generalize cgroup_pidlist_open_file
cgroup: unify read path so that seq_file is always used
cgroup: unify cgroup_write_X64() and cgroup_write_string()
cgroup: remove cftype->read(), ->read_map() and ->write()
hugetlb_cgroup: convert away from cftype->read()
...
As release_stripe and __release_stripe decrement ->count and then
manipulate ->lru both under ->device_lock, it is important that
get_active_stripe() increments ->count and clears ->lru also under
->device_lock.
However we currently list_del_init ->lru under the lock, but increment
the ->count outside the lock. This can lead to races and list
corruption.
So move the atomic_inc(&sh->count) up inside the ->device_lock
protected region.
Note that we still increment ->count without device lock in the case
where get_free_stripe() was called, and in fact don't take
->device_lock at all in that path.
This is safe because if the stripe_head can be found by
get_free_stripe, then the hash lock assures us the no-one else could
possibly be calling release_stripe() at the same time.
Fixes: 566c09c534
Cc: stable@vger.kernel.org (3.13)
Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This bug was introduced in commit 7e664b3dec ("dm space map metadata:
fix extending the space map").
When extending a dm-thin metadata volume we:
- Switch the space map into a simple bootstrap mode, which allocates
all space linearly from the newly added space.
- Add new bitmap entries for the new space
- Increment the reference counts for those newly allocated bitmap
entries
- Commit changes to disk
- Switch back out of bootstrap mode.
But, the disk commit may allocate space itself, if so this fact will be
lost when switching out of bootstrap mode.
The bug exhibited itself as an error when the bitmap_root, with an
erroneous ref count of 0, was subsequently decremented as part of a
later disk commit. This would cause the disk commit to fail, and thinp
to enter read_only mode. The metadata was not damaged (thin_check
passed).
The fix is to put the increments + commit into a loop, running until
the commit has not allocated extra space. In practise this loop only
runs twice.
With this fix the following device mapper testsuite test passes:
dmtest run --suite thin-provisioning -n thin_remove_works_after_resize
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # depends on commit 7e664b3dec
Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc. This is primarily being done for
the cgroups filesystem, but the goal is to also move debugfs to it when
it is ready, solving all of the known issues in that filesystem as well.
The code isn't completed yet, but all should be stable now (there is a
big section that was reverted due to problems found when testing.)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be using
soon (it's in this tree to make merges after -rc1 easier.)
All of this has been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlLdh0cACgkQMUfUDdst+ylv4QCfeDKDgLo4LsaBIIrFSxLoH/c7
UUsAoMPRwA0h8wy+BQcJAg4H4J4maKj3
=0pc0
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc)
This is primarily being done for the cgroups filesystem, but the goal
is to also move debugfs to it when it is ready, solving all of the
known issues in that filesystem as well. The code isn't completed
yet, but all should be stable now (there is a big section that was
reverted due to problems found when testing)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be
using soon (it's in this tree to make merges after -rc1 easier)
All of this has been in linux-next with no reported issues"
* tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
kernfs: associate a new kernfs_node with its parent on creation
kernfs: add struct dentry declaration in kernfs.h
kernfs: fix get_active failure handling in kernfs_seq_*()
Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
Revert "kernfs: remove KERNFS_REMOVED"
Revert "kernfs: restructure removal path to fix possible premature return"
Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
Revert "kernfs: remove kernfs_addrm_cxt"
Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
kernfs: remove unnecessary NULL check in __kernfs_remove()
drivers/base: provide an infrastructure for componentised subsystems
...
The cache's policy may have been established using the "default" alias,
which is currently the "mq" policy but the default policy may change in
the future. It is useful to know exactly which policy is being used.
Add a 'real' member to the dm_cache_policy_type structure and have the
"default" dm_cache_policy_type point to the real "mq"
dm_cache_policy_type. Update dm_cache_policy_get_name() to check if
real is set, if so report the name of the real policy (not the alias).
Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 787a996cb2 ("dm thin: add error_if_no_space feature")
mistakenly forgot to increase the number of feature args supported.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices. However if the write wasn't fully successful we
should not clear the bit. If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.
This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing. But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.
So: set the flag when a write fails.
This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.
Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Verify that the cmd parameter passed to md_ioctl() is valid before
doing anything.
This fixes mddev->hold_active being set to 0 when an invalid ioctl
command is passed to md_ioctl() before the array has been configured.
Clearing mddev->hold_active in that case can lead to a livelock
situation when an invalid ioctl number is given to md_ioctl() by a
process when the mddev is currently being opened by another process:
Process 1 Process 2
--------- ---------
md_alloc()
mddev_find()
-> returns a new mddev with
hold_active == UNTIL_IOCTL
add_disk()
-> sends KOBJ_ADD uevent
(sees KOBJ_ADD uevent for device)
md_open()
md_ioctl(INVALID_IOCTL)
-> returns ENODEV and clears
mddev->hold_active
md_release()
md_put()
-> deletes the mddev as
hold_active is 0
md_open()
mddev_find()
-> returns a newly
allocated mddev with
mddev->gendisk == NULL
-> returns with ERESTARTSYS
(kernel restarts the open syscall)
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: NeilBrown <neilb@suse.de>
All of these fix real bugs the people have hit, and are tagged
for -stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAUtYZqznsnt1WYoG5AQK50g//XuqVR/esIpGR+knf+1sD3Zk85Vp33kGL
2UfbQbi40q/uLjBhJhOSkx/sYGw1Eo255vNX+yMVjYT9F+xbhI8vlLfecqx5Fk5J
M+JH1sM7E2T79boFLoOBGSl/qppsQsPHa3p87FmFHQrrAuEMIbFiP98MnQjdSiv4
Cu9cAR7x7njepHeMXBFiV7URaYtCHAXR9iMdkebkKIFlfND8w2QYD+LWo3SzBKs9
jTrSBJRpXLHE+bZLOQPhAryb7nWkcT1R7N0vsVMQKcq1o6ZiRNnk/B9xNtV34hkc
5zwTPe/d5AsV6Tsxg0dSs7xcBn/A+F5lg8fzdOhyE1F13COmB7sepjPTMPAy/oP1
zjyPwnnWkHMDUW2usf3aqPMt+LGMofRCJHXjkqpMgIWQ96SQUY8F9PPxchkUCsx/
A38I+vXl2jGDHh/DFSduef3sDOF6TYyKyLteJftyny96dc1RutrZSbHPdrkDz1YQ
6zcyvpv0FexiXITrLg70FG8fnRMK91ZfHrmuzVP7tpm2TyeIfDriLhTAIXAcXHOT
l22a1bNj4shFfztnD0CbH6nY/iJM7ov0x5+IyG5/iYbipon02MenQeV9km6JVwQb
OCGHYCTswiFSduX1E1ru52dHXifbANWgzcUH0sjGQ0YZNmxvPRBWDjB1H2J1auzW
J8T10qimw1w=
=uvyl
-----END PGP SIGNATURE-----
Merge tag 'md/3.13-fixes' of git://neil.brown.name/md
Pull late md fixes from Neil Brown:
"Half a dozen md bug fixes.
All of these fix real bugs the people have hit, and are tagged for
-stable. Sorry they are late .... Christmas holidays and all that.
Hopefully they can still squeak into 3.13"
* tag 'md/3.13-fixes' of git://neil.brown.name/md:
md: fix problem when adding device to read-only array with bitmap.
md/raid10: fix bug when raid10 recovery fails to recover a block.
md/raid5: fix a recently broken BUG_ON().
md/raid1: fix request counting bug in new 'barrier' code.
md/raid10: fix two bugs in handling of known-bad-blocks.
md/raid5: Fix possible confusion when multiple write errors occur.
This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.
The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).
To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.
The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
This patch modifies dm-snapshot so that it prefetches the buffers when
loading the exceptions.
The number of buffers read ahead is specified in the DM_PREFETCH_CHUNKS
macro. The current value for DM_PREFETCH_CHUNKS (12) was found to
provide the best performance on a single 15k SCSI spindle. In the
future we may modify this default or make it configurable.
Also, introduce the function dm_bufio_set_minimum_buffers to setup
bufio's number of internal buffers before freeing happens. dm-bufio may
hold more buffers if enough memory is available. There is no guarantee
that the specified number of buffers will be available - if you need a
guarantee, use the argument reserved_buffers for
dm_bufio_client_create.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use dm-bufio for initial loading of the exceptions.
Introduce a new function dm_bufio_forget that frees the given buffer.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Change the functions get_exception, read_exception and insert_exceptions
so that ps->area is passed as an argument.
This patch doesn't change any functionality, but it refactors the code
to allow for a cleaner switch over to using dm-bufio.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The list of initial exceptions is loaded in the target constructor. We
are allowed to allocate memory with GFP_KERNEL at this point. So,
change alloc_completed_exception to use GFP_KERNEL when being called
from the constructor.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
level_store() currently does not make sure the metadata is
updates to reflect the new raid level. It simply sets MD_CHANGE_DEVS.
Any level with a ->thread will quickly notice this and update the
metadata. However RAID0 and Linear do not have a thread so no
metadata update happens until the array is stopped. At that point the
metadata is written.
This is later that we would like. While the delay doesn't risk any
data it can cause confusion. So if there is no md thread, immediately
update the metadata after a level change.
Reported-by: Richard Michael <rmichael@edgeofthenet.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This is the raid10 equivalent of
commit 4f0a5e012c
MD RAID1: Further conditionalize 'fullsync'
If a device in a newly assembled array is not fully recovered we
currently do a fully resync by don't need to.
Signed-off-by: NeilBrown <neilb@suse.de>
When adding a new device into an array it is normally important to
clear any stale data from ->recovery_offset else the new device may
not be recovered properly.
However when re-adding a device which is known to be nearly in-sync,
this is not needed and can be detrimental. The (bitmap-based)
resync will still happen, and further recovery is only needed from
where-ever it was already up to.
So if save_raid_disk is set, signifying a re-add, don't clear
->recovery_offset.
Signed-off-by: NeilBrown <neilb@suse.de>
Since commit d70ed2e4fa
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However updates some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date device
record that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously it doesn't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong to not update all the metadata together. Do that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.
If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.
However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update. We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.
This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag. If that is set,
then the device will not be added to a read-only array.
Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fa
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>
commit e875ecea26
md/raid10 record bad blocks as needed during recovery.
added code to the "cannot recover this block" path to record a bad
block rather than fail the whole recovery.
Unfortunately this new case was placed *after* r10bio was freed rather
than *before*, yet it still uses r10bio.
This is will crash with a null dereference.
So move the freeing of r10bio down where it is safe.
Cc: stable@vger.kernel.org (v3.1+)
Fixes: e875ecea26
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
commit 6d183de407
md/raid5: fix newly-broken locking in get_active_stripe.
simplified a BUG_ON, but removed too much so now it sometimes fires
when it shouldn't.
When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
special list while multiple stripe_heads are collected, or it might
not be on any list, even a 'free' list when the refcount is zero. As
long as STRIPE_EXPANDING is set, it will be found and added back to a
list eventually.
So both of the BUG_ONs which test for the ->lru being empty or not
need to avoid the case where STRIPE_EXPANDING is set.
The patch which broke this was marked for -stable, so this patch needs
to be applied to any branch that received 6d183de4
Fixes: 6d183de407
Cc: stable@vger.kernel.org (any release to which above was applied)
Signed-off-by: NeilBrown <neilb@suse.de>
The new iobarrier implementation in raid1 (which keeps normal writes
and resync activity separate) counts every request what is not before
the current resync point in either next_window_requests or
current_window_requests.
It flags that the request is counted by setting ->start_next_window.
allow_barrier follows this model exactly and decrements one of the
*_window_requests if and only if ->start_next_window is set.
However wait_barrier(), which increments *_window_requests uses a
slightly different test for setting -.start_next_window (which is set
from the return value of this function).
So there is a possibility of the counts getting out of sync, and this
leads to the resync hanging.
So change wait_barrier() to return a non-zero value in exactly the
same cases that it increments *_window_requests.
But was introduced in 3.13-rc1.
Reported-by: Bruno Wolff III <bruno@wolff.to>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061
Fixes: 79ef3a8aa1
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we discover a bad block when reading we split the request and
potentially read some of it from a different device.
The code path of this has two bugs in RAID10.
1/ we get a spin_lock with _irq, but unlock without _irq!!
2/ The calculation of 'sectors_handled' is wrong, as can be clearly
seen by comparison with raid1.c
This leads to at least 2 warnings and a probable crash is a RAID10
ever had known bad blocks.
Cc: stable@vger.kernel.org (v3.1+)
Fixes: 856e08e237
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
commit 5d8c71f9e5
md: raid5 crash during degradation
Fixed a crash in an overly simplistic way which could leave
R5_WriteError or R5_MadeGood set in the stripe cache for devices
for which it is no longer relevant.
When those devices are removed and spares added the flags are still
set and can cause incorrect behaviour.
commit 14a75d3e07
md/raid5: preferentially read from replacement device if possible.
Fixed the same bug if a more effective way, so we can now revert
the original commit.
Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
Fixes: 5d8c71f9e5
Signed-off-by: NeilBrown <neilb@suse.de>
Trivial: remove the few stray references to css_id, which itself
was removed in v3.13's 2ff2a7d03b "cgroup: kill css_id".
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Improve cache_status to emit:
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
...
Adding the block sizes allows for easier calculation of the overall size
of both the metadata and cache devices. Adding <#total cache blocks>
provides useful context for how much of the cache is used.
Unfortunately these additions to the status will require updates to
users' scripts that monitor the cache status. But these changes help
provide more comprehensive information about the cache device and will
simplify tools that are being developed to manage dm-cache devices --
because they won't need to issue 3 operations to cobble together the
information that we can easily provide via a single status ioctl.
While updating the status documentation in cache.txt spaces were
tabify'd.
Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
dm_btree_find_lowest_key is the reciprocal of dm_btree_find_highest_key.
Factor out common code for dm_btree_find_{highest,lowest}_key.
dm_btree_find_lowest_key is needed for an upcoming DM target, as such it
is best to get this interface in place.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We need to return -EINTR after a split because we invalidated iterators
(and freed the btree node) - but if we were finished inserting, we don't
want to redo the traversal.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
When deciding what order to reuse buckets we take into account both the bucket's
priority (which indicates lru order) and also the amount of live data in that
bucket. The way they were scaled together wasn't as correct as it could be...
this patch improves and documents it.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Checks if two keys have equivalent header fields.
(good enough for replacement or merging)
Used in bch_bkey_try_merge, and replacing a key
in the btree.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Added generic header checks to bch_bkey_try_merge,
which then calls the bkey specific function
Removed extraneous checks from bch_extent_merge
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
We're in the process of turning bset.c into library code, so none of the code in
that file should know about struct cache_set or struct btree - so, move the
btree traversal part of the stats code to sysfs.c.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
More disentangling bset.c from the rest of the bcache code - soon, the
sorting routines won't have any dependencies on any outside structs.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Only use extent comparison for comparing extents, so we're not using
START_KEY() on other key types (i.e. btree pointers)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Getting away from KEY_PTRS and moving toward KEY_U64s - and getting rid of magic
2s
Also - split out the part that checks against journal entry size so as to avoid
a dependancy on struct cache_set in bset.c
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Now that we've got code for raid5/6 stripe awareness, bcache just needs
to know about the stripes and when writing partial stripes is expensive
- we probably don't want to enable this optimization for raid1 or 10,
even though they have stripes. So add a flag to queue_limits.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This error path shouldn't have been hit in practice.. and we've got reworked
reserve code coming soon so that it shouldn't _ever_ be bit... but if we've got
code for this error path it should be correct.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We need a reserve for allocating buckets for new btree nodes - and now that
we've got multiple btrees, it really needs to be per btree.
This reworks the reserves so we've got separate freelists for each reserve
instead of watermarks, which seems to make things a bit cleaner, and it adds
some code so that btree_split() can make sure the reserve is available before it
starts.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Unnecessary since a bucket that has dirty pointers pointing to it can
never be invalidated - and skipping it is a measurable performance
boost, since the bucket gen will usually be a cache miss.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We were unnecessarily waiting on a journal write to complete when we just needed
to start a journal write and start setting up the next one.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The real fix is where we check the bytes we need against how much is
remaining - we also need to check for a journal entry bigger than our
buffer, we'll never write those and it would be bad if we tried to read
one.
Also improve the diagnostic messages.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.
The code that read the extents back in thus not bother with splitting extents -
if it saw an extent that ovelapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.
I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
When extending a metadata space map we should do the first commit whilst
still in bootstrap mode -- a mode where all blocks get allocated in the
new area.
That way the commit overhead is allocated from the newly added space.
Otherwise we risk running out of space.
With this fix, and the previous commit "dm space map common: make sure
new space is used during extend", the following device mapper testsuite
test passes:
dmtest run --suite thin-provisioning -n /resize_metadata_no_io/
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.
Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
-> sm_ll_extend
-> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
=> returns -ENOSPC because smm->begin == smm->ll.nr_blocks
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
There may be other parts of the kernel holding a reference on the dm
kobject. We must wait until all references are dropped before
deallocating the mapped_device structure.
The dm_kobject_release method signals that all references are dropped
via completion. But dm_kobject_release doesn't free the kobject (which
is embedded in the mapped_device structure).
This is the sequence of operations:
* when destroying a DM device, call kobject_put from dm_sysfs_exit
* wait until all users stop using the kobject, when it happens the
release method is called
* the release method signals the completion and should return without
delay
* the dm device removal code that waits on the completion continues
* the dm device removal code drops the dm_mod reference the device had
* the dm device removal code frees the mapped_device structure that
contains the kobject
Using kobject this way should avoid the module unload race that was
mentioned at the beginning of this thread:
https://lkml.org/lkml/2014/1/4/83
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The comparison is always true and the compiler optimizes it out anyway.
Milan offered additional context relative to the original commit
784aae735d ("dm: add name and uuid to sysfs") which introduced the code:
"I think it is just relict of some experiments before I committed this
simple embedded sysfs kobj handling".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache.
This patch introduces three new tunables that allow you to tweak the
promotion threshold by adding a small value. These adjustments depend
on the io type:
read_promote_adjustment: READ io, default 4
write_promote_adjustment: WRITE io, default 8
discard_promote_adjustment: READ/WRITE io to a discarded block, default 1
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The pool mode must not be switched until after the corresponding pool
process_* methods have been established. Otherwise, because
set_pool_mode() isn't interlocked with the IO path for performance
reasons, the IO path can end up executing process_* operations that
don't match the mode. This patch eliminates problems like the following
(as seen on really fast PCIe SSD storage when transitioning the pool's
mode from PM_READ_ONLY to PM_WRITE):
kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event.
kernel: device-mapper: thin: 253:2: no free data space available.
kernel: device-mapper: thin: 253:2: switching pool to read-only mode
kernel: device-mapper: thin: 253:2: switching pool to write mode
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]()
...
kernel: Workqueue: dm-thin do_worker [dm_thin_pool]
kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3
kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800
kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3
kernel: Call Trace:
kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e
kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0
kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20
kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]
kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool]
kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool]
kernel: [<ffffffff81067823>] process_one_work+0x183/0x490
kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0
kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160
kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: ---[ end trace 3f00528e08ffa55c ]---
kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!?
dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY);
at the top of handle_unserviceable_bio(). And as the additional
debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like
expected, it was already PM_WRITE, yet pool->process_bio was still set
to process_bio_read_only().
Also, while fixing this up, reduce logging of redundant pool mode
transitions by checking new_mode is different from old_mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The pool's error_if_no_space flag can easily serve the same purpose that
no_free_space did, namely: control whether handle_unserviceable_bio()
will error a bio or requeue it.
This is cleaner since error_if_no_space is established when the pool's
features are processed during table load. So it avoids managing the
no_free_space flag by taking the pool's spinlock.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the pool runs out of data or metadata space, the pool can either
queue or error the IO destined to the data device. The default is to
queue the IO until more space is added.
An admin may now configure the pool to error IO when no space is
available by setting the 'error_if_no_space' feature when loading the
thin-pool table.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Now that we switch the pool to read-only mode when the data device runs
out of space it causes active writers to get IO errors once we resume
after resizing the data device.
If no_free_space is set, save bios to the 'retry_on_resume_list' and
requeue them on resume (once the data or metadata device may have been
resized).
With this patch the resize_io test passes again (on slower storage):
dmtest run --suite thin-provisioning -n /resize_io/
Later patches fix some subtle races associated with the pool mode
transitions done as part of the pool's -ENOSPC handling. These races
are exposed on fast storage (e.g. PCIe SSD).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Factor out_of_data_space() out of alloc_data_block(). Eliminate the use
of 'no_free_space' as a latch in alloc_data_block() -- this is no longer
needed now that we switch to read-only mode when we run out of data or
metadata space. In a later patch, the 'no_free_space' flag will be
eliminated entirely (in favor of checking metadata rather than relying
on a transient flag).
Move no metdata space handling into metdata_operation_failed(). Set
no_free_space when metadata space is exhausted too. This is useful,
because it offers consistency, for the following patch that will requeue
data IOs if no_free_space.
Also, rename no_space() to retry_bios_on_resume().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Introduce metadata_operation_failed() wrappers, around set_pool_mode(),
to assist with improving the consistency of how metadata failures are
handled. Logging is improved and metadata operation failures trigger
read-only mode immediately.
Also, eliminate redundant set_pool_mode() calls in the two
alloc_data_block() caller's error paths.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Factor check_low_water_mark() out of alloc_data_block().
Change a couple unsigned flags in the pool structure to bool.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mappings could be processed in descending logical block order,
particularly if buffered IO is used. This could adversely affect the
latency of IO processing. Fix this by adding mappings to the end of the
'prepared_mappings' and 'prepared_discards' lists.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Also, move 'err' member in dm_thin_new_mapping structure to eliminate 4
byte hole (reduces size from 88 bytes to 80).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
DM's persistent-data library is now used my multiple targets so
exclusive references to "pool" or "thin provisioning" need to be
cleaned up. Adjust Kconfig's DM_DEBUG_BLOCK_STACK_TRACING text
and remove "pool" from a block manager error message.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The "unable to allocate new metadata block" error can be a particularly
verbose error if there is a systemic issue with the metadata device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Starting with commit c0820cf5ad ("dm: introduce per_bio_data"),
device mapper has the capability to pre-allocate a target-specific
structure with the bio.
This patch changes dm-delay to use this facility instead of a slab cache
and mempool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A device mapper table is allocated in the following way:
* The function dm_table_create is called, it gets the number of targets
as an argument -- it allocates a targets array accordingly.
* For each target, we call dm_table_add_target.
If we add more targets than were specified in dm_table_create, the
function dm_table_add_target reallocates the targets array. However,
this reallocation code is wrong - it moves the targets array to a new
location, while some target constructors hold pointers to the array in
the old location.
The following DM target drivers save the pointer to the target
structure, so they corrupt memory if the target array is moved:
multipath, raid, mirror, snapshot, stripe, switch, thin, verity.
Under normal circumstances, the reallocation function is not called
(because dm_table_create is called with the correct number of targets),
so the buggy reallocation code is not used.
Prior to the fix "dm table: fail dm_table_create on dm_round_up
overflow", the reallocation code could only be used in case the user
specifies too large a value in param->target_count, such as 0xffffffff.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time. The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity. In this case, the shared flag would be true.
But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists). This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.
To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used(). If the
reference count for a block is now zero the discard is allowed to be
passed down.
Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Pull block fixes from Jens Axboe:
- fix for a memory leak on certain unplug events
- a collection of bcache fixes from Kent and Nicolas
- a few null_blk fixes and updates form Matias
- a marking of static of functions in the stec pci-e driver
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: support submit_queues on use_per_node_hctx
null_blk: set use_per_node_hctx param to false
null_blk: corrections to documentation
null_blk: warning on ignored submit_queues param
null_blk: refactor init and init errors code paths
null_blk: documentation
null_blk: mem garbage on NUMA systems during init
drivers: block: Mark the functions as static in skd_main.c
bcache: New writeback PD controller
bcache: bugfix for race between moving_gc and bucket_invalidate
bcache: fix for gc and writeback race
bcache: bugfix - moving_gc now moves only correct buckets
bcache: fix for gc crashing when no sectors are used
bcache: Fix heap_peek() macro
bcache: Fix for can_attach_cache()
bcache: Fix dirty_data accounting
bcache: Use uninterruptible sleep in writeback
bcache: kthread don't set writeback task to INTERUPTIBLE
block: fix memory leaks on unplugging block device
bcache: fix sparse non static symbol warning
The old writeback PD controller could get into states where it had throttled all
the way down and take way too long to recover - it was too complicated to really
understand what it was doing.
This rewrites a good chunk of it to hopefully be simpler and make more sense,
and it also pays more attention to units which should make the behaviour a bit
easier to understand.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There is a possibility for a bucket to be invalidated by the allocator
while moving_gc was copying it's contents to another bucket, if the
bucket only held cached data. To prevent this moving checks for
a stale ptr (to an invalidated bucket), before and after reads.
It it finds one, it simply ignores moving that data. This only
affects bcache if the moving_gc was turned on, note that it's
off by default.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Garbage collector needs to check keys in the writeback keybuf to
make sure it's not invalidating buckets to which the writeback
keys point to.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Removed gc_move_threshold because picking buckets only by
threshold could lead moving extra buckets (ei. if there are
buckets at the threshold that aren't supposed to be moved
do to space considerations).
This is replaced by a GC_MOVE bit in the gc_mark bitmask.
Now only marked buckets get moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes
necessary that got squashed in with other patches; the full patch against 3.10
is 408cc2f47eeac93a, available at:
git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a46036..4a12b2f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree *b, struct bkey *insert,
if (KEY_START(k) > KEY_START(insert) + sectors_found)
goto check_failed;
- if (KEY_PTRS(replace_key) != KEY_PTRS(k))
+ if (KEY_PTRS(k) != KEY_PTRS(replace_key) ||
+ KEY_DIRTY(k) != KEY_DIRTY(replace_key))
goto check_failed;
/* skip past gen */
at the beginning (schedule_timout_interuptible) and others
do his on their own
This prevents wrong load average calculation (load of 1 per thread)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-provisioning
and caching targets as a result of increased regression testing using
the device-mapper-test-suite (dmts). The most notable of these are the
reference counting fixes for the space map btree that is used by the
dm-array interface -- without these the dm-cache metadata will leak,
resulting in dm-cache devices running out of metadata blocks. Also,
some important fixes related to the thin-provisioning target's
transition to read-only mode on error.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSq2pVAAoJEMUj8QotnQNaAzEH/0qpz06SVd3+oTbsWeWxXwax
B0zIZ4HPnaD9FDfSWN+6IdQ6Rkn51cspMBHPxWX0I8pmqOTg7MUM7wbaZZaKT3UW
aDlibzG1O3zBGbkr+Qhbh89fJK5/3xnZXlK0hOyaNI2vz1rA6RThZ2hIeak9xQYr
7USz2bjSMXchwebb0Z01CrRnOkhUFg6yUJURIEu4XFCJmDM/PM9zKogLGYKuZedK
pZePnZhwgkLMn7f9l4lPHk3EcWeD4Nf9WR0lS4eKdbYsvMxm1QE7BpDlVJD0Asjg
/SVOVkgygW18TINMHuw1K0DPdSCo5UZF2HRj1QUD/X4N2BSkio+E1qwF6kT4ltk=
=0Ix+
-----END PGP SIGNATURE-----
Merge tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"A set of device-mapper fixes for 3.13.
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-
provisioning and caching targets as a result of increased regression
testing using the device-mapper-test-suite (dmts). The most notable
of these are the reference counting fixes for the space map btree that
is used by the dm-array interface -- without these the dm-cache
metadata will leak, resulting in dm-cache devices running out of
metadata blocks. Also, some important fixes related to the
thin-provisioning target's transition to read-only mode on error"
* tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm array: fix a reference counting bug in shadow_ablock
dm space map: disallow decrementing a reference count below zero
dm stats: initialize read-only module parameter
dm bufio: initialize read-only module parameters
dm cache: actually resize cache
dm cache: update Documentation for invalidate_cblocks's range syntax
dm cache policy mq: fix promotions to occur as expected
dm thin: allow pool in read-only mode to transition to read-write mode
dm thin: re-establish read-only state when switching to fail mode
dm thin: always fallback the pool mode if commit fails
dm thin: switch to read-only mode if metadata space is exhausted
dm thin: switch to read only mode if a mapping insert fails
dm space map metadata: return on failure in sm_metadata_new_block
dm table: fail dm_table_create on dm_round_up overflow
dm snapshot: avoid snapshot space leak on crash
dm delay: fix a possible deadlock due to shared workqueue
An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.
The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
The old behaviour, returning -EINVAL if a ref_count of 0 would be
decremented, was removed in commit f722063 ("dm space map: optimise
sm_ll_dec and sm_ll_inc"). To fix this regression we return an error
code from the mutator function pointer passed to sm_ll_mutate() and have
dec_ref_count() return -EINVAL if the old ref_count is 0.
Add a DMERR to reflect the potential seriousness of this error.
Also, add missing dm_tm_unlock() to sm_ll_mutate()'s error path.
With this fix the following dmts regression test now passes:
dmtest run --suite cache -n /metadata_use_kernel/
The next patch fixes the higher-level dm-array code that exposed this
regression.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.
This patch performs the following renames.
* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above
Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.
This patch is strictly rename only and doesn't introduce any
functional difference.
- mic / gpio renames were missing. Spotted by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The module parameter stats_current_allocated_bytes in dm-mod is
read-only. This parameter informs the user about memory
consumption. It is not supposed to be changed by the user.
However, despite being read-only, this parameter can be set on
modprobe or insmod command line:
modprobe dm-mod stats_current_allocated_bytes=12345
The kernel doesn't expect that this variable can be non-zero at module
initialization and if the user sets it, it results in warning.
This patch initializes the variable in the module init routine, so
that user-supplied value is ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.
However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345
The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.
This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.2+
Commit f494a9c6b1 ("dm cache: cache
shrinking support") broke cache resizing support.
dm_cache_resize() is called with cache->cache_size before it gets
updated to new_size, so it is a no-op. But the dm-cache superblock is
updated with the new_size even though the backing dm-array is not
resized. Fix this by passing the new_size to dm_cache_resize().
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Micro benchmarks that repeatedly issued IO to a single block were
failing to cause a promotion from the origin device to the cache. Fix
this by not updating the stats during map() if -EWOULDBLOCK will be
returned.
The mq policy will only update stats, consider migration, etc, once per
tick period (a unit of time established between dm-cache core and the
policies).
When the IO thread calls the policy's map method, if it would like to
migrate the associated block it returns -EWOULDBLOCK, the IO then gets
handed over to a worker thread which handles the migration. The worker
thread calls map again, to check the migration is still needed (avoids a
race among other things). *BUT*, before this fix, if we were still in
the same tick period the stats were already updated by the previous map
call -- so the migration would no longer be requested.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A thin-pool may be in read-only mode because the pool's data or metadata
space was exhausted. To allow for recovery, by adding more space to the
pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE
mode. Otherwise, running out of space will render the pool permanently
read-only.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
If the thin-pool transitioned to fail mode and the thin-pool's table
were reloaded for some reason: the new table's default pool mode would
be read-write, though it will transition to fail mode during resume.
When the pool mode transitions directly from PM_WRITE to PM_FAIL we need
to re-establish the intermediate read-only state in both the metadata
and persistent-data block manager (as is usually done with the normal
pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Rename commit_or_fallback() to commit(). Now all previous calls to
commit() will trigger the pool mode to fallback if the commit fails.
Also, check the error returned from commit() in alloc_data_block().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Switch the thin pool to read-only mode in alloc_data_block() if
dm_pool_alloc_data_block() fails because the pool's metadata space is
exhausted.
Differentiate between data and metadata space in messages about no
free space available.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
<snip ... these repeat for a _very_ long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: commit failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
Switch the thin pool to read-only mode when dm_thin_insert_block() fails
since there is little reason to expect the cause of the failure to be
resolved without further action by user space.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
<snip ... these repeat for a long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The dm_round_up function may overflow to zero. In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().
This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
There is a possible leak of snapshot space in case of crash.
The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.
For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250
Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260
Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.
This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.
Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
Move the bio->bi_remaining increment into dm_unhook_bio() so the
overwrite_endio() handler works as expected.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes the following sparse warning:
drivers/md/bcache/btree.c:2220:5: warning:
symbol 'btree_insert_fn' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
commit 566c09c534 raid5: relieve lock contention in get_active_stripe()
modified the locking in get_active_stripe() reducing the range
protected by the (highly contended) device_lock.
Unfortunately it reduced the range too much opening up some races.
One race can occur if get_priority_stripe runs between the
test on sh->count and device_lock being taken.
This will mean that sh->lru is not empty while get_active_stripe
thinks ->count is zero resulting in a 'BUG' firing.
Another race happens if __release_stripe is called immediately
after sh->count is tested and found to be non-zero. If STRIPE_HANDLE
is not set, get_active_stripe should increment ->active_stripes
when it increments ->count from 0, but as it didn't think it was 0,
it doesn't.
Extending device_lock to cover the test on sh->count close these
races.
While we are here, fix the two BUG tests:
-If count is zero, then lru really must not be empty, or we've
lock the stripe_head somehow - no other tests are relevant.
-STRIPE_ON_RELEASE_LIST is completely independent of ->lru so
testing it is pointless.
Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com>
Reviewed-by: Shaohua Li <shli@kernel.org>
Fixes: 566c09c534
Signed-off-by: NeilBrown <neilb@suse.de>
commit 7a0a5355cb md: Don't test all of mddev->flags at once.
made most tests on mddev->flags safer, but missed one.
When
commit 260fa034ef md: avoid deadlock when dirty buffers during md_stop.
added MD_STILL_CLOSED, this caused md_check_recovery to misbehave.
It can think there is something to do but find nothing. This can
lead to the md thread spinning during array shutdown.
https://bugzilla.kernel.org/show_bug.cgi?id=65721
Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 260fa034ef
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: NeilBrown <neilb@suse.de>
In alloc_thread_groups, worker_groups is a pointer to an array,
not an array of pointers.
So
worker_groups[i]
is wrong. It should be
&(*worker_groups)[i]
Found-by: coverity
Fixes: 60aaf93385
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.
Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.
(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>
This is prep work for introducing a more general bio_split().
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Sage Weil <sage@inktank.com>
This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.
Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Now that drivers have been converted to the new bvec_iter primitives,
there's no need to trim the bvec before we submit it; and we can't trim
it once we start sharing bvecs.
It used to be that passing a partially completed bio (i.e. one with
nonzero bi_idx) to generic_make_request() was a dangerous thing -
various drivers would choke on such things. But with immutable biovecs
and our new bio splitting that shares the biovecs, submitting partially
completed bios has to work (and should work, now that all the drivers
have been completed to the new primitives)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
We need to convert the dm code to the new bvec_iter primitives which
respect bi_bvec_done; they also allow us to drastically simplify dm's
bio splitting code.
Also, it's no longer necessary to save/restore the bvec array anymore -
driver conversions for immutable bvecs are done, so drivers should never
be modifying it.
Also kill bio_sector_offset(), dm was the only user and it doesn't make
much sense anymore.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
bio_clone() just got more expensive - however, most users of bio_clone()
don't actually need to modify the biovec. If they aren't modifying the
biovec, and they can guarantee that the original bio isn't freed before
the clone (also true in most cases), we can just point the clone at the
original bio's biovec.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
When we start sharing biovecs, keeping bi_vcnt accurate for splits is
going to be error prone - and unnecessary, if we refactor some code.
So bio_segments() has to go - but most of the existing users just needed
to know if the bio had multiple segments, which is easier - add a
bio_multiple_segments() for them.
(Two of the current uses of bio_segments() are going to go away in a
couple patches, but the current implementation of bio_segments() is
unsafe as soon as we start doing driver conversions for immutable
biovecs - so implement a dumb version for bisectability, it'll go away
in a couple patches)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.
This updates callers for the new usage without changing the
implementation yet.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
For immutable biovecs, we'll be introducing a new bio_iovec() that uses
our new bvec iterator to construct a biovec, taking into account
bvec_iter->bi_bvec_done - this patch updates existing users for the new
usage.
Some of the existing users really do need a pointer into the bvec array
- those uses are all going to be removed, but we'll need the
functionality from immutable to remove them - so for now rename the
existing bio_iovec() -> __bio_iovec(), and it'll be removed in a couple
patches.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
This patch doesn't itself have any functional changes, but immutable
biovecs are going to add a bi_bvec_done member to bi_iter, which will
need to be saved too here.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Bcache has a hack to avoid cloning the biovec if it's all full pages -
but with immutable biovecs coming this won't be necessary anymore.
For now, we remove the special case and always clone the bvec array so
that the immutable biovec patches are simpler.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io
during resync.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUovzDznsnt1WYoG5AQJ1pQ//bDuXadoJ5dwjWjVxFOKoQ9j/9joEI0yH
XTApD3ADKckdBc4TSLOIbCNLW1Pbe23HlOI/GjCiJ/7mePL3OwHd7Fx8Rfq3BubV
f7NgjVwu8nwYD0OXEZsshImptEtrbYwQdy+qlKcHXcZz1MUfR+Egih3r/ouTEfEt
FNq/6MpyN0IKSY82xP/jFZgesBucgKz/YOUIbwClxm7UiyISKvWQLBIAfLB3dyI3
HoEdEzQX6I56Rw0mkSUG4Mk+8xx/8twxL+yqEUqfdJREWuB56Km8kl8y/e465Nk0
ZZg6j/TrslVEwbEeVMx0syvYcaAWFZ4X2jdKfo1lI0g9beZp7H1GRF8yR1s2t/h4
g/vb55MEN++4LPaE9ut4z7SG2yLyGkZgFTzTjyq5of+DFL0cayO7wXxbgpcD7JYf
Doef/OSa6csKiGiJI48iQa08Bolmz9ZWzZQXhAthKfFQ9Rv+GEtIAi4kLR8EZPbu
0/FL1ylYNUY9O7p0g+iy9Kcoc+xW36I95pPZf8pO8GFcXTjyuCCBVh/SNvFZZHPl
3xk3aZJknAEID8VrVG2IJPkeDI8WK8YxmpU/nARCoytn07Df6Ye8jGvLdR8pL3lB
TIZV6eRY4yciB8LtoK9Kg4XTmOMhBtjt4c3znkljp98vhOQQb/oHN+BXMGcwqvr9
fk0KGrg31VA=
=8RCg
-----END PGP SIGNATURE-----
Merge tag 'md/3.13' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io during
resync"
* tag 'md/3.13' of git://neil.brown.name/md:
md/raid5: Use conf->device_lock protect changing of multi-thread resources.
md/raid5: Before freeing old multi-thread worker, it should flush them.
md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
raid1: Rewrite the implementation of iobarrier.
raid1: Add some macros to make code clearly.
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
raid1: Add a field array_frozen to indicate whether raid in freeze state.
md: Convert use of typedef ctl_table to struct ctl_table
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
md: fix some places where mddev_lock return value is not checked.
raid5: Retry R5_ReadNoMerge flag when hit a read error.
raid5: relieve lock contention in get_active_stripe()
raid5: relieve lock contention in get_active_stripe()
wait: add wait_event_cmd()
md/raid5.c: add proper locking to error path of raid5_start_reshape.
md: fix calculation of stacking limits on level change.
raid5: Use slow_path to release stripe when mddev->thread is null
For R5_ReadNoMerge,it mean this bio can't merge with other bios or
request.It used REQ_FLUSH to achieve this. But REQ_NOMERGE can do the
same work.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There is an iobarrier in raid1 because of contention between normal IO and
resync IO. It suspends all normal IO when resync/recovery happens.
However if normal IO is out side the resync window, there is no contention.
So this patch changes the barrier mechanism to only block IO that
could contend with the resync that is currently happening.
We partition the whole space into five parts.
|---------|-----------|------------|----------------|-------|
start next_resync start_next_window end_window
start + RESYNC_WINDOW = next_resync
next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
start_next_window + NEXT_NORMALIO_DISTANCE = end_window
Firstly we introduce some concepts:
1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
and start_next_window. It also indicates the distance between
start_next_window and end_window.
It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
this turned out not to be optimal.
3 - next_resync: the next sector at which we will do sync IO.
4 - start: a position which is at most RESYNC_WINDOW before
next_resync.
5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE
beyond next_resync. Normal-io after this position doesn't need to
wait for resync-io to complete.
6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
next_resync. This also doesn't need to wait, but is counted
differently.
7 - current_window_requests: the count of normalIO between
start_next_window and end_window.
8 - next_window_requests: the count of normalIO after end_window.
NormalIO will be partitioned into four types:
NormIO1: the end sector of bio is smaller or equal the start
NormIO2: the start sector of bio larger or equal to end_window
NormIO3: the start sector of bio larger or equal to
start_next_window.
NormIO4: the location between start_next_window and end_window
|--------|-----------|--------------------|----------------|-------------|
| start | next_resync | start_next_window | end_window |
NormIO1 NormIO4 NormIO4 NormIO3 NormIO2
For NormIO1, we don't need any io barrier.
For NormIO4, we used a similar approach to the original iobarrier
mechanism. The normalIO and resyncIO must be kept separate.
For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
and "next_window_requests". They indicate the count of active
requests in the two window.
For these, we don't wait for resync io to complete.
For resync action, if there are NormIO4s, we must wait for it.
If not, we can proceed.
But if resync action reaches start_next_window and
current_window_requests > 0 (that is there are NormIO3s), we must
wait until the current_window_requests becomes zero.
When current_window_requests becomes zero, start_next_window also
moves forward. Then current_window_requests will replaced by
next_window_requests.
There is a problem which when and how to change from NormIO2 to
NormIO3. Only then can sync action progress.
We add a field in struct r1conf "start_next_window".
A: if start_next_window == MaxSector, it means there are no NormIO2/3.
So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
B: if current_window_requests == 0 && next_window_requests != 0, it
means start_next_window move to end_window
There is another problem which how to differentiate between
old NormIO2(now it is NormIO3) and NormIO2.
For example, there are many bios which are NormIO2 and a bio which is
NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3.
We add a field in struct r1bio "start_next_window".
This is used to record the position conf->start_next_window when the call
to wait_barrier() is made in make_request().
In allow_barrier(), we check the conf->start_next_window.
If r1bio->stat_next_window == conf->start_next_window, it means
there is no transition between NormIO2 and NormIO3.
If r1bio->start_next_window != conf->start_next_window, it mean
there was a transition between NormIO2 and NormIO3. There can only
have been one transition. So it only means the bio is old NormIO2.
For one bio, there may be many r1bio's. So we make sure
all the r1bio->start_next_window are the same value.
If we met blocked_dev in make_request(), it must call allow_barrier
and wait_barrier. So the former and the later value of
conf->start_next_window will be change.
If there are many r1bio's with differnet start_next_window,
for the relevant bio, it depend on the last value of r1bio.
It will cause error. To avoid this, we must wait for previous r1bios
to complete.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In a subsequent patch, we'll use some const parameters.
Using macros will make the code clearly.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We used to use raise_barrier to suspend normal IO while we reconfigure
the array. However raise_barrier will soon only suspend some normal
IO, not all. So we need something else.
Change it to use freeze_array.
But freeze_array not only suspends normal io, it also suspends
resync io.
For the place where call raise_barrier for reconfigure, it isn't a
problem.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Because the following patch will rewrite the content between normal IO
and resync IO. So we used a parameter to indicate whether raid is in freeze
array.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When raid5 recovery hits a fresh badblock, this badblock will flagged as unack
badblock until md_update_sb() is called.
But md_stop will take reconfig lock which means raid5d can't call
md_update_sb() in md_check_recovery(), the badblock will always
be unack, so raid5d thread enters an infinite loop and md_stop_write()
can never stop sync_thread. This causes deadlock.
To solve this, when STOP_ARRAY ioctl is issued and sync_thread is
running, we need set md->recovery FROZEN and INTR flags and wait for
sync_thread to stop before we (re)take reconfig lock.
This requires that raid5 reshape_request notices MD_RECOVERY_INTR
(which it probably should have noticed anyway) and stops waiting for a
metadata update in that case.
Reported-by: Jianpeng Ma <majianpeng@gmail.com>
Reported-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We currently use kthread_should_stop() in various places in the
sync/reshape code to abort early.
However some places set MD_RECOVERY_INTR but don't immediately call
md_reap_sync_thread() (and we will shortly get another one).
When this happens we are relying on md_check_recovery() to reap the
thread and that only happen when it finishes normally.
So MD_RECOVERY_INTR must lead to a normal finish without the
kthread_should_stop() test.
So replace all relevant tests, and be more careful when the thread is
interrupted not to acknowledge that latest step in a reshape as it may
not be fully committed yet.
Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
so we don't wait have to wait for the speed to drop before we can abort.
Signed-off-by: NeilBrown <neilb@suse.de>
Sometimes we need to lock and mddev and cannot cope with
failure due to interrupt.
In these cases we should use mutex_lock, not mutex_lock_interruptible.
Signed-off-by: NeilBrown <neilb@suse.de>
Because of block layer merge, one bio fails will cause other bios
which belongs to the same request fails, so raid5_end_read_request
will record all these bios as badblocks.
If retry request with R5_ReadNoMerge flag to avoid bios merge,
badblocks can only record sector which is bad exactly.
test:
hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
mdadm /dev/md0 -f /dev/sdd
mdadm /dev/md0 -r /dev/sdd
mdadm --zero-superblock /dev/sdd
mdadm /dev/md0 -a /dev/sdd
1. Without this patch:
cat /sys/block/md0/md/rd*/bad_blocks
299776 256
299776 256
2. With this patch:
cat /sys/block/md0/md/rd*/bad_blocks
300000 8
300000 8
Signed-off-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
track empty inactive list count, so md_raid5_congested() can use it to make
decision.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The dm-delay target uses a shared workqueue for multiple instances. This
can cause deadlock if two or more dm-delay targets are stacked on the top
of each other.
This patch changes dm-delay to use a per-instance workqueue.
Cc: stable@vger.kernel.org # 2.6.22+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull trivial tree updates from Jiri Kosina:
"Usual earth-shaking, news-breaking, rocket science pile from
trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
doc: add missing files to timers/00-INDEX
timekeeping: Fix some trivial typos in comments
mm: Fix some trivial typos in comments
irq: Fix some trivial typos in comments
NUMA: fix typos in Kconfig help text
mm: update 00-INDEX
doc: Documentation/DMA-attributes.txt fix typo
DRM: comment: `halve' -> `half'
Docs: Kconfig: `devlopers' -> `developers'
doc: typo on word accounting in kprobes.c in mutliple architectures
treewide: fix "usefull" typo
treewide: fix "distingush" typo
mm/Kconfig: Grammar s/an/a/
kexec: Typo s/the/then/
Documentation/kvm: Update cpuid documentation for steal time and pv eoi
treewide: Fix common typo in "identify"
__page_to_pfn: Fix typo in comment
Correct some typos for word frequency
clk: fixed-factor: Fix a trivial typo
...
Pull second round of block driver updates from Jens Axboe:
"As mentioned in the original pull request, the bcache bits were pulled
because of their dependency on the immutable bio vecs. Kent re-did
this part and resubmitted it, so here's the 2nd round of (mostly)
driver updates for 3.13. It contains:
- The bcache work from Kent.
- Conversion of virtio-blk to blk-mq. This removes the bio and request
path, and substitutes with the blk-mq path instead. The end result
almost 200 deleted lines. Patch is acked by Asias and Christoph, who
both did a bunch of testing.
- A removal of bootmem.h include from Grygorii Strashko, part of a
larger series of his killing the dependency on that header file.
- Removal of __cpuinit from blk-mq from Paul Gortmaker"
* 'for-linus' of git://git.kernel.dk/linux-block: (56 commits)
virtio_blk: blk-mq support
blk-mq: remove newly added instances of __cpuinit
bcache: defensively handle format strings
bcache: Bypass torture test
bcache: Delete some slower inline asm
bcache: Use ida for bcache block dev minor
bcache: Fix sysfs splat on shutdown with flash only devs
bcache: Better full stripe scanning
bcache: Have btree_split() insert into parent directly
bcache: Move spinlock into struct time_stats
bcache: Kill sequential_merge option
bcache: Kill bch_next_recurse_key()
bcache: Avoid deadlocking in garbage collection
bcache: Incremental gc
bcache: Add make_btree_freeing_key()
bcache: Add btree_node_write_sync()
bcache: PRECEDING_KEY()
bcache: bch_(btree|extent)_ptr_invalid()
bcache: Don't bother with bucket refcount for btree node allocations
bcache: Debug code improvements
...
Make this useful helper available for other users.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use this new function to make code more comprehensible, since we are
reinitialzing the completion, not initializing.
[akpm@linux-foundation.org: linux-next resyncs]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org> (personally at LCE13)
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If raid5_start_reshape errors out, we need to reset all the fields
that were updated (not just some), and need to use the seq_counter
to ensure make_request() doesn't use an inconsitent state.
Signed-off-by: NeilBrown <neilb@suse.de>
The various ->run routines of md personalities assume that the 'queue'
has been initialised by the blk_set_stacking_limits() call in
md_alloc().
However when the level is changed (by level_store()) the ->run routine
for the new level is called for an array which has already had the
stacking limits modified. This can result in incorrect final
settings.
So call blk_set_stacking_limits() before ->run in level_store().
A specific consequence of this bug is that it causes
discard_granularity to be set incorrectly when reshaping a RAID4 to a
RAID0.
This is suitable for any -stable kernel since 3.3 in which
blk_set_stacking_limits() was introduced.
Cc: stable@vger.kernel.org (3.3+)
Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When release_stripe() is called in grow_one_stripe(), the
mddev->thread is null. So it will omit one wakeup this thread to
release stripe.
For this condition, use slow_path to release stripe.
Bug was introduced in 3.12
Cc: stable@vger.kernel.org (3.12+)
Fixes: 773ca82fa1
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Improve reliability of buffer allocations for dm messages with a small
number of arguments, a couple path group initialization fixes for dm
multipath, a fix for resizing a dm array, various fixes and
optimizations for dm cache, a fix for device mapper's Kconfig menu
indentation.
Features added include:
- dm crypt support for activating legacy CBC TrueCrypt containers
(useful for forensics of these old TCRYPT containers)
- reduced dm-cache memory requirements for each block in the cache
- basic support for shrinking a dm-cache's cache (fast) device
- most notably, dm-cache support for managing cache coherency when
deploying dm-cache with sophisticated origin volumes (that support
hardware snapshots and/or clustering): these changes come in the form
of a new passthrough operation mode and a cache block invalidation
interface.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSgt+QAAoJEMUj8QotnQNapcEIALC6U1rmw08PRMSanqg4/aVu
pTahzPtai9jXchQV6q5XsglJryrhD9MoNqrZgHd2drdnmEKTKfVX+/iCXGiE4hQ5
I5QUZf5myEXSd60pCgZwNam+VHMuAuSPQW6LWqRTJjDLHixGF+AoHZGxkEsYgj6M
p686OOpga1nmT2w072xLIh9z2tsv/tm+UN7GSbyklM+/1ItcXxq+/J8rsuth7IqT
k0I60jexq+Q3OaYuJY7vxhdE7PhBCw1fGmtuCcjekqsSVpAdCgDz3FFOEZmyXcUs
YLFE3GcclYQpIPjNjVGTLDFHdoIMWdKiibs/ScBUtegqxWvqP7c87YFhbL+VHDM=
=lLxo
-----END PGP SIGNATURE-----
Merge tag 'dm-3.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
"A set of device-mapper changes for 3.13.
Improve reliability of buffer allocations for dm messages with a small
number of arguments, a couple path group initialization fixes for dm
multipath, a fix for resizing a dm array, various fixes and
optimizations for dm cache, a fix for device mapper's Kconfig menu
indentation.
Features added include:
- dm crypt support for activating legacy CBC TrueCrypt containers
(useful for forensics of these old TCRYPT containers)
- reduced dm-cache memory requirements for each block in the cache
- basic support for shrinking a dm-cache's cache (fast) device
- most notably, dm-cache support for managing cache coherency when
deploying dm-cache with sophisticated origin volumes (that support
hardware snapshots and/or clustering): these changes come in the
form of a new passthrough operation mode and a cache block
invalidation interface"
* tag 'dm-3.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (32 commits)
dm cache: resolve small nits and improve Documentation
dm cache: add cache block invalidation support
dm cache: add remove_cblock method to policy interface
dm cache policy mq: reduce memory requirements
dm cache metadata: check the metadata version when reading the superblock
dm cache: add passthrough mode
dm cache: cache shrinking support
dm cache: promotion optimisation for writes
dm cache: be much more aggressive about promoting writes to discarded blocks
dm cache policy mq: implement writeback_work() and mq_{set,clear}_dirty()
dm cache: optimize commit_if_needed
dm space map disk: optimise sm_disk_dec_block
MAINTAINERS: add reference to device-mapper's linux-dm.git tree
dm: fix Kconfig menu indentation
dm: allow remove to be deferred
dm table: print error on preresume failure
dm crypt: add TCW IV mode for old CBC TCRYPT containers
dm crypt: properly handle extra key string in initialization
dm cache: log error message if dm_kcopyd_copy() fails
dm cache: use cell_defer() boolean argument consistently
...
Pull block IO core updates from Jens Axboe:
"This is the pull request for the core changes in the block layer for
3.13. It contains:
- The new blk-mq request interface.
This is a new and more scalable queueing model that marries the
best part of the request based interface we currently have (which
is fully featured, but scales poorly) and the bio based "interface"
which the new drivers for high IOPS devices end up using because
it's much faster than the request based one.
The bio interface has no block layer support, since it taps into
the stack much earlier. This means that drivers end up having to
implement a lot of functionality on their own, like tagging,
timeout handling, requeue, etc. The blk-mq interface provides all
these. Some drivers even provide a switch to select bio or rq and
has code to handle both, since things like merging only works in
the rq model and hence is faster for some workloads. This is a
huge mess. Conversion of these drivers nets us a substantial code
reduction. Initial results on converting SCSI to this model even
shows an 8x improvement on single queue devices. So while the
model was intended to work on the newer multiqueue devices, it has
substantial improvements for "classic" hardware as well. This code
has gone through extensive testing and development, it's now ready
to go. A pull request is coming to convert virtio-blk to this
model will be will be coming as well, with more drivers scheduled
for 3.14 conversion.
- Two blktrace fixes from Jan and Chen Gang.
- A plug merge fix from Alireza Haghdoost.
- Conversion of __get_cpu_var() from Christoph Lameter.
- Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.
- A fix for a race between request completion and the timeout
handling from Jeff Moyer. This is what caused the merge conflict
with blk-mq/core, in case you are looking at that.
- A dm stacking fix from Mike Snitzer.
- A code consolidation fix and duplicated code removal from Kent
Overstreet.
- A handful of block bug fixes from Mikulas Patocka, fixing a loop
crash and memory corruption on blk cg.
- Elevator switch bug fix from Tomoki Sekiyama.
A heads-up that I had to rebase this branch. Initially the immutable
bio_vecs had been queued up for inclusion, but a week later, it became
clear that it wasn't fully cooked yet. So the decision was made to
pull this out and postpone it until 3.14. It was a straight forward
rebase, just pruning out the immutable series and the later fixes of
problems with it. The rest of the patches applied directly and no
further changes were made"
* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: Do not call sector_div() with a 64-bit divisor
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
block: Consolidate duplicated bio_trim() implementations
block: Use rw_copy_check_uvector()
block: Enable sysfs nomerge control for I/O requests in the plug list
block: properly stack underlying max_segment_size to DM device
elevator: acquire q->sysfs_lock in elevator_change()
elevator: Fix a race in elevator switching and md device initialization
block: Replace __get_cpu_var uses
bdi: test bdi_init failure
block: fix a probe argument to blk_register_region
loop: fix crash if blk_alloc_queue fails
blk-core: Fix memory corruption if blkcg_init_queue fails
block: fix race between request completion and timeout handling
blktrace: Send BLK_TN_PROCESS events to all running traces
blk-mq: don't disallow request merges for req->special being set
blk-mq: mq plug list breakage
blk-mq: fix for flush deadlock
...
Document passthrough mode, cache shrinking, and cache invalidation.
Also, use strcasecmp() and hlist_unhashed().
Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cache block invalidation is removing an entry from the cache without
writing it back. Cache blocks can be invalidated via the
'invalidate_cblocks' message, which takes an arbitrary number of cblock
ranges:
invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
E.g.
dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Implement policy_remove_cblock() and add remove_cblock method to the mq
policy. These methods will be used by the following cache block
invalidation patch which adds the 'invalidate_cblocks' message to the
cache core.
Also, update some comments in dm-cache-policy.h
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rather than storing the cblock in each cache entry, we allocate all
entries in an array and infer the cblock from the entry position.
Saves 4 bytes of memory per cache block. In addition, this gives us an
easy way of looking up cache entries by cblock.
We no longer need to keep an explicit bitset to track which cblocks
have been allocated. And no searching is needed to find free cblocks.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Need to check the version to verify on-disk metadata is supported.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
"Passthrough" is a dm-cache operating mode (like writethrough or
writeback) which is intended to be used when the cache contents are not
known to be coherent with the origin device. It behaves as follows:
* All reads are served from the origin device (all reads miss the cache)
* All writes are forwarded to the origin device; additionally, write
hits cause cache block invalidates
This mode decouples cache coherency checks from cache device creation,
largely to avoid having to perform coherency checks while booting. Boot
scripts can create cache devices in passthrough mode and put them into
service (mount cached filesystems, for example) without having to worry
about coherency. Coherency that exists is maintained, although the
cache will gradually cool as writes take place.
Later, applications can perform coherency checks, the nature of which
will depend on the type of the underlying storage. If coherency can be
verified, the cache device can be transitioned to writethrough or
writeback mode while still warm; otherwise, the cache contents can be
discarded prior to transitioning to the desired operating mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Allow a cache to shrink if the blocks being removed from the cache are
not dirty.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Just to be safe, call the error reporting function with "%s" to avoid
any possible future format string leak.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Never saw a profile of bset_search_tree() where it wasn't bottlenecked
on memory until I got my new Haswell machine, but when I tried it there
it was suddenly burning 20% of the cpu in the inner loop on shrd...
Turns out, the version of shrd that takes 64 bit operands has a 9 cycle
latency. hah.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The flow control in btree_insert_node() was... fragile... before,
this'll use more stack (but since our btrees are never more than depth
1, that shouldn't matter) and it should be significantly clearer and
less fragile.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Big garbage collection rewrite; now, garbage collection uses the same
mechanisms as used elsewhere for inserting/updating btree node pointers,
instead of rewriting interior btree nodes in place.
This makes the code significantly cleaner and less fragile, and means we
can now make garbage collection incremental - it doesn't have to hold a
write lock on the root of the btree for the entire duration of garbage
collection.
This means that there's less of a latency hit for doing garbage
collection, which means we can gc more frequently (and do a better job
of reclaiming from the cache), and we can coalesce across more btree
nodes (improving our space efficiency).
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Trying to treat btree pointers and leaf node pointers the same way was a
mistake - going to start being more explicit about the type of
key/pointer we're dealing with. This is the first part of that
refactoring; this patch shouldn't change any actual behaviour.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The bucket refcount (dropped with bkey_put()) is only needed to prevent
the newly allocated bucket from being garbage collected until we've
added a pointer to it somewhere. But for btree node allocations, the
fact that we have btree nodes locked is enough to guard against races
with garbage collection.
Eventually the per bucket refcount is going to be replaced with
something specific to bch_alloc_sectors().
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Couple changes:
* Consolidate bch_check_keys() and bch_check_key_order(), and move the
checks that only check_key_order() could do to bch_btree_iter_next().
* Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in
when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file to
flip on the EDEBUG checks at runtime.
* Dropped an old not terribly useful check in rw_unlock(), and
refactored/improved a some of the other debug code.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Previously, bch_ptr_bad() could return false when there was a pointer to
a nonexistant device... it only filtered out keys with PTR_CHECK_DEV
pointers.
This behaviour was intended for multiple cache device support; for that,
just because the device for one of the pointers has gone away doesn't
mean we want to filter out the rest of the pointers.
But we don't yet explicitly filter/check individual pointers, so without
that this behaviour was wrong - a corrupt bkey with a bad device pointer
could cause us to deref a bad pointer. Doh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Now, the on disk data structures are in a header that can be exported to
userspace - and having them all centralized is nice too.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
With all the recent refactoring around struct btree op struct search has
gotten rather large.
But we can now easily break it up in a different way - we break out
struct btree_insert_op which is for inserting data into the cache, and
that's now what the copying gc code uses - struct search is now specific
to request.c
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Last of the btree_map() conversions. Main visible effect is
bch_btree_insert() is no longer taking a struct btree_op as an argument
anymore - there's no fancy state machine stuff going on, it's just a
normal function.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
When we convert bch_btree_insert() to bch_btree_map_leaf_nodes(), we
won't be passing struct btree_op to bch_btree_insert() anymore - so we
need a different way of returning whether there was a collision (really,
a replace collision).
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This is prep work for converting bch_btree_insert to
bch_btree_map_leaf_nodes() - we have to convert all its arguments to
actual arguments. Bunch of churn, but should be straightforward.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There was some looping in submit_partial_cache_hit() and
submit_partial_cache_hit() that isn't needed anymore - originally, we
wouldn't necessarily process the full hit or miss all at once because
when splitting the bio, we took into account the restrictions of the
device we were sending it to.
But, device bio size restrictions are now handled elsewhere, with a
wrapper around generic_make_request() - so that looping has been
unnecessary for awhile now and we can now do quite a bit of cleanup.
And if we trim the key we're reading from to match the subset we're
actually reading, we don't have to explicitly calculate bi_sector
anymore. Neat.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This is a fairly straightforward conversion, mostly reshuffling -
op->lookup_done goes away, replaced by MAP_DONE/MAP_CONTINUE. And the
code for handling cache hits and misses wasn't really btree code, so it
gets moved to request.c.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
With the new btree_map() functions, we don't need to export the stuff
needed for traversing the btree anymore.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Lots of stuff has been open coding its own btree traversal - which is
generally pretty simple code, but there are a few subtleties.
This adds new new functions, bch_btree_map_nodes() and
bch_btree_map_keys(), which do the traversal for you. Everything that's
open coding btree traversal now (with the exception of garbage
collection) is slowly going to be converted to these two functions;
being able to write other code at a higher level of abstraction is a
big improvement w.r.t. overall code quality.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This simplifies the writeback flow control quite a bit - previously, it
was conceptually two coroutines, refill_dirty() and read_dirty(). This
makes the code quite a bit more straightforward.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We needed a dedicated rescuer workqueue for gc anyways... and gc was
conceptually a dedicated thread, just one that wasn't running all the
time. Switch it to a dedicated thread to make the code a bit more
straightforward.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
At one point we did do fancy asynchronous waiting stuff with
bucket_wait, but that's all gone (and bucket_wait is used a lot less
than it used to be). So use the standard primitives.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Slowly working on pruning struct btree_op - the aim is for it to only
contain things that are actually necessary for traversing the btree.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Making things less asynchronous that don't need to be - bch_journal()
only has to block when the journal or journal entry is full, which is
emphatically not a fast path. So make it a normal function that just
returns when it finishes, to make the code and control flow easier to
follow.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Try to improve some of the naming a bit to be more consistent, and also
improve the flow of control in request_write() a bit.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Some refactoring - better to explicitly pass stuff around instead of
having it all in the "big bag of state", struct btree_op. Going to prune
struct btree_op quite a bit over time.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This was the main point of all this refactoring - now,
btree_insert_check_key() won't fail just because the leaf node happened
to be full.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We'll often end up with a list of adjacent keys to insert -
because bch_data_insert() may have to fragment the data it writes.
Originally, to simplify things and avoid having to deal with corner
cases bch_btree_insert() would pass keys from this list one at a time to
btree_insert_recurse() - mainly because the list of keys might span leaf
nodes, so it was easier this way.
With the btree_insert_node() refactoring, it's now a lot easier to just
pass down the whole list and have btree_insert_recurse() iterate over
leaf nodes until it's done.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The flow of control in the old btree insertion code was rather -
backwards; we'd recurse down the btree (in btree_insert_recurse()), and
then if we needed to split the keys to be inserted into the parent node
would be effectively returned up to btree_insert_recurse(), which would
notice there was more work to do and finish the insertion.
The main problem with this was that the full logic for btree insertion
could only be used by calling btree_insert_recurse; if you'd gotten to a
btree leaf some other way and had a key to insert, if it turned out that
node needed to be split you were SOL.
This inverts the flow of control so btree_insert_node() does _full_
btree insertion, including splitting - and takes a (leaf) btree node to
insert into as a parameter.
This means we can now _correctly_ handle cache misses - for cache
misses, we need to insert a fake "check" key into the btree when we
discover we have a cache miss - while we still have the btree locked.
Previously, if the btree node was full inserting a cache miss would just
fail.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This is prep work for the reworked btree insertion code.
The way we set b->parent is ugly and hacky... the problem is, when
btree_split() or garbage collection splits or rewrites a btree node, the
parent changes for all its (potentially already cached) children.
I may change this later and add some code to look through the btree node
cache and find all our cached child nodes and change the parent pointer
then...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Checking i->seq was redundant, because since ages ago we always
initialize the new bset when advancing b->written
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Originally I got this right... except that the divides didn't use
do_div(), which broke 32 bit kernels. When I went to fix that, I forgot
that the raid stripe size usually isn't a power of two... doh
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The old asynchronous discard code was really a relic from when all the
allocation code was asynchronous - now that allocation runs out of a
dedicated thread there's no point in keeping around all that complicated
machinery.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
bch_keybuf_del() takes a spinlock that can't be taken in interrupt context -
whoops. Fortunately, this code isn't enabled by default (you have to toggle a
sysfs thing).
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
If a write block triggers promotion and covers a whole block we can
avoid a copy.
Introduce dm_{hook,unhook}_bio to simplify saving and restoring bio
fields (bi_private is now used by overwrite). Switch writethrough
support over to using these helpers too.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Previously these promotions only got priority if there were unused cache
blocks. Now we give them priority if there are any clean blocks in the
cache.
The fio_soak_test in the device-mapper-test-suite now gives uniform
performance across subvolumes (~16 seconds).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There are now two multiqueues for in cache blocks. A clean one and a
dirty one.
writeback_work comes from the dirty one. Demotions come from the clean
one.
There are two benefits:
- Performance improvement, since demoting a clean block is a noop.
- The cache cleans itself when io load is light.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Check commit_requested flag _before_ calling
dm_cache_changed_this_transaction() superfluously.
Also, be sure to set last_commit_jiffies _after_ dm_cache_commit()
completes.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Don't waste time spotting blocks that have been allocated and then freed
in the same transaction.
The extra lookup is expensive, and I don't think it really gives us much.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The option DM_LOG_USERSPACE is sub-option of DM_MIRROR, so place it
right after DM_MIRROR. Doing so fixes various other Device mapper
targets/features to be properly nested under "Device mapper support".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This patch allows the removal of an open device to be deferred until
it is closed. (Previously such a removal attempt would fail.)
The deferred remove functionality is enabled by setting the flag
DM_DEFERRED_REMOVE in the ioctl structure on DM_DEV_REMOVE or
DM_REMOVE_ALL ioctl.
On return from DM_DEV_REMOVE, the flag DM_DEFERRED_REMOVE indicates if
the device was removed immediately or flagged to be removed on close -
if the flag is clear, the device was removed.
On return from DM_DEV_STATUS and other ioctls, the flag
DM_DEFERRED_REMOVE is set if the device is scheduled to be removed on
closure.
A device that is scheduled to be deleted can be revived using the
message "@cancel_deferred_remove". This message clears the
DMF_DEFERRED_REMOVE flag so that the device won't be deleted on close.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If preresume fails it is worth logging an error given that a device is
left suspended due to the failure.
This change was motivated by local preresume error logging that was
added to the cache target ("preresume failed"). Elevating this
target-agnostic context for the where the target-specific error occurred
relative to the DM core's callouts makes sense.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
dm-crypt can already activate TCRYPT (TrueCrypt compatible) containers
in LRW or XTS block encryption mode.
TCRYPT containers prior to version 4.1 use CBC mode with some additional
tweaks, this patch adds support for these containers.
This new mode is implemented using special IV generator named TCW
(TrueCrypt IV with whitening). TCW IV only supports containers that are
encrypted with one cipher (Tested with AES, Twofish, Serpent, CAST5 and
TripleDES).
While this mode is legacy and is known to be vulnerable to some
watermarking attacks (e.g. revealing of hidden disk existence) it can
still be useful to activate old containers without using 3rd party
software or for independent forensic analysis of such containers.
(Both the userspace and kernel code is an independent implementation
based on the format documentation and it completely avoids use of
original source code.)
The TCW IV generator uses two additional keys: Kw (whitening seed, size
is always 16 bytes - TCW_WHITENING_SIZE) and Kiv (IV seed, size is
always the IV size of the selected cipher). These keys are concatenated
at the end of the main encryption key provided in mapping table.
While whitening is completely independent from IV, it is implemented
inside IV generator for simplification.
The whitening value is always 16 bytes long and is calculated per sector
from provided Kw as initial seed, xored with sector number and mixed
with CRC32 algorithm. Resulting value is xored with ciphertext sector
content.
IV is calculated from the provided Kiv as initial IV seed and xored with
sector number.
Detailed calculation can be found in the Truecrypt documentation for
version < 4.1 and will also be described on dm-crypt site, see:
http://code.google.com/p/cryptsetup/wiki/DMCrypt
The experimental support for activation of these containers is already
present in git devel brach of cryptsetup.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Some encryption modes use extra keys (e.g. loopAES has IV seed) which
are not used in block cipher initialization but are part of key string
in table constructor.
This patch adds an additional field which describes the length of the
extra key(s) and substracts it before real key encryption setting.
The key_size always includes the size, in bytes, of the key provided
in mapping table.
The key_parts describes how many parts (usually keys) are contained in
the whole key buffer. And key_extra_size contains size in bytes of
additional keys part (this number of bytes must be subtracted because it
is processed by the IV generator).
| K1 | K2 | .... | K64 | Kiv |
|----------- key_size ----------------- |
| |-key_extra_size-|
| [64 keys] | [1 key] | => key_parts = 65
Example where key string contains main key K, whitening key
Kw and IV seed Kiv:
| K | Kiv | Kw |
|--------------- key_size --------------|
| |-----key_extra_size------|
| [1 key] | [1 key] | [1 key] | => key_parts = 3
Because key_extra_size is calculated during IV mode setting, key
initialization is moved after this step.
For now, this change has no effect to supported modes (thanks to ilog2
rounding) but it is required by the following patch.
Also, fix a sparse warning in crypt_iv_lmk_one().
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A migration failure should be logged (albeit limited).
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix a few cell_defer() calls that weren't passing a bool.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Return -EINVAL when the specified cache policy is unknown rather than
returning -ENOMEM.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rename takeout_queue to concat_queue.
Fix a harmless bug in mq policies pop() function. Currently pop()
always succeeds, with up coming changes this wont be the case.
Fix typo in comment above pre_cache_to_cache prototype.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make the quiescing flag an atomic_t and stop protecting it with a spin
lock.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The code that was trying to do this was inadequate. The postsuspend
method (in ioctl context), needs to wait for the worker thread to
acknowledge the request to quiesce. Otherwise the migration count may
drop to zero temporarily before the worker thread realises we're
quiescing. In this case the target will be taken down, but the worker
thread may have issued a new migration, which will cause an oops when
it completes.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
Previously only origin bios could trigger ticks, which meant if all
the io was destined for the cache no ticks were generated. If no ticks
are generated then multiple hits, and movements in general, are
attributed to the same tick.
Only a stop gap fix, we need a better solution.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
It is safe to use a mutex in mq_residency() at this point since it is
only called from ioctl context. But future-proof mq_residency() by
using might_sleep() to catch new contexts that cannot sleep.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here's the big driver core / sysfs update for 3.13-rc1.
There's lots of dev_groups updates for different subsystems, as they all
get slowly migrated over to the safe versions of the attribute groups
(removing userspace races with the creation of the sysfs files.) Also
in here are some kobject updates, devres expansions, and the first round
of Tejun's sysfs reworking to enable it to be used by other subsystems
as a backend for an in-kernel filesystem.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlJ6xAMACgkQMUfUDdst+yk1kQCfcHXhfnrvFZ5J/mDP509IzhNS
ddEAoLEWoivtBppNsgrWqXpD1vi4UMsE
=JmVW
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core / sysfs update for 3.13-rc1.
There's lots of dev_groups updates for different subsystems, as they
all get slowly migrated over to the safe versions of the attribute
groups (removing userspace races with the creation of the sysfs
files.) Also in here are some kobject updates, devres expansions, and
the first round of Tejun's sysfs reworking to enable it to be used by
other subsystems as a backend for an in-kernel filesystem.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
sysfs: rename sysfs_assoc_lock and explain what it's about
sysfs: use generic_file_llseek() for sysfs_file_operations
sysfs: return correct error code on unimplemented mmap()
mdio_bus: convert bus code to use dev_groups
device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
sysfs: separate out dup filename warning into a separate function
sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
sysfs: remove unused sysfs_get_dentry() prototype
sysfs: honor bin_attr.attr.ignore_lockdep
sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
devres: restore zeroing behavior of devres_alloc()
sysfs: fix sysfs_write_file for bin file
input: gameport: convert bus code to use dev_groups
input: serio: remove bus usage of dev_attrs
input: serio: use DEVICE_ATTR_RO()
i2o: convert bus code to use dev_groups
memstick: convert bus code to use dev_groups
tifm: convert bus code to use dev_groups
virtio: convert bus code to use dev_groups
ipack: convert bus code to use dev_groups
...
Entries would be lost if the old tail block was partially filled.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
When pg_init is running no I/O can be submitted to the underlying
devices, as the path priority etc might change. When using queue_io for
this, requests will be piling up within multipath as the block I/O
scheduler just sees a _very fast_ device. All of this queued I/O has to
be resubmitted from within multipathing once pg_init is done.
This approach has the problem that it's virtually impossible to
abort I/O when pg_init is running, and we're adding heavy load
to the devices after pg_init since all of the queued I/O needs to be
resubmitted _before_ any requests can be pulled off of the request queue
and normal operation continues.
This patch will requeue the I/O that triggers the pg_init call, and
return 'busy' when pg_init is in progress. With these changes the block
I/O scheduler will stop submitting I/O during pg_init, resulting in a
quicker path switch and less I/O pressure (and memory consumption) after
pg_init.
Signed-off-by: Hannes Reinecke <hare@suse.de>
[patch header edited for clarity and typos by Mike Snitzer]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Whenever multipath_dtr() is happening we must prevent queueing any
further path activation work. Implement this by adding a new
'pg_init_disabled' flag to the multipath structure that denotes future
path activation work should be skipped if it is set. By disabling
pg_init and then re-enabling in flush_multipath_work() we also avoid the
potential for pg_init to be initiated while suspending an mpath device.
Without this patch a race condition exists that may result in a kernel
panic:
1) If after pg_init_done() decrements pg_init_in_progress to 0, a call
to wait_for_pg_init_completion() assumes there are no more pending path
management commands.
2) If pg_init_required is set by pg_init_done(), due to retryable
mode_select errors, then process_queued_ios() will again queue the
path activation work.
3) If free_multipath() completes before activate_path() work is called a
NULL pointer dereference like the following can be seen when
accessing members of the recently destructed multipath:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
RIP: 0010:[<ffffffffa003db1b>] [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath]
[<ffffffff81090ac0>] worker_thread+0x170/0x2a0
[<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
[switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer]
Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com>
Tested-by: Speagle Andy <Andy.Speagle@netapp.com>
Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.
On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.
The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
SCSI discard will damage discard stripe bio setting, eg, some fields are
changed. If the stripe is reused very soon, we have wrong bios setting. We
remove discard stripe from hash list, so next time the strip will be fully
initialized.
Suitable for backport to 3.7+.
Cc: <stable@vger.kernel.org> (3.7+)
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
SCSI layer will add new payload for discard request. If two bios are merged
to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse
SCSI and cause oops.
Suitable for backport to 3.7+
Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Since:
commit 7ceb17e87b
md: Allow devices to be re-added to a read-only array.
spares are activated on a read-only array. In case of raid1 and raid10
personalities it causes that not-in-sync devices are marked in-sync
without checking if recovery has been finished.
If a read-only array is degraded and one of its devices is not in-sync
(because the array has been only partially recovered) recovery will be skipped.
This patch adds checking if recovery has been finished before marking a device
in-sync for raid1 and raid10 personalities. In case of raid5 personality
such condition is already present (at raid5.c:6029).
Bug was introduced in 3.10 and causes data corruption.
Cc: stable@vger.kernel.org
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch fixes a particular type of data corruption that has been
encountered when loading a snapshot's metadata from disk.
When we allocate a new chunk in persistent_prepare, we increment
ps->next_free and we make sure that it doesn't point to a metadata area
by further incrementing it if necessary.
When we load metadata from disk on device activation, ps->next_free is
positioned after the last used data chunk. However, if this last used
data chunk is followed by a metadata area, ps->next_free is positioned
erroneously to the metadata area. A newly-allocated chunk is placed at
the same location as the metadata area, resulting in data or metadata
corruption.
This patch changes the code so that ps->next_free skips the metadata
area when metadata are loaded in function read_exceptions.
The patch also moves a piece of code from persistent_prepare_exception
to a separate function skip_metadata to avoid code duplication.
CVE-2013-4299
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Commit c0f04d88e4 ("bcache: Fix flushes in writeback mode") was fixing
a reported data corruption bug, but it seems some last minute
refactoring or rebasing introduced a null pointer deref.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention. For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.
This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns(). This makes confusions a lot less
likely.
There are other interfaces which take @ns before @name. They'll be
updated by following patches.
This patch doesn't introduce any functional changes.
v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
error on module builds. Reported by build test robot. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error
handling fixes for dm-multipath. A fix for the thin provisioning target
to not expose non-zero discard limits if discards are disabled.
Lastly, add two DM module parameters which allow users to tune the
emergency memory reserves that DM mainatins per device -- this helps fix
a long-standing issue for dm-multipath. The conservative default
reserve for request-based dm-multipath devices (256) has proven
problematic for users with many multipathed SCSI devices but relatively
little memory. To responsibly select a smaller value users should use
the new nr_bios tracepoint info (via commit 75afb352 "block: Add nr_bios
to block_rq_remap tracepoint") to determine the peak number of bios
their workloads create.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSQMVHAAoJEMUj8QotnQNaOXgIAJS6/XJKMoHfiDJ9M+XD34rZ
Uyr9TEnubX3DKCRBiY23MUcCQn3fx6BjCGv5/c8L4jQFIuLyDi2yatqpwXcbGSJh
G/S/y6u0Axek+ew7TS80OFop4nblW6MoKnoh9/4N55Ofa+1WvKM4ERUGjHGbauyS
TxmLQPToCFPLYRIOZ+imd6hQuIZ1+FFdJFvi7kY9O6Llx2sLD6fWi1iruBd/Da2H
ByMX3biGN45mSpcBzRbSC/FkJ9CRIvT9n82BDPS0o3Tllt8NaVlEDaovB7h4ncc0
bFuT2Z3Q38B9uZ8Lj0bqdGzv3kXMLCkLo6WhWjyUt84hmDPAzRpBwt60jUqWyZs=
=bjVp
-----END PGP SIGNATURE-----
Merge tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device-mapper fixes from Mike Snitzer:
"A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error
handling fixes for dm-multipath. A fix for the thin provisioning
target to not expose non-zero discard limits if discards are disabled.
Lastly, add two DM module parameters which allow users to tune the
emergency memory reserves that DM mainatins per device -- this helps
fix a long-standing issue for dm-multipath. The conservative default
reserve for request-based dm-multipath devices (256) has proven
problematic for users with many multipathed SCSI devices but
relatively little memory. To responsibly select a smaller value users
should use the new nr_bios tracepoint info (via commit 75afb352
"block: Add nr_bios to block_rq_remap tracepoint") to determine the
peak number of bios their workloads create"
* tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: add reserved_bio_based_ios module parameter
dm: add reserved_rq_based_ios module parameter
dm: lower bio-based mempool reservation
dm thin: do not expose non-zero discard limits if discards disabled
dm mpath: disable WRITE SAME if it fails
dm-snapshot: fix performance degradation due to small hash size
dm snapshot: workaround for a false positive lockdep warning
dm stats: fix possible counter corruption on 32-bit systems
dm mpath: do not fail path on -ENOSPC
In writeback mode, when we get a cache flush we need to make sure we
issue a flush to the backing device.
The code for sending down an extra flush was wrong - by cloning the bio
we were probably getting flags that didn't make sense for a bare flush,
and also the old code was firing for FUA bios, for which we don't need
to send a flush to the backing device.
This was causing data corruption somehow - the mechanism was never
determined, but this patch fixes it for the users that were seeing it.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
btree_sort_fixup() was overly clever, because it was trying to avoid
pulling a key off the btree iterator in more than one place.
This led to a really obscure bug where we'd break early from the loop in
btree_sort_fixup() if the current key overlapped with keys in more than
one older set, and the next key it overlapped with was zero size.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bch_journal_meta() was missing the flush to make the journal write
actually go down (instead of waiting up to journal_delay_ms)...
Whoops
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Background writeback works by scanning the btree for dirty data and
adding those keys into a fixed size buffer, then for each dirty key in
the keybuf writing it to the backing device.
When read_dirty() finishes and it's time to scan for more dirty data, we
need to wait for the outstanding writeback IO to finish - they still
take up slots in the keybuf (so that foreground writes can check for
them to avoid races) - without that wait, we'll continually rescan when
we'll be able to add at most a key or two to the keybuf, and that takes
locks that starves foreground IO. Doh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix
drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The journal replay code didn't handle this case, causing it to go into
an infinite loop...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysfs attributes with unusual characters have crappy failure modes
in Squeeze (udev 164); later versions of udev are unaffected.
This should make these characters more unusual.
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow user to change the number of IOs that are reserved by
bio-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_bio_based_ios
The default value is RESERVED_BIO_BASED_IOS (16). The maximum allowed
value is RESERVED_MAX_IOS (1024).
Export dm_get_reserved_bio_based_ios() for use by DM targets and core
code. Switch to sizing dm-io's mempool and bioset using DM core's
configurable 'reserved_bio_based_ios'.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Allow user to change the number of IOs that are reserved by
request-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_rq_based_ios
The default value is RESERVED_REQUEST_BASED_IOS (256). The maximum
allowed value is RESERVED_MAX_IOS (1024).
Export dm_get_reserved_rq_based_ios() for use by DM targets and core
code. Switch to sizing dm-mpath's mempool using DM core's configurable
'reserved_rq_based_ios'.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Bio-based device mapper processing doesn't need larger mempools (like
request-based DM does), so lower the number of reserved entries for
bio-based operation. 16 was already used for bio-based DM's bioset
but mistakenly wasn't used for it's _io_cache.
Formalize difference between bio-based and request-based defaults by
introducing RESERVED_BIO_BASED_IOS and RESERVED_REQUEST_BASED_IOS.
(based on older code from Mikulas Patocka)
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Fix issue where the block layer would stack the discard limits of the
pool's data device even if the "ignore_discard" pool feature was
specified.
The pool and thin device(s) still had discards disabled because the
QUEUE_FLAG_DISCARD request_queue flag wasn't set. But to avoid user
confusion when "ignore_discard" is used: both the pool device and the
thin device(s) have zeroes for all discard limits.
Also, always set discard_zeroes_data_unsupported in targets because they
should never advertise the 'discard_zeroes_data' capability (even if the
pool's data device supports it).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Workaround the SCSI layer's problematic WRITE SAME heuristics by
disabling WRITE SAME in the DM multipath device's queue_limits if an
underlying device disabled it.
The WRITE SAME heuristics, with both the original commit 5db44863b6
("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
WRITE SAME(10) even without successfully determining it is supported.
After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
for the device (by setting sdkp->device->no_write_same which results in
'max_write_same_sectors' in device's queue_limits to be set to 0).
When a device is stacked ontop of such a SCSI device any changes to that
SCSI device's queue_limits do not automatically propagate up the stack.
As such, a DM multipath device will not have its WRITE SAME support
disabled. This causes the block layer to continue to issue WRITE SAME
requests to the mpath device which causes paths to fail and (if mpath IO
isn't configured to queue when no paths are available) it will result in
actual IO errors to the upper layers.
This fix doesn't help configurations that have additional devices
stacked ontop of the mpath device (e.g. LVM created linear DM devices
ontop). A proper fix that restacks all the queue_limits from the bottom
of the device stack up will need to be explored if SCSI will continue to
use this model of optimistically allowing op codes and then disabling
them after they fail for the first time.
Before this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
device-mapper: multipath: Failing path 8:112.
end_request: I/O error, dev dm-6, sector 4616
dm-6: WRITE SAME failed. Manually zeroing.
end_request: I/O error, dev dm-6, sector 4616
end_request: I/O error, dev dm-6, sector 5640
end_request: I/O error, dev dm-6, sector 6664
end_request: I/O error, dev dm-6, sector 7688
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
end_request: I/O error, dev dm-6, sector 524296
Aborting journal on device dm-6-8.
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
33553920
After this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
0
It should be noted that WRITE SAME support wasn't enabled in DM
multipath until v3.10.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: stable@vger.kernel.org # 3.10+
LVM2, since version 2.02.96, creates origin with zero size, then loads
the snapshot driver and then loads the origin. Consequently, the
snapshot driver sees the origin size zero and sets the hash size to the
lower bound 64. Such small hash table causes performance degradation.
This patch changes it so that the hash size is determined by the size of
snapshot volume, not minimum of origin and snapshot size. It doesn't
make sense to set the snapshot size significantly larger than the origin
size, so we do not need to take origin size into account when
calculating the hash size.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The kernel reports a lockdep warning if a snapshot is invalidated because
it runs out of space.
The lockdep warning was triggered by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()") in v3.5.
The warning is false positive. The real cause for the warning is that
the lockdep engine treats different instances of md->lock as a single
lock.
This patch is a workaround - we use flush_workqueue instead of flush_work.
This code path is not performance sensitive (it is called only on
initialization or invalidation), thus it doesn't matter that we flush the
whole workqueue.
The real fix for the problem would be to teach the lockdep engine to treat
different instances of md->lock as separate locks.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.5+
There was a deliberate race condition in dm_stat_for_entry() to avoid the
overhead of disabling and enabling interrupts. The race could result in
some events not being counted on 64-bit architectures.
However, on 32-bit architectures, operations on long long variables are
not atomic, so the race condition could cause the counter to jump by 2^32.
Such jumps could be disruptive, so we need to do proper locking on 32-bit
architectures.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Alasdair G. Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since ENOSPC is a target-side error, dm-mpath should just pass the error
information to upper layer instead of retrying itself with path failover.
Otherwise it will end up failing all paths down while path checkers find
all paths ok.
ENOSPC can now be returned from SCSI device after commit a9d6ceb8
("[SCSI] return ENOSPC on thin provisioning failure").
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull vfs pile 4 from Al Viro:
"list_lru pile, mostly"
This came out of Andrew's pile, Al ended up doing the merge work so that
Andrew didn't have to.
Additionally, a few fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
super: fix for destroy lrus
list_lru: dynamically adjust node arrays
shrinker: Kill old ->shrink API.
shrinker: convert remaining shrinkers to count/scan API
staging/lustre/libcfs: cleanup linux-mem.h
staging/lustre/ptlrpc: convert to new shrinker API
staging/lustre/obdclass: convert lu_object shrinker to count/scan API
staging/lustre/ldlm: convert to shrinkers to count/scan API
hugepage: convert huge zero page shrinker to new shrinker API
i915: bail out earlier when shrinker cannot acquire mutex
drivers: convert shrinkers to new count/scan API
fs: convert fs shrinkers to new scan/count API
xfs: fix dquot isolation hang
xfs-convert-dquot-cache-lru-to-list_lru-fix
xfs: convert dquot cache lru to list_lru
xfs: rework buffer dispose list tracking
xfs-convert-buftarg-lru-to-generic-code-fix
xfs: convert buftarg LRU to generic code
fs: convert inode and dentry shrinking to be node aware
vmscan: per-node deferred work
...
Convert the driver shrinkers to the new API. Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.
FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out. The amount of broken code I just encountered is
mind boggling. I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, burying in mud up to it's neck and then run over repeatedly
with a blunt lawn mower.
Special mention goes to the zcache/zcache2 drivers. They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth. And that doesn't even take into account the horrible,
broken code...
[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
device-mapper device. This dm-stats code required the reintroduction of
a div64_u64_rem() helper, but as a separate method that doesn't slow
down div64_u64() -- especially on 32-bit systems.
Allow the error target to replace request-based DM devices
(e.g. multipath) in addition to bio-based DM devices.
Various other small code fixes and improvements to thin-provisioning, DM
cache and the DM ioctl interface.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQEcBAABAgAGBQJSLyNnAAoJEMUj8QotnQNaXVEIAKA1l43enaGiROBZEZXgAGUY
1JUsnHES4ujyn/jtT39jPTQf9AW/rS4FUCrZiXG2aaNHXo7+7cdVoBHAiWc7mXad
budBSqn47W7WDyFlQarKwsuYFcdLnqdnieRDMXQ1cN5dl4Rx61LclnsylQd4SSS0
lznXkfOTquetDSuEPOuUHJDZufdacw3PpxWbTKGJld40fd7YZfGWQoG0ek1OeqqL
fA30DTlYnkFyhheLCjFcDY6H55Rt7QpBWOUAa2XXLR6GLfk5iFK99autjWk2xTPT
nppRwQrw9VH+HdW0jGLU+LRs1Y3nxwT9OBLWt9wav87Smdg/7jQAjwde9eKbO2k=
=3ooH
-----END PGP SIGNATURE-----
Merge tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device-mapper updates from Mike Snitzer:
"Add the ability to collect I/O statistics on user-defined regions of a
device-mapper device. This dm-stats code required the reintroduction
of a div64_u64_rem() helper, but as a separate method that doesn't
slow down div64_u64() -- especially on 32-bit systems.
Allow the error target to replace request-based DM devices (e.g.
multipath) in addition to bio-based DM devices.
Various other small code fixes and improvements to thin-provisioning,
DM cache and the DM ioctl interface"
* tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm stripe: silence a couple sparse warnings
dm: add statistics support
dm thin: always return -ENOSPC if no_free_space is set
dm ioctl: cleanup error handling in table_load
dm ioctl: increase granularity of type_lock when loading table
dm ioctl: prevent rename to empty name or uuid
dm thin: set pool read-only if breaking_sharing fails block allocation
dm thin: prefix pool error messages with pool device name
dm: allow error target to replace bio-based and request-based targets
math64: New separate div64_u64_rem helper
dm space map: optimise sm_ll_dec and sm_ll_inc
dm btree: prefetch child nodes when walking tree for a dm_btree_del
dm btree: use pop_frame in dm_btree_del to cleanup code
dm cache: eliminate holes in cache structure
dm cache: fix stacking of geometry limits
dm thin: fix stacking of geometry limits
dm thin: add data block size limits to Documentation
dm cache: add data block size limits to code and Documentation
dm cache: document metadata device is exclussive to a cache
dm: stop using WQ_NON_REENTRANT
Headline item is multithreading for RAID5 so that more
IO/sec can be supported on fast (SSD) devices.
Also TILE-Gx SIMD suppor for RAID6 calculations and an
assortment of bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUi6dRTnsnt1WYoG5AQIqMBAAm/XUEqfyBNUiTPHmIU/OyReOlfsp8A2o
xtcmSzaCtIUz4btPszUrw3PShqnk+lXXX2AB0rp3PzfOgyNYXBRKbzOf3eGr2VEp
L/Cm0iSWHqQ7V7MoV5ZrqtvuyJV1a7FK3a3VaoKaUk424o4sZ7P67t/YZAnTCP/i
9wQoPeIOJ8YjZsaAQjzI3q7yRMRE8ytyBnF4NdgeMyr2p17w2e9pnmNfCTo4wnWs
Nu2wPr2QCPQXr/FoIhdIVEy3kVatqH8qXG8Fw+5n07HJYxGCvQZLDuoOVDYyFeoW
gnNq2MMgLZm/7Nzqd1bN+QQZuBCd5JL4VJ2G4vLfYrn3ZSdSysrVKQXFKYG3Gkua
1KP4Pv0hndAl4DtGbUk8CiZp6b+c5qeWvq+sO2NuhUGmumFMK2q4DJhITNexjmrs
Eg4opnR8JMLDkYD6o52Ziu5KQR/q1PKRLj80eoVuqB2QQM5+NPb4s3k2WN+53lQD
L9fH2alUxxSK+5R8ykk923QQ/XErMUwXaka+O/gGFAlYvaaW/GKTxFnKn/GIXAkc
tKW88zB+zA5EZEFec+K43z1UjtGxMWsryvDN55ON2iV+LIZBISm7krroBeR55cyO
+3tHlPsga0pO+9DdSm7hvZeWRrq5ZJTiZmL/e2FYygrC5tFAY0p+z49fK3e9Th13
C85G7fg3yDY=
=zLxh
-----END PGP SIGNATURE-----
Merge tag 'md/3.12' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Headline item is multithreading for RAID5 so that more IO/sec can be
supported on fast (SSD) devices. Also TILE-Gx SIMD suppor for RAID6
calculations and an assortment of bug fixes"
* tag 'md/3.12' of git://neil.brown.name/md:
raid5: only wakeup necessary threads
md/raid5: flush out all pending requests before proceeding with reshape.
md/raid5: use seqcount to protect access to shape in make_request.
raid5: sysfs entry to control worker thread number
raid5: offload stripe handle to workqueue
raid5: fix stripe release order
raid5: make release_stripe lockless
md: avoid deadlock when dirty buffers during md_stop.
md: Don't test all of mddev->flags at once.
md: Fix apparent cut-and-paste error in super_90_validate
raid6/test: replace echo -e with printf
RAID: add tilegx SIMD implementation of raid6
md: fix safe_mode buglet.
md: don't call md_allow_write in get_bitmap_file.
Eliminate the following sparse warnings:
drivers/md/dm-stripe.c:443:12: warning: symbol 'dm_stripe_init' was not declared. Should it be static?
drivers/md/dm-stripe.c:456:6: warning: symbol 'dm_stripe_exit' was not declared. Should it be static?
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support the collection of I/O statistics on user-defined regions of
a DM device. If no regions are defined no statistics are collected so
there isn't any performance impact. Only bio-based DM devices are
currently supported.
Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.
The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds. All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.
The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
See Documentation/device-mapper/statistics.txt for more details.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If pool has 'no_free_space' set it means a previous allocation already
determined the pool has no free space (and failed that allocation with
-ENOSPC). By always returning -ENOSPC if 'no_free_space' is set, we do
not allow the pool to oscillate between allocating blocks and then not.
But a side-effect of this determinism is that if a user wants to be able
to allocate new blocks they'll need to reload the pool's table (to clear
the 'no_free_space' flag). This reload will happen automatically if the
pool's data volume is resized. But if the user takes action to free a
lot of space by deleting snapshot volumes, etc the pool will no longer
allow data allocations to continue without an intervening table reload.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Hold the mapped device's type_lock before calling populate_table() since
it is where the table's type is determined based on the specified
targets. There is no need to allow concurrent table loads to race to
establish the table's targets or type.
This eliminates the need to grab the lock in dm_table_set_type().
Also verify that the type_lock is held in both dm_set_md_type() and
dm_get_md_type().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
A device-mapper device must always have a name consisting of a non-empty
string. If the device also has a uuid, this similarly must not be an
empty string.
The DM_DEV_CREATE ioctl enforces these rules when the device is created,
but this patch is needed to enforce them when DM_DEV_RENAME is used to
change the name or uuid.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
break_sharing() now handles an arbitrary alloc_data_block() error
the same way as provision_block(): marks pool read-only and errors the
cell.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Useful to know which pool is experiencing the error.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
It may be useful to switch a request-based table to the "error" target.
Enhance the DM core to allow a hybrid target_type which is capable of
handling either bios (via .map) or requests (via .map_rq).
Add a request-based map function (.map_rq) to the "error" target_type;
making it DM's first hybrid target. Train dm_table_set_type() to prefer
the mapped device's established type (request-based or bio-based). If
the mapped device doesn't have an established type default to making the
table with the hybrid target(s) bio-based.
Tested 'dmsetup wipe_table' to work on both bio-based and request-based
devices.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch set is a set of driver updates (ufs, zfcp, lpfc, mpt2/3sas,
qla4xxx, qla2xxx [adding support for ISP8044 + other things]) we also have a
new driver: esas2r which has a number of static checker problems, but which I
expect to resolve over the -rc course of 3.12 under the new driver exception.
We also have the error return updates that were discussed at LSF.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJSJfX5AAoJEDeqqVYsXL0M8u8H+gN65iA4YeNc3Eq9F6mliLfg
JOIfn6GRz7ChbQ1ZZKdH/5xCOtzXphrkg7kRGmr9frsvYZ4X2c7W3xweQTA08gqP
wPH7/xyPffPnUm/r+V+SV41pm39bEjmltknLwiF572a6iOoVYQpnmDjdZQKT0jU0
QZEqI81+646m8edCnApLw3Tlsn2gBwHaDrkd55H2IQGTkOD016C0CQbM+cNMU440
qdqDcfRWCsp1fhLo3JH2kWTx8BihhyfEYAFz4tZwuFdGGkRZxF20HwyzV0h3hZOG
kZ2Gd1BFf0SybxOcESQmAukbcH5hyumX1Y7HMYKZbS2ubD4MCO1MO8UUtLXlxNc=
=PDBQ
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull first round of SCSI updates from James Bottomley:
"This patch set is a set of driver updates (ufs, zfcp, lpfc, mpt2/3sas,
qla4xxx, qla2xxx [adding support for ISP8044 + other things]).
We also have a new driver: esas2r which has a number of static checker
problems, but which I expect to resolve over the -rc course of 3.12
under the new driver exception.
We also have the error return that were discussed at LSF"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (118 commits)
[SCSI] sg: push file descriptor list locking down to per-device locking
[SCSI] sg: checking sdp->detached isn't protected when open
[SCSI] sg: no need sg_open_exclusive_lock
[SCSI] sg: use rwsem to solve race during exclusive open
[SCSI] scsi_debug: fix logical block provisioning support when unmap_alignment != 0
[SCSI] scsi_debug: fix endianness bug in sdebug_build_parts()
[SCSI] qla2xxx: Update the driver version to 8.06.00.08-k.
[SCSI] qla2xxx: print MAC via %pMR.
[SCSI] qla2xxx: Correction to message ids.
[SCSI] qla2xxx: Correctly print out/in mailbox registers.
[SCSI] qla2xxx: Add a new interface to update versions.
[SCSI] qla2xxx: Move queue depth ramp down message to i/o debug level.
[SCSI] qla2xxx: Select link initialization option bits from current operating mode.
[SCSI] qla2xxx: Add loopback IDC-TIME-EXTEND aen handling support.
[SCSI] qla2xxx: Set default critical temperature value in cases when ISPFX00 firmware doesn't provide it
[SCSI] qla2xxx: QLAFX00 make over temperature AEN handling informational, add log for normal temperature AEN
[SCSI] qla2xxx: Correct Interrupt Register offset for ISPFX00
[SCSI] qla2xxx: Remove handling of Shutdown Requested AEN from qlafx00_process_aen().
[SCSI] qla2xxx: Send all AENs for ISPFx00 to above layers.
[SCSI] qla2xxx: Add changes in initialization for ISPFX00 cards with BIOS
...
If there are not enough stripes to handle, we'd better not always
queue all available work_structs. If one worker can only handle small
or even none stripes, it will impact request merge and create lock
contention.
With this patch, the number of work_struct running will depend on
pending stripes number. Note: some statistics info used in the patch
are accessed without locking protection. This should doesn't matter,
we just try best to avoid queue unnecessary work_struct.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some requests - particularly 'discard' and 'read' are handled
differently depending on whether a reshape is active or not.
It is harmless to assume reshape is active if it isn't but wrong
to act as though reshape is not active when it is.
So when we start reshape - after making clear to all requests that
reshape has started - use mddev_suspend/mddev_resume to flush out all
requests. This will ensure that no requests will be assuming the
absence of reshape once it really starts.
Signed-off-by: NeilBrown <neilb@suse.de>
make_request() access various shape parameters (raid_disks, chunk_size
etc) which might be changed by raid5_start_reshape().
If the later is called at and awkward time during the form, the wrong
stripe_head might be used.
So introduce a 'seqcount' and after finding a stripe_head make sure
there is no reason to expect that we got the wrong one.
Signed-off-by: NeilBrown <neilb@suse.de>
Add a sysfs entry to control running workqueue thread number. If
group_thread_cnt is set to 0, we will disable workqueue offload handling of
stripes.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
patch "make release_stripe lockless" changes the order stripes are released.
Originally I thought block layer can take care of request merge, but it appears
there are still some requests not merged. It's easy to fix the order.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
release_stripe still has big lock contention. We just add the stripe to a llist
without taking device_lock. We let the raid5d thread to do the real stripe
release, which must hold device_lock anyway. In this way, release_stripe
doesn't hold any locks.
The side effect is the released stripes order is changed. But sounds not a big
deal, stripes are never handled in order. And I thought block layer can already
do nice request merge, which means order isn't that important.
I kept the unplug release batch, which is unnecessary with this patch from lock
contention avoid point of view, and actually if we delete it, the stripe_head
release_list and lru can share storage. But the unplug release batch is also
helpful for request merge. We probably can delay wakeup raid5d till unplug, but
I'm still afraid of the case which raid5d is running.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.
However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.
So do_md_stop() calls sync_blockdev(). However at this point it is
holding ->reconfig_mutex. So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex. So we deadlock.
We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.
So introduce new flag MD_STILL_CLOSED. Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it gets set before we lock against further
opens.
It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl. Just don't do that.
Signed-off-by: NeilBrown <neilb@suse.de>
mddev->flags is mostly used to record if an update of the
metadata is needed. Sometimes the whole field is tested
instead of just the important bits. This makes it difficult
to introduce more state bits.
So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.
Signed-off-by: NeilBrown <neilb@suse.de>