Commit Graph

3245 Commits

Author SHA1 Message Date
NeilBrown
67f455486d md/raid56: Don't perform reads to support writes until stripe is ready.
If it is found that we need to pre-read some blocks before a write
can succeed, we normally set STRIPE_DELAYED and don't actually perform
the read until STRIPE_PREREAD_ACTIVE subsequently gets set.

However for a degraded RAID6 we currently perform the reads as soon
as we see that a write is pending.  This significantly hurts
throughput.

So:
 - when handle_stripe_dirtying find a block that it wants on a device
   that is failed, set STRIPE_DELAY, instead of doing nothing, and
 - when fetch_block detects that a read might be required to satisfy a
   write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
   and if we would actually need to read something to complete the write.

This also helps RAID5, though less often as RAID5 supports a
read-modify-write cycle.  For RAID5 the read is performed too early
only if the write is not a full 4K aligned write (i.e. no an
R5_OVERWRITE).

Also clean up a couple of horrible bits of formatting.

Reported-by: Patrik Horník <patrik@dsl.sk>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-29 16:59:46 +10:00
NeilBrown
bd8839e03b md: refuse to change shape of array if it is active but read-only
read-only arrays should not be changed.  This includes changing
the level, layout, size, or number of devices.

So reject those changes for readonly arrays.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-29 16:59:30 +10:00
NeilBrown
2ac295a544 md: always set MD_RECOVERY_INTR when interrupting a reshape thread.
Commit 8313b8e57f
   md: fix problem when adding device to read-only array with bitmap.

added a called to md_reap_sync_thread() which cause a reshape thread
to be interrupted (in particular, it could cause md_thread() to never even
call md_do_sync()).
However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
know that the reshape didn't complete.

This only happens when mddev->ro is set and normally reshape threads
don't run in that situation.  But raid5 and raid10 can start a reshape
thread during "run" is the array is in the middle of a reshape.
They do this even if ->ro is set.

So it is best to set MD_RECOVERY_INTR before abortingg the
sync thread, just in case.

Though it rare for this to trigger a problem it can cause data corruption
because the reshape isn't finished properly.
So it is suitable for any stable which the offending commit was applied to.
(3.2 or later)

Fixes: 8313b8e57f
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-29 16:59:06 +10:00
NeilBrown
3991b31ea0 md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".
If mddev->ro is set, md_to_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.

If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.

Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted.  If the array is also set
read-only at this point, the reshape will instantly appear to success,
resulting in data corruption.

Consequently, this patch is suitable for any -stable kernel.

Cc: stable@vger.kernel.org (any)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-28 13:39:39 +10:00
Linus Torvalds
23de4a7af7 A dm-crypt fix for a cpu hotplug crash that switches from using per-cpu
data to a mempool allocation (which offers allocation with cpu locality,
 and there is no inter-cpu communication on slab allocation).
 
 A couple dm-thinp stable fixes to address "out-of-data-space" issues.
 
 A dm-multipath fix for a LOCKDEP warning introduced in 3.15-rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTdiI7AAoJEMUj8QotnQNa1fMIAOSGppH4U/VuT1+UMDyabUba
 eXsK8xBUTIDSBuTJ+ljkE5fyvXpn/wvA+b1hTKLhzVkUZ1pCY4pIw1pwpVcw89Bb
 BhktFWRYvcv/MAARDHiMGW5yc6xP319Qm04XN3xbMHx71gxGRwpzb191LSO5S2VR
 0rjXvZZt7WPJe/QPOFUrqyoP7t59LH9hu2/OH/Ic9o5/D0WxbqPEP6X8iJyIs32u
 lNvIQ5r5f3xNzt0VDvEq3sxR3qYhQvLPDMdp0YR67c87fKfsKQj4pkvXltXhY5bM
 wBHFE+NBl6MPzTXNCT9i+p360GXE7B/lY9boochAyE/UEztRq1+oqJOa/dDaBCo=
 =0vk2
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.15-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "A dm-crypt fix for a cpu hotplug crash that switches from using
  per-cpu data to a mempool allocation (which offers allocation with cpu
  locality, and there is no inter-cpu communication on slab allocation).

  A couple dm-thinp stable fixes to address "out-of-data-space" issues.

  A dm-multipath fix for a LOCKDEP warning introduced in 3.15-rc1"

* tag 'dm-3.15-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm mpath: fix lock order inconsistency in multipath_ioctl
  dm thin: add timeout to stop out-of-data-space mode holding IO forever
  dm thin: allow metadata commit if pool is in PM_OUT_OF_DATA_SPACE mode
  dm crypt: fix cpu hotplug crash by removing per-cpu structure
2014-05-21 17:57:31 +09:00
Mike Snitzer
4cdd2ad780 dm mpath: fix lock order inconsistency in multipath_ioctl
Commit 3e9f1be1b4 ("dm mpath: remove process_queued_ios()") did not
consistently take the multipath device's spinlock (m->lock) before
calling dm_table_run_md_queue_async() -- which takes the q->queue_lock.

Found with code inspection using hint from reported lockdep warning.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-05-14 16:12:17 -04:00
Joe Thornber
85ad643b7e dm thin: add timeout to stop out-of-data-space mode holding IO forever
If the pool runs out of data space, dm-thin can be configured to
either error IOs that would trigger provisioning, or hold those IOs
until the pool is resized.  Unfortunately, holding IOs until the pool is
resized can result in a cascade of tasks hitting the hung_task_timeout,
which may render the system unavailable.

Add a fixed timeout so IOs can only be held for a maximum of 60 seconds.
If LVM is going to resize a thin-pool that is out of data space it needs
to be prompt about it.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
2014-05-14 16:11:37 -04:00
Joe Thornber
8d07e8a5f5 dm thin: allow metadata commit if pool is in PM_OUT_OF_DATA_SPACE mode
Commit 3e1a0699 ("dm thin: fix out of data space handling") introduced
a regression in the metadata commit() method by returning an error if
the pool is in PM_OUT_OF_DATA_SPACE mode.  This oversight caused a thin
device to return errors even if the default queue_if_no_space ENOSPC
handling mode is used.

Fix commit() to only fail if pool is in PM_READ_ONLY or PM_FAIL mode.

Reported-by: qindehua@163.com
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
2014-05-14 16:11:36 -04:00
Mikulas Patocka
610f2de355 dm crypt: fix cpu hotplug crash by removing per-cpu structure
The DM crypt target used per-cpu structures to hold pointers to a
ablkcipher_request structure.  The code assumed that the work item keeps
executing on a single CPU, so it didn't use synchronization when
accessing this structure.

If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online,
the work item could be moved to another CPU.  This causes dm-crypt
crashes, like the following, because the code starts using an incorrect
ablkcipher_request:

 smpboot: CPU 7 is now offline
 BUG: unable to handle kernel NULL pointer dereference at 0000000000000130
 IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt]
 ...
 Call Trace:
  [<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt]
  [<ffffffff81062060>] ? finish_task_switch+0x40/0xc0
  [<ffffffff81052a28>] ? process_one_work+0x168/0x470
  [<ffffffff8105366b>] ? worker_thread+0x10b/0x390
  [<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290
  [<ffffffff81058d9f>] ? kthread+0xaf/0xc0
  [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
  [<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0
  [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120

Fix this bug by removing the per-cpu definition.  The structure
ablkcipher_request is accessed via a pointer from convert_context.
Consequently, if the work item is rescheduled to a different CPU, the
thread still uses the same ablkcipher_request.

This change may undermine performance improvements intended by commit
c0297721 ("dm crypt: scale to multiple cpus") on select hardware.  In
practice no performance difference was observed on recent hardware.  But
regardless, correctness is more important than performance.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-05-14 16:11:35 -04:00
Linus Torvalds
2ddb5998d0 Two bugfixes for md in 3.15
Both tagged for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAU3AfITnsnt1WYoG5AQJiqQ/+Pk4n3AQqqtfjPaR5EWmAVwgLgvy7AX8z
 yG9UwN9AXqd1IkgaE+PzUwZHEUR1/fYeF52c5cakrHCvluHgxakUX6/T/f9dO8Ht
 rXK4Q82aTfm+5lfUsZfOL8aeY9ZheXXo97vbVAfegdIDNC6Il2nktHj6AfBfQWlQ
 r0hm3Vz1rgXxXVam7SLlbxa71JUxltlSpLqUoN487iF/hSJx5D04NiLFT8KJwtUh
 UtMiyNsUpMJHWfYZjTsX4+o9psLZB2fE+WXJvYy5jB3C/Yy3FB0x38fVTC7+ozej
 F0J8bhG/6oO0/0gieW7EXTDWNLlCtG8Z/rUi/Hre+7Lps3vp7V65q/uB1B2VnNjn
 TRzbEaCoWdzMjamp5btSzN64MJgvCPRn1TvPwcm+kSDk/IpslYMllwXK7H+UutXZ
 GEEw3TVz1jWk7JKxai9raApKtXB7yDpiKREFMjhowBb0rM+VL4/3gvzSpPyVbJxj
 4TTj9fUqsXWMG4HzKuyxXlV51hAbcaVnYirf0JrkjzzYkl0d/oBAADQtaApD+NX2
 thlfYUW4tjssmMB+X5ok5Zp4A0TV31a1bEmZ8CE63i/IHCf5F8BHsHpyO4P9ITDX
 zNEo1lKuIbhn5oVHDoLZjNgIPGi2+lq6jvq8+0POKyEBr++Nrbld2u0GB8Q3/SjE
 LAhU+0iUY6A=
 =9QhO
 -----END PGP SIGNATURE-----

Merge tag 'md/3.15-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "Two bugfixes for md in 3.15

  Both tagged for -stable"

* tag 'md/3.15-fixes' of git://neil.brown.name/md:
  md: avoid possible spinning md thread at shutdown.
  md/raid10: call wait_barrier() for each request submitted.
2014-05-13 11:11:48 +09:00
NeilBrown
0f62fb220a md: avoid possible spinning md thread at shutdown.
If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:

1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata
   and md_check_recover enters an infinite loop.

Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.

Cc: stable@vger.kernel.org (any kernel)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-06 09:49:31 +10:00
NeilBrown
cc13b1d150 md/raid10: call wait_barrier() for each request submitted.
wait_barrier() includes a counter, so we must call it precisely once
(unless balanced by allow_barrier()) for each request submitted.

Since
commit 20d0189b10
    block: Introduce new bio_split()
in 3.14-rc1, we don't call it for the extra requests generated when
we need to split a bio.

When this happens the counter goes negative, any resync/recovery will
never start, and  "mdadm --stop" will hang.

Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 20d0189b10
Cc: stable@vger.kernel.org (3.14+)
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-06 09:49:26 +10:00
Linus Torvalds
54366a7fd6 A few dm-thinp fixes for changes merged in 3.15-rc1.
A dm-verity fix for an immutable biovec regression that affects 3.14+.
 
 A dm-cache fix to properly quiesce when using writethrough mode (3.14+).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTY+6+AAoJEMUj8QotnQNaeC4H/35S9GZL8SVPEDS5nbQ9YdZ9
 co7wAYIGswOInX9u8nq0TqcNtBMhxwwdRX9ScPxHVUTT+/lM/c7axHiMqVjZrMme
 SVmmAXMp2uUMAnK4BGIQs8jjeyxBCHUF/gyfC3OC+RF72Z1bDkG/xXyKsljBSzMe
 RP0iFvvvA1Sm7XzBJRuhZLIdJGkXFAy0ooEBICQOoudg6iDvDKCtiU+owB/x4bBh
 xi9b1MY2VjkobWES6fyW/atolCEpgwU4xhsLl3w534P9oFvCkLEp4GTxdFWBhepl
 K3usGr0t1QhmHy1hKw++WGsAkMRHocf8nIBqxxdDNWpZvOif2z+weLYbOn+TXTM=
 =1Yvj
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "A few dm-thinp fixes for changes merged in 3.15-rc1.

  A dm-verity fix for an immutable biovec regression that affects 3.14+.

  A dm-cache fix to properly quiesce when using writethrough mode (3.14+)"

* tag 'dm-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix writethrough mode quiescing in cache_map
  dm thin: use INIT_WORK_ONSTACK in noflush_work to avoid ODEBUG warning
  dm verity: fix biovecs hash calculation regression
  dm thin: fix rcu_read_lock being held in code that can sleep
  dm thin: irqsave must always be used with the pool->lock spinlock
2014-05-02 14:14:02 -07:00
Mike Snitzer
131cd131a9 dm cache: fix writethrough mode quiescing in cache_map
Commit 2ee57d5873 ("dm cache: add passthrough mode") inadvertently
removed the deferred set reference that was taken in cache_map()'s
writethrough mode support.  Restore taking this reference.

This issue was found with code inspection.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-05-01 16:14:24 -04:00
Mike Snitzer
fbcde3d8b9 dm thin: use INIT_WORK_ONSTACK in noflush_work to avoid ODEBUG warning
Use INIT_WORK_ONSTACK to silence "ODEBUG: object is on stack, but not
annotated".

Reported-by: Zdeněk Kabeláč <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-04-29 11:22:04 -04:00
Linus Torvalds
23c1a60e2e One BUG fix for md for recent commit
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAU099lznsnt1WYoG5AQJIehAAoPdK4dUZ+A2g+hYxMbXCioakAaqDZwzt
 nFkYMZjJSan7yugkOpd9zBNR864c/9UYAnuggimimuXZuKu0N++Y8/ztJ7FjncDk
 7/R3SPF8AtTaTm0BJ9mzK+/sfBxLRDl1v4Z+ZUAzweH6TTTLzKinuSgIXFObacV4
 DjN2Cf1xZHHmUIXK3kzE0sNC+C8nVXlvFz4gdiCAeHloXMp78a//TucBaN9lpE4z
 +h3FN4++0w+2aFgURdddnmIhY6v76m1fWF7Q9qcbGcnXDnpAxis5CgprBcKGwNAa
 o0bbVl1MNWlcVxO1H1wafbxrXTQZwE71UE47ssXl6vqePUpM1tKVm5ZP2wFbIlTN
 kwIRne2oWmhsBw177K6WUohaY28wHohi+ukt6UzfX81Zm6HAnXnB5LLneEizRTO/
 WBBftzoObiKJ758HIbPs6s300DoSw8CPs/CmdLO9ycxo1m2p2tmDz0802W5k2mO/
 pFSxDGL43c91cnHaoJPAgrWOHf45Lo8IKxfUZDLVliuhgvNKLP+CSyMCLAiV7Kxc
 aeuI1a9fcmjc/+rRSpC62itzk9tQeinI9TR2iBZJUnQVnTfFoPU889tED6jkElbP
 E7A+XBHbuOiRisjynX4RebFb2t23ONSnRLd1/Ce3dkVnAB75v2Zbh0xZ1usHlrH4
 3uPiETq2KiE=
 =CxEv
 -----END PGP SIGNATURE-----

Merge tag '3.15-fixes' of git://neil.brown.name/md

Pull md bugfix from Neil Brown:
 "One BUG fix for md for recent commit"

* tag '3.15-fixes' of git://neil.brown.name/md:
  raid5: fix a race of stripe count check
2014-04-17 10:51:01 -07:00
Shaohua Li
c7a6d35e46 raid5: fix a race of stripe count check
I hit another BUG_ON with e240c1839d. In __get_priority_stripe(),
stripe count equals to 0 initially. Between atomic_inc and BUG_ON,
get_active_stripe() finds the stripe. So the stripe count isn't 1 any more.

V2: keeps the BUG_ON suggested by Neil.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-17 17:05:28 +10:00
Milan Broz
3a7745215e dm verity: fix biovecs hash calculation regression
Commit 003b5c5719 ("block: Convert drivers
to immutable biovecs") incorrectly converted biovec iteration in
dm-verity to always calculate the hash from a full biovec, but the
function only needs to calculate the hash from part of the biovec (up to
the calculated "todo" value).

Fix this issue by limiting hash input to only the requested data size.

This problem was identified using the cryptsetup regression test for
veritysetup (verity-compat-test).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
2014-04-15 12:19:24 -04:00
Linus Torvalds
7f87307818 Just a few md patches for the 3.15 merge window.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAU0h5fjnsnt1WYoG5AQKmkw/8DUn0vV4q5UbLp0m2Yy6o6EOxwiSJUH/p
 6EEUmSwyXou75w9OWOJKDX2lI7z1yAtzqiuCQ19ekD5lA6gAosja+D0jKjv0SA01
 rQm2rMjnwOIxZUIRx/7Z+w/H1ZxjIAc8uxOcCBP6DOynWt/YAyVz1SzLLtwCxELL
 N9vjgb+4lVt0E9bBYvVRNiJmtVXpDcmn1YySd6Dqyj9t+Mmysnv8QuIStrT3CE7k
 apss3ew6bBbtbiJHuCno/Q4FDWVAhUH+9GMvksdajw8QW52oHV+RRBB5IpCU6hOx
 OKCT4MVdzmTgi6GRhSr86Dt+KMOLWZmbx7pK7aRQPiL6uFNhqAlJDb/u0xfaHohG
 DiRclZBbsHkEpejHaZcJCkyKFHQTEiia3JVk426FAhtiK1qIBuyxEc65RmKf6dsA
 1KlZVeclD3wYWKG4hWk/0W3qIPOWBMll+Ely5Zg6s2X3gGy9u5TU+tUsfJaL1aDU
 NOY+5D0+hg7o21kK7WgTaP2upexC/iaBrVrdlasM2KYXJVDrsfCAQr1/BwTl4qLq
 Lm/OkIg+WtrQ95RvsI85Hm4PJVxBd1HeyDlKNCcz47kc3Xxqabeq8KnwERyOh4hU
 U4EmAeCZmSGOOETIWQQxlIn8XdM1+dF4olUH9viEAXrQfGgUyrg6Vcc7BOdTTZCY
 8ek3CWG3TwE=
 =kHl3
 -----END PGP SIGNATURE-----

Merge tag 'md/3.15' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Just a few md patches for the 3.15 merge window.

  Not much happening in md/raid at the moment.  Just a few bug fixes
  (one for -stable) and a couple of performance tweaks"

* tag 'md/3.15' of git://neil.brown.name/md:
  raid5: get_active_stripe avoids device_lock
  raid5: make_request does less prepare wait
  md: avoid oops on unload if some process is in poll or select.
  md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails.
  md/bitmap: don't abuse i_writecount for bitmap files.
2014-04-11 17:20:38 -07:00
Shaohua Li
e240c1839d raid5: get_active_stripe avoids device_lock
For sequential workload (or request size big workload), get_active_stripe can
find cached stripe. In this case, we always hold device_lock, which exposes a
lot of lock contention for such workload. If stripe count isn't 0, we don't
need hold the lock actually, since we just increase its count. And this is the
hot code path for such workload. Unfortunately we must delete the BUG_ON.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-09 14:42:42 +10:00
Shaohua Li
27c0f68f07 raid5: make_request does less prepare wait
In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
lot of contention for sequential workload (or big request size
workload). For such workload, each bio includes several stripes. So we
can just do prepare_to_wait/finish_wait once for the whold bio instead
of every stripe.  This reduces the lock contention completely for such
workload. Random workload might have the similar lock contention too,
but I didn't see it yet, maybe because my stroage is still not fast
enough.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-09 14:42:38 +10:00
NeilBrown
e2f23b606b md: avoid oops on unload if some process is in poll or select.
If md-mod is unloaded while some process is in poll() or select(),
then that process maintains a pointer to md_event_waiters, and when
the try to unlink from that list, they will oops.

The procfs infrastructure ensures that ->poll won't be called after
remove_proc_entry, but doesn't provide a wait_queue_head for us to
use, and the waitqueue code doesn't provide a way to remove all
listeners from a waitqueue.

So we need to:
 1/ make sure no further references to md_event_waiters are taken (by
    setting md_unloading)
 2/ wake up all processes currently waiting, and
 3/ wait until all those processes have disconnected from our
    wait_queue_head.

Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-09 14:42:34 +10:00
NeilBrown
da1aab3dca md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails.
When performing a user-request check/repair (MD_RECOVERY_REQUEST is set)
on a raid1, we allocate multiple bios each with their own set of pages.

If the page allocations for one bio fails, we currently do *not* free
the pages allocated for the previous bios, nor do we free the bio itself.

This patch frees all the already-allocate pages, and makes sure that
all the bios are freed as well.

This bug can cause a memory leak which can ultimately OOM a machine.
It was introduced in 3.10-rc1.

Fixes: a07876064a
Cc: Kent Overstreet <koverstreet@google.com>
Cc: stable@vger.kernel.org (3.10+)
Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-09 14:42:23 +10:00
NeilBrown
035328c202 md/bitmap: don't abuse i_writecount for bitmap files.
md bitmap code currently tries to use i_writecount to stop any other
process from writing to out bitmap file.  But that is really an abuse
and has bit-rotted so locking is all wrong.

So discard that - root should be allowed to shoot self in foot.

Still use it in a much less intrusive way to stop the same file being
used as bitmap on two different array, and apply other checks to
ensure the file is at least vaguely usable for bitmap storage
(is regular, is open for write.  Support for ->bmap is already checked
elsewhere).

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-04-09 12:26:59 +10:00
Joe Thornber
b10ebd34cc dm thin: fix rcu_read_lock being held in code that can sleep
Commit c140e1c4e2 ("dm thin: use per thin device deferred bio lists")
introduced the use of an rculist for all active thin devices.  The use
of rcu_read_lock() in process_deferred_bios() can result in a BUG if a
dm_bio_prison_cell must be allocated as a side-effect of bio_detain():

 BUG: sleeping function called from invalid context at mm/mempool.c:203
 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0
 3 locks held by kworker/u8:0/6:
   #0:  ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
   #1:  ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
   #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0

We can't process deferred bios with the rcu lock held, since
dm_bio_prison_cell allocation may block if the bio-prison's cell mempool
is exhausted.

To fix:

- Introduce a refcount and completion field to each thin_c

- Add thin_get/put methods for adjusting the refcount.  If the refcount
  hits zero then the completion is triggered.

- Initialise refcount to 1 when creating thin_c

- When iterating the active_thins list we thin_get() whilst the rcu
  lock is held.

- After the rcu lock is dropped we process the deferred bios for that
  thin.

- When destroying a thin_c we thin_put() and then wait for the
  completion -- to avoid a race between the worker thread iterating
  from that thin_c and destroying the thin_c.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-04-08 10:18:35 -04:00
Joe Thornber
5e3283e292 dm thin: irqsave must always be used with the pool->lock spinlock
Commit c140e1c4e2 ("dm thin: use per thin device deferred bio lists")
incorrectly stopped disabling irqs when taking the pool's spinlock.

Irqs must be disabled when taking the pool's spinlock otherwise a thread
could spin_lock(), then get interrupted to service thin_endio() in
interrupt context, which would then deadlock in spin_lock_irqsave().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-04-08 10:10:51 -04:00
Linus Torvalds
04535d273e . Fix dm-cache corruption caused by discard_block_size >
cache_block_size
 
 . Fix a lock-inversion detected by LOCKDEP in dm-cache
 
 . Fix a dangling bio bug in the dm-thinp target's process_deferred_bios
   error path
 
 . Fix corruption due to non-atomic transaction commit which allowed a
   metadata superblock to be written before all other metadata was
   successfully written -- this is common to all targets that use the
   persistent-data library's transaction manager (dm-thinp, dm-cache and
   dm-era).
 
 . Various small cleanups in the DM core
 
 . Add the dm-era target which is useful for keeping track of which
   blocks were written within a user defined period of time called an
   'era'.  Use cases include tracking changed blocks for backup software,
   and partially invalidating the contents of a cache to restore cache
   coherency after rolling back a vendor snapshot.
 
 . Improve the on-disk layout of multithreaded writes to the dm-thin-pool
   by splitting the pool's deferred bio list to be a per-thin device list
   and then sorting that list using an rb_tree.  The subsequent read
   throughput of the data written via multiple threads improved by ~70%.
 
 . Simplify the multipath target's handling of queuing IO by pushing
   requests back to the request queue rather than queueing the IO
   internally.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTPv/6AAoJEMUj8QotnQNagQYH/3EkB2f66TRfjRQpVAZuchw/
 U0IbVWcMJKMdhj3uaSNzIkAbTgF+QsZUOLHP/7Q6zLq0M2J3WGrJn2ELqq53MenF
 E0+rJ8duKnJ5oLhhVT62ukBDh3XHWT0JyijXPWNa2gUoYwJqM9BAlXbC/OTfUNaZ
 mBCxvUWGME8k3ht310GhwvzBQjYuxIXhw8XlbGvakb9S83SZwNpCh231iumOEzPe
 Vzfx/xTto0fH3R5/knNV/H9xt0Dv4vt4Aqbqqys9UbQvPzj9qN/mxUZIFg+LZh/w
 WuvHHw6HcAiNNrQGFcm6i1AK2jJ+F61K3afMlYsiamTxMNM+0q/B9HemkX/0ieU=
 =lY8m
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper changes from Mike Snitzer:

 - Fix dm-cache corruption caused by discard_block_size > cache_block_size

 - Fix a lock-inversion detected by LOCKDEP in dm-cache

 - Fix a dangling bio bug in the dm-thinp target's process_deferred_bios
   error path

 - Fix corruption due to non-atomic transaction commit which allowed a
   metadata superblock to be written before all other metadata was
   successfully written -- this is common to all targets that use the
   persistent-data library's transaction manager (dm-thinp, dm-cache and
   dm-era).

 - Various small cleanups in the DM core

 - Add the dm-era target which is useful for keeping track of which
   blocks were written within a user defined period of time called an
   'era'.  Use cases include tracking changed blocks for backup
   software, and partially invalidating the contents of a cache to
   restore cache coherency after rolling back a vendor snapshot.

 - Improve the on-disk layout of multithreaded writes to the
   dm-thin-pool by splitting the pool's deferred bio list to be a
   per-thin device list and then sorting that list using an rb_tree.
   The subsequent read throughput of the data written via multiple
   threads improved by ~70%.

 - Simplify the multipath target's handling of queuing IO by pushing
   requests back to the request queue rather than queueing the IO
   internally.

* tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
  dm cache: fix a lock-inversion
  dm thin: sort the per thin deferred bios using an rb_tree
  dm thin: use per thin device deferred bio lists
  dm thin: simplify pool_is_congested
  dm thin: fix dangling bio in process_deferred_bios error path
  dm mpath: print more useful warnings in multipath_message()
  dm-mpath: do not activate failed paths
  dm mpath: remove extra nesting in map function
  dm mpath: remove map_io()
  dm mpath: reduce memory pressure when requeuing
  dm mpath: remove process_queued_ios()
  dm mpath: push back requests instead of queueing
  dm table: add dm_table_run_md_queue_async
  dm mpath: do not call pg_init when it is already running
  dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
  dm: stop using bi_private
  dm: remove dm_get_mapinfo
  dm: make dm_table_alloc_md_mempools static
  dm: take care to copy the space map roots before locking the superblock
  dm transaction manager: fix corruption due to non-atomic transaction commit
  ...
2014-04-05 18:49:31 -07:00
Joe Thornber
0596661f0a dm cache: fix a lock-inversion
When suspending a cache the policy is walked and the individual policy
hints written to the metadata via sync_metadata().  This led to this
lock order:

      policy->lock
        cache_metadata->root_lock

When loading the cache target the policy is populated while the metadata
lock is held:

      cache_metadata->root_lock
         policy->lock

Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
ensuring the cache_metadata root_lock is held whilst all the hints are
written, rather than being repeatedly locked while policy->lock is held
(as was the case with each callout that policy_walk_mappings() made to
the old save_hint() method).

Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
locking correctness") build option.  However, it is not clear how the
LOCKDEP reported paths can lead to a deadlock since the two paths,
suspending a target and loading a target, never occur at the same time.
But that doesn't mean the same lock-inversion couldn't have occurred
elsewhere.

Reported-by: Marian Csontos <mcsontos@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-04-04 14:53:05 -04:00
Mike Snitzer
67324ea188 dm thin: sort the per thin deferred bios using an rb_tree
A thin-pool will allocate blocks using FIFO order for all thin devices
which share the thin-pool.  Because of this simplistic allocation the
thin-pool's space can become fragmented quite easily; especially when
multiple threads are requesting blocks in parallel.

Sort each thin device's deferred_bio_list based on logical sector to
help reduce fragmentation of the thin-pool's ondisk layout.

The following tables illustrate the realized gains/potential offered by
sorting each thin device's deferred_bio_list.  An "io size"-sized random
read of the device would result in "seeks/io" fragments being read, with
an average "distance/seek" between each fragment.

Data was written to a single thin device using multiple threads via
iozone (8 threads, 64K for both the block_size and io_size).

unsorted:

     io size   seeks/io distance/seek
  --------------------------------------
          4k    0.000   0b
         16k    0.013   11m
         64k    0.065   11m
        256k    0.274   10m
          1m    1.109   10m
          4m    4.411   10m
         16m    17.097  11m
         64m    60.055  13m
        256m    148.798 25m
          1g    809.929 21m

sorted:

     io size   seeks/io distance/seek
  --------------------------------------
          4k    0.000   0b
         16k    0.000   1g
         64k    0.001   1g
        256k    0.003   1g
          1m    0.011   1g
          4m    0.045   1g
         16m    0.181   1g
         64m    0.747   1011m
        256m    3.299   1g
          1g    14.373  1g

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-04-04 14:53:03 -04:00
Linus Torvalds
b33ce44299 Merge branch 'for-3.15/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "On top of the core pull request, here's the pull request for the
  driver related changes for 3.15.  It contains:

   - Improvements for msi-x registration for block drivers (mtip32xx,
     skd, cciss, nvme) from Alexander Gordeev.

   - A round of cleanups and improvements for drbd from Andreas
     Gruenbacher and Rashika Kheria.

   - A round of clanups and improvements for bcache from Kent.

   - Removal of sleep_on() and friends in DAC960, ataflop, swim3 from
     Arnd Bergmann.

   - Bug fix for a bug in the mtip32xx async completion code from Sam
     Bradshaw.

   - Bug fix for accidentally bouncing IO on 32-bit platforms with
     mtip32xx from Felipe Franciosi"

* 'for-3.15/drivers' of git://git.kernel.dk/linux-block: (103 commits)
  bcache: remove nested function usage
  bcache: Kill bucket->gc_gen
  bcache: Kill unused freelist
  bcache: Rework btree cache reserve handling
  bcache: Kill btree_io_wq
  bcache: btree locking rework
  bcache: Fix a race when freeing btree nodes
  bcache: Add a real GC_MARK_RECLAIMABLE
  bcache: Add bch_keylist_init_single()
  bcache: Improve priority_stats
  bcache: Better alloc tracepoints
  bcache: Kill dead cgroup code
  bcache: stop moving_gc marking buckets that can't be moved.
  bcache: Fix moving_pred()
  bcache: Fix moving_gc deadlocking with a foreground write
  bcache: Fix discard granularity
  bcache: Fix another bug recovering from unclean shutdown
  bcache: Fix a bug recovering from unclean shutdown
  bcache: Fix a journalling reclaim after recovery bug
  bcache: Fix a null ptr deref in journal replay
  ...
2014-04-01 19:43:53 -07:00
Linus Torvalds
675c354a95 Char/Misc driver patches for 3.15-rc1
Here's the big char/misc driver updates for 3.15-rc1.
 
 Lots of various things here, including the new mcb driver subsystem.
 
 All of these have been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlM7ArIACgkQMUfUDdst+ylS+gCfcJr0Zo2v5aWnqD7rFtFETmFI
 LhcAoNTQ4cvlVdxnI0driWCWFYxLj6at
 =aj+L
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver patches from Greg KH:
 "Here's the big char/misc driver updates for 3.15-rc1.

  Lots of various things here, including the new mcb driver subsystem.

  All of these have been in linux-next for a while"

* tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (118 commits)
  extcon: Move OF helper function to extcon core and change function name
  extcon: of: Remove unnecessary function call by using the name of device_node
  extcon: gpio: Use SIMPLE_DEV_PM_OPS macro
  extcon: palmas: Use SIMPLE_DEV_PM_OPS macro
  mei: don't use deprecated DEFINE_PCI_DEVICE_TABLE macro
  mei: amthif: fix checkpatch error
  mei: client.h fix checkpatch errors
  mei: use cl_dbg where appropriate
  mei: fix Unnecessary space after function pointer name
  mei: report consistently copy_from/to_user failures
  mei: drop pr_fmt macros
  mei: make me hw headers private to me hw.
  mei: fix memory leak of pending write cb objects
  mei: me: do not reset when less than expected data is received
  drivers: mcb: Fix build error discovered by 0-day bot
  cs5535-mfgpt: Simplify dependencies
  spmi: pm: drop bus-level PM suspend/resume routines
  spmi: pmic_arb: make selectable on ARCH_QCOM
  Drivers: hv: vmbus: Increase the limit on the number of pfns we can handle
  pch_phub: Report error writing MAC back to user
  ...
2014-04-01 16:13:21 -07:00
Mike Snitzer
c140e1c4e2 dm thin: use per thin device deferred bio lists
The thin-pool previously only had a single deferred_bios list that would
collect bios for all thin devices in the pool.  Split this per-pool
deferred_bios list out to per-thin deferred_bios_list -- doing so
enables increased parallelism when processing deferred bios.  And now
that each thin device has it's own deferred_bios_list we can sort all
bios in the list using logical sector.  The requeue code in error
handling path is also cleaner as a side-effect.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-03-31 14:14:15 -04:00
Mike Snitzer
760fe67e53 dm thin: simplify pool_is_congested
The pool is congested if the pool is in PM_OUT_OF_DATA_SPACE mode.  This
is more explicit/clear/efficient than inferring whether or not the pool
is congested by checking if retry_on_resume_list is empty.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-03-31 10:05:51 -04:00
Mike Snitzer
fe76cd88e6 dm thin: fix dangling bio in process_deferred_bios error path
If unable to ensure_next_mapping() we must add the current bio, which
was removed from the @bios list via bio_list_pop, back to the
deferred_bios list before all the remaining @bios.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-03-28 14:37:02 -04:00
Jose Castillo
a356e42620 dm mpath: print more useful warnings in multipath_message()
The warning message "Unrecognised multipath message received" is
displayed in two different situations in multipath_message(): when the
number of arguments passed is invalid and when the string passed in
argv[0] is not recognized.

Make it easier to identify where the problem is by making these warnings
more specific with additional context for each case.

Signed-off-by: Jose Castillo <jcastillo@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
3a01750964 dm-mpath: do not activate failed paths
activate_path() is run without a lock, so the path might be
set to failed before activate_path() had a chance to run.
This patch add a check for ->active in activate_path() to
avoid unnecessary overhead by calling functions which are known
to be failing.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:25 -04:00
Mike Snitzer
9bf59a611a dm mpath: remove extra nesting in map function
Return early for case when no path exists, and when the
pathgroup isn't ready. This eliminates the need for
extra nesting for the the common case.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
36fcffcc65 dm mpath: remove map_io()
multipath_map() is now just a wrapper around map_io(), so we
can rename map_io() to multipath_map().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
e3bde04f1e dm mpath: reduce memory pressure when requeuing
When multipath needs to requeue I/O in the block layer the per-request
context shouldn't be allocated, as it will be freed immediately
afterwards anyway.  Avoiding this memory allocation will reduce memory
pressure during requeuing.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
3e9f1be1b4 dm mpath: remove process_queued_ios()
process_queued_ios() has served 3 functions:
  1) select pg and pgpath if none is selected
  2) start pg_init if requested
  3) dispatch queued IOs when pg is ready

Basically, a call to queue_work(process_queued_ios) can be replaced by
dm_table_run_md_queue_async(), which runs request queue and ends up
calling map_io(), which does 1), 2) and 3).

Exception is when !pg_ready() (which means either pg_init is running or
requested), then multipath_busy() prevents map_io() being called from
request_fn.

If pg_init is running, it should be ok as long as pg_init_done() does
the right thing when pg_init is completed, I.e.: restart pg_init if
!pg_ready() or call dm_table_run_md_queue_async() to kick map_io().

If pg_init is requested, we have to make sure the request is detected
and pg_init will be started.  pg_init is requested in 3 places:
  a) __choose_pgpath() in map_io()
  b) __choose_pgpath() in multipath_ioctl()
  c) pg_init retry in pg_init_done()
a) is ok because map_io() calls __pg_init_all_paths(), which does 2).
b) needs a call to __pg_init_all_paths(), which does 2).
c) needs a call to __pg_init_all_paths(), which does 2).

So this patch removes process_queued_ios() and ensures that
__pg_init_all_paths() is called at the appropriate locations.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Hannes Reinecke
e809917735 dm mpath: push back requests instead of queueing
There is no reason why multipath needs to queue requests internally for
queue_if_no_path or pg_init; we should rather push them back onto the
request queue.

And while we're at it we can simplify the conditional statement in
map_io() to make it easier to read.

Since mpath no longer does internal queuing of I/O the table info no
longer emits the internal queue_size.  Instead it displays 1 if queuing
is being used or 0 if it is not.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Mike Snitzer
9974fa2c6a dm table: add dm_table_run_md_queue_async
Introduce dm_table_run_md_queue_async() to run the request_queue of the
mapped_device associated with a request-based DM table.

Also add dm_md_get_queue() wrapper to extract the request_queue from a
mapped_device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Hannes Reinecke
17f4ff45b5 dm mpath: do not call pg_init when it is already running
This patch moves condition checks as a preparation of following
patches and has no effect on behaviour.
process_queued_ios() is the only caller of __pg_init_all_paths()
and 2 condition checks are moved from outside to inside without
side effects.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Monam Agarwal
9cdb852004 dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
Replace rcu_assign_pointer(p, NULL) with RCU_INIT_POINTER(p, NULL).

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.  And in the
case of the NULL pointer, there is no structure to initialize.  So,
rcu_assign_pointer(p, NULL) can be safely converted to
RCU_INIT_POINTER(p, NULL).

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:24 -04:00
Mikulas Patocka
bfc6d41cee dm: stop using bi_private
Device mapper uses the bio structure's bi_private field as a pointer
to dm_target_io or dm_rq_clone_bio_info.  But a bio structure is
embedded in the dm_target_io and dm_rq_clone_bio_info structures, so the
pointer to the structure that contains the bio can be found with the
container_of() macro.

Remove the use of bi_private and use container_of() instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:24 -04:00
Mikulas Patocka
d70ab4fb72 dm: remove dm_get_mapinfo
Remove dm_get_mapinfo() because no target uses it.  Targets can allocate
per-bio data using ti->per_bio_data_size, this is much more flexible
than union map_info.

Leave union map_info only for the request-based multipath target's use.
Also delete the unused "unsigned long long ll" field of union map_info.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:24 -04:00
Mikulas Patocka
473c36dfee dm: make dm_table_alloc_md_mempools static
Make the function dm_table_alloc_md_mempools static because it is not
called from another file.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Joe Thornber
5a32083d03 dm: take care to copy the space map roots before locking the superblock
In theory copying the space map root can fail, but in practice it never
does because we're careful to check what size buffer is needed.

But make certain we're able to copy the space map roots before
locking the superblock.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # drop dm-era and dm-cache changes as needed
2014-03-27 16:56:23 -04:00
Joe Thornber
a9d45396f5 dm transaction manager: fix corruption due to non-atomic transaction commit
The persistent-data library used by dm-thin, dm-cache, etc is
transactional.  If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.

Atomicity when committing a transaction is achieved by:

a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
   disk.

This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.

Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.

As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush().  Now
the unlocking must be done separately.

This issue was discovered by forcing io errors at the crucial time
using dm-flakey.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-27 16:56:23 -04:00
Heinz Mauelshagen
64ab346a36 dm cache: remove remainder of distinct discard block size
Discard block size not being equal to cache block size causes data
corruption by erroneously avoiding migrations in issue_copy() because
the discard state is being cleared for a group of cache blocks when it
should not.

Completely remove all code that enabled a distinction between the
cache block size and discard block size.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00