linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-26 01:45:38 +07:00

Author	SHA1	Message	Date
Joe Thornber	2572629a13	dm cache: fix some issues with the new discard range support Commit `7ae34e777` ("dm cache: improve discard support") needed to also: - discontinue having DM core split the discard bios on cache block boundaries - calculate the cache's discard_nr_blocks relative to the determined discard_block_size rather than using oblock_to_dblock() Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-12-01 11:30:09 -05:00
Joe Thornber	8001e87d0e	dm array: if resizing the array is a noop set the new root to the old one This could've been quite bad (to return success but not update the new root to point at the old) but in practice the only known consumer of the dm array code is the DM cache target. And the DM cache target passes in the same old root to array_resize() anyway. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-12-01 11:30:07 -05:00
Gu Zheng	18c0b223cf	md: use generic io stats accounting functions to simplify io stat accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:05:16 -07:00
Gu Zheng	aae4933da9	md/bcache: use generic io stats accounting functions to simplify io stat accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Kent Overstreet <kmo@datera.io> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:05:12 -07:00
Eric Dumazet	a12f5d48bd	dm: use rcu_dereference_protected instead of rcu_dereference rcu_dereference() should be used in sections protected by rcu_read_lock. For writers, holding some kind of mutex or lock, rcu_dereference_protected() is the way to go, adding explicit lockdep bits. In __unbind(), we are the last user of this mapped device, so can use the constant '1' instead of a lockdep_is_held(), not consistent with other uses of rcu_dereference_protected() which use md->suspend_lock mutex. Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `33423974bf` ("dm: Use rcu_dereference() for accessing rcu pointer") Cc: Pranith Kumar <bobby.prani@gmail.com> [snitzer: allow lines longer than 80 columns, refine subject] Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-23 20:32:45 -05:00
Mike Snitzer	d200c30ef0	dm thin: fix pool_io_hints to avoid looking at max_hw_sectors Simplify the pool_io_hints code that works to establish a max_sectors value that is a power-of-2 factor of the thin-pool's blocksize. The biggest associated improvement is that the DM thin-pool is no longer concerning itself with the data device's max_hw_sectors when adjusting max_sectors. This fixes the relative fragility of the original "dm thin: adjust max_sectors_kb based on thinp blocksize" commit that only became apparent when testing was performed using a DM thin-pool on top of a virtio_blk device. One proposed upstream patch detailed the problems inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611 So even though virtio_blk incorrectly set its max_hw_sectors it actually helped make it clear that we need DM thinp to be tolerant of any future Linux driver that incorrectly sets max_hw_sectors. We only need to be concerned with modifying the thin-pool device's max_sectors limit if it is smaller than the thin-pool's blocksize. In this case the value of max_sectors does become a limiting factor when upper layers (e.g. filesystems) construct their bios. But if the hardware can support IOs larger than the thin-pool's blocksize the user is encouraged to adjust the thin-pool's data device's max_sectors accordingly -- doing so will enable the thin-pool to inherit the established user-defined max_sectors. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-21 12:54:23 -05:00
Mike Snitzer	583024d248	dm thin: suspend/resume active thin devices when reloading thin-pool Before this change it was expected that userspace would first suspend all active thin devices, reload/resize the thin-pool target, then resume all active thin devices. Now the thin-pool suspend/resume will trigger the suspend/resume of all active thins via appropriate calls to dm_internal_suspend and dm_internal_resume. Store the mapped_device for each thin device in struct thin_c to make these calls possible. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-11-19 12:34:08 -05:00
Mike Snitzer	ffcc393641	dm: enhance internal suspend and resume interface Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast -- dm-stats will continue using these methods to avoid all the extra suspend/resume logic that is not needed in order to quickly flush IO. Introduce dm_internal_suspend_noflush() variant that actually calls the mapped_device's target callbacks -- otherwise target-specific hooks are avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend). Common code between dm_internal_{suspend_noflush,resume} and dm_{suspend,resume} was factored out as __dm_{suspend,resume}. Update dm_internal_{suspend_noflush,resume} to always take and release the mapped_device's suspend_lock. Also update dm_{suspend,resume} to be aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to be cleared. Add lockdep annotation to dm_suspend() and dm_resume(). The existing DM_SUSPEND_FLAG remains unchanged. DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and cleared by dm_internal_resume(). Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device was already suspended when dm_internal_suspend_noflush() was called -- this can be thought of as a "nested suspend". A "nested suspend" can occur with legacy userspace dm-thin code that might suspend all active thin volumes before suspending the pool for resize. But otherwise, in the normal dm-thin-pool suspend case moving forward: the thin-pool will have DM_SUSPEND_FLAG set and all active thins from that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set. Also add DM_INTERNAL_SUSPEND_FLAG to status report. This new DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with debugging (e.g. 'dmsetup info' will report an internally suspended device accordingly). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-11-19 12:31:17 -05:00
Mike Snitzer	80e96c5484	dm thin: do not allow thin device activation while pool is suspended Otherwise IO could be issued to the pool while it is suspended. Care was taken to properly interlock between the thin and thin-pool targets when accessing the pool's 'suspended' flag. The thin_ctr will not add a new thin device to the pool's active_thins list if the pool is suspended. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-11-19 11:25:36 -05:00
Mike Snitzer	d67ee213fa	dm: add presuspend_undo hook to target_type The DM thin-pool target now must undo the changes performed during pool_presuspend() so introduce presuspend_undo hook in target_type. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-11-19 11:24:59 -05:00
Mike Snitzer	4d341d8216	dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl No point checking if the device is suspended if the current target doesn't even implement .ioctl Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-19 11:24:56 -05:00
Linus Torvalds	0fbae13642	Merge tag 'md/3.18-fix' of git://neil.brown.name/md Pull md bugfix from Neil Brown: "One fix for md for 3.18. This fixes a regression introduced in 3.13" * tag 'md/3.18-fix' of git://neil.brown.name/md: md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN	2014-11-16 15:34:31 -08:00
NeilBrown	45eaf45dfa	md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN md_check_recovery will skip any recovery and also clear MD_RECOVERY_NEEDED if MD_RECOVERY_FROZEN is set. So when we clear _FROZEN, we must set _NEEDED and ensure that md_check_recovery gets run. Otherwise we could miss out on something that is needed. In particular, this can make it impossible to remove a failed device from an array if the 'recovery-needed' processing didn't happen. Suitable for stable kernels since 3.13. Cc: stable@vger.kernel.org (3.13+) Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com> Fixes: `30b8feb730` Signed-off-by: NeilBrown <neilb@suse.de>	2014-11-17 09:17:46 +11:00
Linus Torvalds	5a7a662cc6	Merge tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - stable fix for dm-thin that avoids normal IO racing with discard - stable fix for a dm-cache related bug in dm-btree walking code that results from using very large fast device (e.g. 4T) with a very small cache blocksize (e.g. 32K) -- this is a very uncommon configuration - a couple fixes for dm-raid (one for stable and the other addresses a crash in 3.18-rc1 code) - stable fix for dm-thinp that addresses a very rare dm-bufio bug having to do with memory reclamation (via shrinker) when using dm-thinp on top of loopback devices - fix a leak in dm-stripe target constructor's error path * tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm btree: fix a recursion depth bug in btree walking code; dm thin: grab a virtual cell before looking up the mapping; dm raid: fix inaccessible superblocks causing oops in configure_discard_support; dm raid: ensure superblock's size matches device's logical block size; dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks; dm stripe: fix potential for leak in stripe_ctr error path	2014-11-13 09:19:20 -08:00
Mike Snitzer	5ec02084f6	dm thin: remove stale 'trim' message in block comment above pool_message Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-12 20:15:05 -05:00
Mikulas Patocka	17181fb7a0	dm thin: fix a race in thin_dtr As long as struct thin_c is in the list, anyone can grab a reference of it. Consequently, we must wait for the reference count to drop to zero after we remove the structure from the list, not before. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-12 20:15:04 -05:00
Joe Thornber	d1d9220cba	dm cache: emit a warning message if there are a lot of cache blocks Loading and saving millions of block mappings takes time. We may as well explain what's going on, and encourage people to use a larger cache block size. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-12 20:14:59 -05:00
Joe Thornber	7ae34e7778	dm cache: improve discard support Safely allow the discard blocksize to be larger than the cache blocksize by using the bio prison's range locking support. This also improves discard performance considerably because larger discards are issued to the dm-cache device. The discard blocksize was always intended to be greater than the cache blocksize. But until now it wasn't implemented safely. Also, by safely restoring the ability to have discard blocksize larger than cache blocksize we're able to significantly reduce the memory used for the cache's discard bitset. Before, with a small discard blocksize, the discard bitset could get quite large because its size is a function of the discard blocksize and the origin device's size. For example, previously, using a 32KB cache blocksize with a 40TB origin resulted in 1280MB of incore memory use for the discard bitset! Now, the discard blocksize is scaled up accordingly to ensure the discard bitset is capped at 2**14 bits, or 16KB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	08b184514f	dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size" This reverts commit `d132cc6d9e` because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	1bad9bc4ee	dm cache: revert "remove remainder of distinct discard block size" This reverts commit `64ab346a36` because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	5f274d8865	dm bio prison: introduce support for locking ranges of blocks Ranges will be placed in the same cell if they overlap. Range locking is a prerequisite for more efficient multi-block discard support in both the cache and thin-provisioning targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Mike Snitzer	f1afb36a61	dm cache policy mq: simplify ability to promote sequential IO to the cache Before, if the user wanted sequential IO to be promoted to the cache they'd have to set sequential_threshold to some nebulous large value. Now, the user may easily disable sequential IO detection (and sequential IO's implicit bypass of the cache) by setting sequential_threshold to 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	b155aa0e5a	dm cache policy mq: tweak algorithm that decides when to promote a block Rather than maintaining a separate promote_threshold variable that we periodically update we now use the hit count of the oldest clean block. Also add a fudge factor to discourage demoting dirty blocks. With some tests this has a sizeable difference, because the old code was too eager to demote blocks. For example, device-mapper-test-suite's git_extract_cache_quick test goes from taking 190 seconds, to 142 (linear on spindle takes 250). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Hannes Reinecke	41abc4e1af	dm: do not call dm_sync_table() when creating new devices When creating new devices dm_sync_table() calls synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be flushed. This causes a latency overhead that is especially noticeable when creating lots of devices. And all of this is pointless as there are no old maps to be disconnected, and hence no stale pointers which would need to be cleared up. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Pranith Kumar	6fa9952097	dm: sparse: Annotate field with __rcu for checking Annotate the map field with __rcu since this is a rcu pointer which is checked by sparse. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Pranith Kumar	33423974bf	dm: Use rcu_dereference() for accessing rcu pointer The map field in 'struct mapped_device' is an rcu pointer. Use rcu_dereference() while accessing it. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Mike Snitzer	42d6a8ce3c	dm thin: refactor requeue_io to eliminate spinlock bouncing Also refactor some other bio_list erroring helpers. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Mike Snitzer	9d094eebd7	dm thin: optimize retry_bios_on_resume Eliminate redundant should_error_unserviceable_bio check and error loop. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:28 -05:00
Joe Thornber	ac4c3f34a9	dm thin: sort the deferred cells Sort the cells in logical block order before processing each cell in process_thin_deferred_cells(). This significantly improves the ondisk layout on rotational storage, thereby improving read performance. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:28 -05:00
Joe Thornber	23ca2bb6c6	dm thin: direct dispatch when breaking sharing This use of direct submission in process_shared_bio() reduces latency for submitting bios in the shared cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:28 -05:00
Joe Thornber	2d759a46b4	dm thin: remap the bios in a cell immediately This use of direct submission in process_prepared_mapping() reduces latency for submitting bios in a cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. But this direct submission exposes the potential for a race between releasing a cell and incrementing deferred set. Fix this by introducing dm_cell_visit_release() and refactoring inc_remap_and_issue_cell() accordingly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:28 -05:00
Joe Thornber	a374bb217b	dm thin: defer whole cells rather than individual bios This avoids dropping the cell, so increases the probability that other bios will collect within the cell, rather than being passed individually to the worker. Also add required process_cell and process_discard_cell error handling wrappers and set associated pool-mode function pointers accordingly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:28 -05:00
Mike Snitzer	452d7a620d	dm thin: factor out remap_and_issue_overwrite Purely cleanup of duplicated code, no functional change. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:28 -05:00
Joe Thornber	7a7e97ca58	dm thin: performance improvement to discard processing When processing a discard bio, if the block is already quiesced do the discard immediately rather than adding the mapping to a list for the next iteration of the worker thread. Discarding a fully provisioned 100G thin volume with 64k block size goes from 860s to 95s with this change. Clearly there's something wrong with the worker architecture, more investigation needed. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:27 -05:00
Mike Snitzer	36f12aeb71	dm thin: implement thin_merge Introduce thin_merge so that any additional constraints from the data volume may be taken into account when determining the maximum number of sectors that can be issued relative to the specified logical offset. This is particularly important if/when the data volume is layered on top of a more sophisticated device (e.g. dm-raid or some other DM target). Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:27 -05:00
Mike Snitzer	148e51baf8	dm: improve documentation and code clarity in dm_merge_bvec These code changes do not introduce a functional change. But bio_add_page() will never attempt to build up a bio larger than queue_max_sectors(). Similarly, bio_get_nr_vecs() is also bound by queue_max_sectors(). Therefore, there is no point in allowing dm_merge_bvec() to answer "how many sectors can a bio have at this offset?" with anything larger than queue_max_sectors(). Using queue_max_sectors() rather than BIO_MAX_SECTORS serves to more accurately convey the limits that are being imposed. Also, use unlikely() to clarify the fact that the defensive code in dm_merge_bvec() relative to max_size going negative shouldn't ever happen -- if it does happen there is a bug in the block layer for requesting larger than dm_merge_bvec()'s initial response for a given offset. Also, update a comment in dm_merge_bvec() relative to max_hw_sectors_kb. And fix empty newline whitespace. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:27 -05:00
Mike Snitzer	604ea90641	dm thin: adjust max_sectors_kb based on thinp blocksize Allows for filesystems to submit bios that are a factor of the thinp blocksize, improving dm-thinp efficiency (particularly when the data volume is RAID). Also set io_min to max_sectors_kb if it is a factor of the thinp blocksize. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:27 -05:00
Joe Thornber	7d327fe051	dm thin: throttle incoming IO Throttle IO based on the time it's taking the worker to do one loop. There were reports of hung task timeouts occurring and it was observed that the excessively long avgqu-sz (as reported by iostat) was contributing to these hung tasks. Throttling definitely helps dm-thinp perform better under heavy IO load (without being detrimental by being overzealous). It reduces avgqu-sz drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2. And avgqu-sz stays at or below 30K even with dirty_ratio=20, dirty_background_ratio=10. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:27 -05:00
Joe Thornber	8a01a6af75	dm thin: prefetch missing metadata pages Prefetch metadata at the start of the worker thread and then again every 128th bio processed from the deferred list. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:27 -05:00
Joe Thornber	4646015d7e	dm transaction manager: add support for prefetching blocks of metadata Introduce the dm_tm_issue_prefetches interface. If you're using a non-blocking clone the tm will build up a list of requested blocks that weren't in core. dm_tm_issue_prefetches will request those blocks to be prefetched. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:26 -05:00
Joe Thornber	e5cfc69a51	dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO This change is a prerequisite for allowing metadata to be prefetched. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:26 -05:00
Joe Thornber	a195db2d29	dm bio prison: switch to using a red black tree Previously it was using a fixed sized hash table. There are times when very many concurrent cells are held (such as when processing a very large discard). When this happens the hash table performance becomes very poor. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:26 -05:00
Joe Thornber	33096a7822	dm bufio: evict buffers that are past the max age but retain some buffers These changes help keep metadata backed by dm-bufio in-core longer which fixes reports of metadata churn in the face of heavy random IO workloads. Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS. Having a device (e.g. dm-thinp or dm-cache) lose all metadata just because associated buffers had been idle for some time is unfriendly. Now the user may configure the number of bytes that bufio retains using the 'retain_bytes' module parameter. The default is 256K. Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS defaults were quite low so increase them (to 30 and 300 respectively). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:26 -05:00
Joe Thornber	4e420c452b	dm bufio: switch from a huge hash table to an rbtree Converting over to using an rbtree eliminates a fixed 8MB allocation from vmalloc space for the hash table. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:26 -05:00
Joe Thornber	9b460d3699	dm btree: fix a recursion depth bug in btree walking code The walk code was using a 'ro_spine' to hold its locked btree nodes. But this data structure is designed for the rolling lock scheme, and as such automatically unlocks blocks that are two steps up the call chain. This is not suitable for the simple recursive walk algorithm, which retraces its steps. This code is only used by the persistent array code, which in turn is only used by dm-cache. In order to trigger it you need to have a mapping tree that is more than 2 levels deep; which equates to 8-16 million cache blocks. For instance a 4T ssd with a very small block size of 32k only just triggers this bug. The fix just places the locked blocks on the stack, and stops using the ro_spine altogether. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-11-10 15:23:58 -05:00
Joe Thornber	c822ed967c	dm thin: grab a virtual cell before looking up the mapping Avoids normal IO racing with discard. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-11-04 13:05:53 -05:00
Heinz Mauelshagen	d20c4b08be	dm raid: fix inaccessible superblocks causing oops in configure_discard_support Commit `48cf06bc5f` ("dm raid: add discard support for RAID levels 4, 5 and 6") did not properly handle missing metadata device(s). A failing read of the superblock causes the metadata and data devices to be removed from the dev array in struct raid_set, setting references to both devices to NULL. configure_discard_support() nonetheless tries to access the data dev unconditionally causing an oops. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-29 14:53:27 -04:00
Heinz Mauelshagen	40d43c4b4c	dm raid: ensure superblock's size matches device's logical block size The dm-raid superblock (struct dm_raid_superblock) is padded to 512 bytes and that size is being used to read it in from the metadata device into one preallocated page. Reading or writing this on a 512-byte sector device works fine but on a 4096-byte sector device this fails. Set the dm-raid superblock's size to the logical block size of the metadata device, because IO at that size is guaranteed to work. Also add a size check to avoid silent partial metadata loss in case the superblock should ever grow past the logical block size or PAGE_SIZE. [includes pointer math fix from Dan Carpenter] Reported-by: "Liuhua Wang" <lwang@suse.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-10-21 09:32:15 -04:00
Linus Torvalds	929254d8da	Merge tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device-mapper updates from Mike Snitzer: "I rebased the DM tree on top of linux-block.git's 'for-3.18/core' at the beginning of October because DM core now depends on the newly introduced bioset_create_nobvec() interface. Summary: - fix DM's long-standing excessive use of memory by leveraging the new bioset_create_nobvec() interface when creating the DM's bioset - fix a few bugs in dm-bufio and dm-log-userspace - add DM core support for a DM multipath use-case that requires loading DM tables that contain devices that have failed (by allowing active and inactive DM tables to share dm_devs) - add discard support to the DM raid target; like MD raid456 the user must opt-in to raid456 discard support by specifying the devices_handle_discard_safely=Y module param" * tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm log userspace: fix memory leak in dm_ulog_tfr_init failure path; dm bufio: when done scanning return from __scan immediately; dm bufio: update last_accessed when relinking a buffer; dm raid: add discard support for RAID levels 4, 5 and 6; dm raid: add discard support for RAID levels 1 and 10; dm: allow active and inactive tables to share dm_devs; dm mpath: stop queueing IO when no valid paths exist; dm: use bioset_create_nobvec(); dm: remove nr_iovecs parameter from alloc_tio()	2014-10-18 12:25:30 -07:00
Linus Torvalds	e75437fb93	Merge branch 'for-3.18/drivers' of git://git.kernel.dk/linux-block Pull block layer driver update from Jens Axboe: "This is the block driver pull request for 3.18. Not a lot in there this round, and nothing earth shattering. - A round of drbd fixes from the linbit team, and an improvement in asender performance. - Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and hd from Michael Opdenacker. - Disable entropy collection from flash devices by default, from Mike Snitzer. - A small collection of xen blkfront/back fixes from Roger Pau Monné and Vitaly Kuznetsov" * 'for-3.18/drivers' of git://git.kernel.dk/linux-block: block: disable entropy contributions for nonrot devices; xen, blkfront: factor out flush-related checks from do_blkif_request(); xen-blkback: fix leak on grant map error path; xen/blkback: unmap all persistent grants when frontend gets disconnected; rsxx: Remove deprecated IRQF_DISABLED; block: hd: remove deprecated IRQF_DISABLED; drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks; drbd: compute the end before rb_insert_augmented(); drbd: Add missing newline in resync progress display in /proc/drbd; drbd: reduce lock contention in drbd_worker; drbd: Improve asender performance; drbd: Get rid of the WORK_PENDING macro; drbd: Get rid of the __no_warn and __cond_lock macros; drbd: Avoid inconsistent locking warning; drbd: Remove superfluous newline from "resync_extents" debugfs entry.; drbd: Use consistent names for all the bi_end_io callbacks; drbd: Use better variable names	2014-10-18 12:12:45 -07:00
Linus Torvalds	88ed806abb	md updates for 3.18 - a few minor bug fixes - quite a lot of code tidy-up and simplification - remove PRINT_RAID_DEBUG ioctl. I'm fairly sure it is unused, and it isn't particularly useful. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIVAwUAVD9k1jnsnt1WYoG5AQKCaw/9F7jE9PDYtEfJ8ShEWMM0CWNsCKmgqfpV i4RaeVKe1IoA5JOurYk+wvdbWSXGfz5XQ9GX8ptRl9X7ZoG4aJ65v9GHpBamPLrc mB2Lz3zR9AZVYrMDeZym9cSZ6FpZNzzXpJE2O2RslXq3gFI03MObyM8xyeh8ybOD 45nhH+CJ17OFNn5OzzFLhYEAoYDeOS97zAwInWeFlUp14Jl403xnZ3srF2YJ78TR PjcCpxo1IhGEnYE8rDYqH/UjDPzEfAdYrqM5k3NEnuPiqn+KxYNSsbAQGdeMrGUc DO0H8dt6U1U2tq/t/qN8n01uQ7AJ3S3JrTsQxSW/UC1SVfgpztK/a78eA/YSy/zs iZzPP7CpLfF4T945jaQionevZOBFRM+gbrMqeoQTPO2QtfrSGe4Awoht7Z+no3RR dCX0ScO16kHkcAcSXXGZkGtC1DcteEwUfufSdako12exo1k3efc98DsyMw2VfzOM EJcQD1JGYVW+czZM58EEue92TT5jvWnhU5s3PEUMTZrDgSWwTVQC3oNCgDGFKI4X eebpWlG3gEjNrnL5givbBwC2LCfI59R70gpnGhavdKtt9AtpfsMJnj8E3cCqHE9I xR6YPF161KSmKGOG47RK/VJJnq5SmZbxxeShT101uq3SVeqImit6ql3JfAM9HoMD RI2iWG9yma4= =2QEJ -----END PGP SIGNATURE----- Merge tag 'md/3.18' of git://neil.brown.name/md Pull md updates from Neil Brown: - a few minor bug fixes - quite a lot of code tidy-up and simplification - remove PRINT_RAID_DEBUG ioctl. I'm fairly sure it is unused, and it isn't particularly useful. * tag 'md/3.18' of git://neil.brown.name/md: (21 commits) lib/raid6: Add log level to printks md: move EXPORT_SYMBOL to after function in md.c md: discard PRINT_RAID_DEBUG ioctl md: remove MD_BUG() md: clean up 'exit' labels in md_ioctl(). md: remove unnecessary test for MD_MAJOR in md_ioctl() md: don't allow "-sync" to be set for device in an active array. md: remove unwanted white space from md.c md: don't start resync thread directly from md thread. md: Just use RCU when checking for overlap between arrays. 
md: avoid potential long delay under pers_lock md: simplify export_array() md: discard find_rdev_nr in favour of find_rdev_nr_rcu md: use wait_event() to simplify md_super_wait() md: be more relaxed about stopping an array which isn't started. md/raid1: process_checks doesn't use its return value. md/raid5: fix init_stripe() inconsistencies md/raid10: another memory leak due to reshape. md: use set_bit/clear_bit instead of shift/mask for bi_flags changes. md/raid1: minor typos and reformatting. ...	2014-10-18 11:39:52 -07:00
Mikulas Patocka	9d28eb1244	dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks The shrinker uses gfp flags to indicate what kind of operation can the driver wait for. If __GFP_IO flag is present, the driver can wait for block I/O operations, if __GFP_FS flag is present, the driver can wait on operations involving the filesystem. dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block device that makes calls into the filesystem. If __GFP_IO is present and __GFP_FS isn't, dm-bufio could still block on filesystem operations if it runs on a loop block device. The change from __GFP_IO to __GFP_FS supposedly fixes one observed (though unreproducible) deadlock involving dm-bufio and loop device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-10-17 01:40:23 -04:00
Linus Torvalds	0429fbc0bd	Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. 
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...	2014-10-15 07:48:18 +02:00
Jan-Simon Möller	b610626523	crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99 compliant equivalent. This patch allocates the appropriate amount of memory using a char array using the SHASH_DESC_ON_STACK macro. The new code can be compiled with both gcc and clang. Signed-off-by: Jan-Simon Möller <dl9pf@gmx.de> Signed-off-by: Behan Webster <behanw@converseincode.com> Reviewed-by: Mark Charlebois <charlebm@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: pageexec@freemail.hu Cc: gmazyland@gmail.com Cc: "David S. Miller" <davem@davemloft.net>	2014-10-14 10:51:23 +02:00
NeilBrown	6c144d3164	md: move EXPORT_SYMBOL to after function in md.c Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	2cbbca5e7c	md: discard PRINT_RAID_DEBUG ioctl All the interesting information printed by this ioctl is provided in /proc/mdstat and/or sysfs. So it isn't needed and isn't used and would be best if it didn't exist. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	403df47888	md: remove MD_BUG() Most of the places that call this are doing so pointlessly. A couple of the others are best replaced with WARN_ON(). Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	3adc28d85f	md: clean up 'exit' labels in md_ioctl(). There are 4 labels and we only really need two. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	326eb17d73	md: remove unnecessary test for MD_MAJOR in md_ioctl() unknown ioctls no longer get this deep into md_ioctl since md_ioctl_valid() was introduced in 3.14. So remove the test and the misleading comment. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	e1960f8c5c	md: don't allow "-sync" to be set for device in an active array. If an array is active, devices can be marked 'faulty', but simply removing the 'sync' flag is wrong. That only makes sense for an array which is not active (and is probably only useful for testing anyway). Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	f72ffdd686	md: remove unwanted white space from md.c My editor shows much of this is RED. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	ac05f25669	md: don't start resync thread directly from md thread. The main 'md' thread is needed for processing writes, so if it blocks write requests could be delayed. Starting a new thread requires some GFP_KERNEL allocations and so can wait for writes to complete. This can deadlock. So instead, ask a workqueue to start the sync thread. There is no particular rush for this to happen, so any work queue will do. MD_RECOVERY_RUNNING is used to ensure only one thread is started. Reported-by: BillStuff <billstuff2001@sbcglobal.net> Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	8b1afc3d67	md: Just use RCU when checking for overlap between arrays. We don't really need the full mddev_lock here, and having to drop it is messy. RCU is enough to protect these lists. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
Chao Yu	50bd377405	md: avoid potential long delay under pers_lock printk may cause long time lapse if value of printk_delay in sysctl is configured large by user. If register_md_personality takes long time to print in spinlock pers_lock, we may encounter high CPU usage rate when there are other pers_lock competitors who may be blocked to spin. We can avoid this condition by moving printk out of coverage of pers_lock spinlock. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	0638bb0e73	md: simplify export_array() We don't really need that for_each loop, or those MD_BUGs. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	4878e9eb88	md: discard find_rdev_nr in favour of find_rdev_nr_rcu Having both is a waste - just use the one. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	1967cd5616	md: use wait_event() to simplify md_super_wait() md_super_wait is really just wait_event() open-coded. So use the macro instead. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	9ba3b7f5d0	md: be more relaxed about stopping an array which isn't started. In general we don't allow an array to be stopped if it is in use. However if the array hasn't really been started yet, then any apparent use is an anomaly, probably due to 'udev' or similar having a look to see what is there. This means that if something goes wrong while assembling an array it cannot reliably be un-assembled - STOP_ARRAY could fail. There is no value here, so change do_md_stop() to succeed despite concurrent opens if the array has not yet been activated. i.e. if ->pers is NULL. Reported-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	c95e6385e8	md/raid1: process_checks doesn't use its return value. process_checks() always returns '0', so change it to 'void'. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
Markus Stockhausen	b8e6a15a1a	md/raid5: fix init_stripe() inconsistencies raid5: fix init_stripe() inconsistencies 1) remove_hash() is not necessary. We will only be called right after get_free_stripe(). There we have already a call to remove_hash(). 2) Tracing prints out the sector of the freed stripe and not the sector that we want to initialize. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	c4796e215f	md/raid10: another memory leak due to reshape. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
Linus Torvalds	faafcba3b5	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave Hansen) - Various sched/idle refinements for better idle handling (Nicolas Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot) - sched/numa updates and optimizations (Rik van Riel) - sysbench speedup (Vincent Guittot) - capacity calculation cleanups/refactoring (Vincent Guittot) - Various cleanups to thread group iteration (Oleg Nesterov) - Double-rq-lock removal optimization and various refactorings (Kirill Tkhai) - various sched/deadline fixes ... and lots of other changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/dl: Use dl_bw_of() under rcu_read_lock_sched() sched/fair: Delete resched_cpu() from idle_balance() sched, time: Fix build error with 64 bit cputime_t on 32 bit systems sched: Improve sysbench performance by fixing spurious active migration sched/x86: Fix up typo in topology detection x86, sched: Add new topology for multi-NUMA-node CPUs sched/rt: Use resched_curr() in task_tick_rt() sched: Use rq->rd in sched_setaffinity() under RCU read lock sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask' sched: Use dl_bw_of() under RCU read lock sched/fair: Remove duplicate code from can_migrate_task() sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW sched: print_rq(): Don't use tasklist_lock sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock() sched: Fix the task-group check in tg_has_rt_tasks() sched/fair: Leverage the idle state info when choosing the "idlest" cpu sched: Let the scheduler see CPU idle states sched/deadline: Fix inter- exclusive cpusets migrations sched/deadline: Clear dl_entity params when setscheduling to different class sched/numa: Kill the wrong/dead TASK_DEAD check in 
task_numa_fault() ...	2014-10-13 16:23:15 +02:00
Pavitra Kumar	a3f2af2547	dm stripe: fix potential for leak in stripe_ctr error path Fix a potential struct stripe_c leak that would occur if the chunk_size exceeded the maximum allowed by dm_set_target_max_io_len (UINT_MAX). However, in practice there is no possibility of this occurring given that chunk_size is of type uint32_t. But it is good to fix this to future-proof in case dm_set_target_max_io_len's implementation were to change. Signed-off-by: Pavitra Kumar <pavitrak@nvidia.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-10 22:05:18 -04:00
NeilBrown	3fd83717e4	md: use set_bit/clear_bit instead of shift/mask for bi_flags changes. Using {set,clear}_bit is more consistent than shifting and masking. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-09 10:07:04 +11:00
NeilBrown	5965b642ff	md/raid1: minor typos and reformatting. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-09 10:07:04 +11:00
NeilBrown	4b5060ddae	md/bitmap: always wait for writes on unplug. If two threads call bitmap_unplug at the same time, then one might schedule all the writes, and the other might decide that it doesn't need to wait. But really it does. It rarely hurts to wait when it isn't absolutely necessary, and the current code doesn't really focus on 'absolutely necessary' anyway. So just wait always. This can potentially lead to data corruption if a crash happens at an awkward time and data was written before the bitmap was updated. It is very unlikely, but this should go to -stable just to be safe. Appropriate for any -stable. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org (please delay until 3.18 is released)	2014-10-09 10:07:04 +11:00
Alexey Khoroshilov	56ec16cb1e	dm log userspace: fix memory leak in dm_ulog_tfr_init failure path If cn_add_callback() fails in dm_ulog_tfr_init(), it does not deallocate prealloced memory but calls cn_del_callback(). Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-10-05 20:03:38 -04:00
Mikulas Patocka	0e825862f3	dm bufio: when done scanning return from __scan immediately When __scan frees the required number of buffer entries that the shrinker requested (nr_to_scan becomes zero) it must return. Before this fix the __scan code exited only the inner loop and continued in the outer loop -- which could result in reduced performance due to extra buffers being freed (e.g. unnecessarily evicted thinp metadata needing to be synchronously re-read into bufio's cache). Also, move dm_bufio_cond_resched to __scan's inner loop, so that iterating the bufio client's lru lists doesn't result in scheduling latency. Reported-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.2+	2014-10-05 20:03:37 -04:00
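The difference between leaving only the inner loop and returning from `__scan` can be seen in a small user-space model. This is a sketch, not the kernel code: the names `scan_buggy` and `scan_fixed` are hypothetical, the LRU lists are plain Python lists, and the model assumes the countdown is only tested for equality with zero, so once it has passed zero the test never fires again. The relocation of `dm_bufio_cond_resched` is not modelled.

```python
def scan_buggy(lru_lists, nr_to_scan):
    """Modelled pre-fix behaviour: reaching zero leaves only the inner loop."""
    freed = []
    for lru in lru_lists:
        while lru:
            freed.append(lru.pop())   # evict from the tail of this list
            nr_to_scan -= 1
            if nr_to_scan == 0:
                break                 # the outer loop still visits the next list
    return freed


def scan_fixed(lru_lists, nr_to_scan):
    """Modelled post-fix behaviour: reaching zero ends the whole scan."""
    freed = []
    for lru in lru_lists:
        while lru:
            freed.append(lru.pop())
            nr_to_scan -= 1
            if nr_to_scan == 0:
                return freed          # stop as soon as the request is satisfied
    return freed


# Two hypothetical LRU lists with three buffers each; the shrinker asks for 2.
buggy_freed = scan_buggy([["c1", "c2", "c3"], ["d1", "d2", "d3"]], nr_to_scan=2)
fixed_freed = scan_fixed([["c1", "c2", "c3"], ["d1", "d2", "d3"]], nr_to_scan=2)
print(len(buggy_freed), len(fixed_freed))
```

In this model the buggy variant frees five buffers for a request of two, because the second list is drained completely, while the fixed variant frees exactly two.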
Joe Thornber	eb76faf53b	dm bufio: update last_accessed when relinking a buffer The 'last_accessed' member of the dm_buffer structure was only set when the buffer was created. This led to each buffer being discarded after dm_bufio_max_age time even if it was used recently. In practice this resulted in all thinp metadata being evicted soon after being read -- this is particularly problematic for metadata intensive workloads like multithreaded small random IO. 'last_accessed' is now updated each time the buffer is moved to the head of the LRU list, so the buffer is now properly discarded if it was not used in dm_bufio_max_age time. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.2+	2014-10-05 20:03:37 -04:00
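The effect of refreshing `last_accessed` on relink can be illustrated with a toy age-based cache. This is a minimal sketch under stated assumptions: the `Buffer` and `Cache` classes, the `touch_on_relink` switch and the timestamps are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class Buffer:
    def __init__(self, block, now):
        self.block = block
        self.last_accessed = now      # set at creation in both variants


class Cache:
    def __init__(self, max_age, touch_on_relink):
        self.max_age = max_age
        self.touch_on_relink = touch_on_relink
        self.lru = []                 # index 0 is the head (most recently used)

    def new(self, block, now):
        buf = Buffer(block, now)
        self.lru.insert(0, buf)
        return buf

    def relink(self, buf, now):
        # Move the buffer to the head of the LRU list on use.
        self.lru.remove(buf)
        self.lru.insert(0, buf)
        if self.touch_on_relink:
            buf.last_accessed = now   # the behaviour the commit adds

    def evict_old(self, now):
        # Drop every buffer whose age exceeds max_age.
        self.lru = [b for b in self.lru if now - b.last_accessed <= self.max_age]


def survives(touch_on_relink):
    cache = Cache(max_age=300, touch_on_relink=touch_on_relink)
    buf = cache.new(block=7, now=0)
    cache.relink(buf, now=290)        # buffer is used shortly before the deadline
    cache.evict_old(now=310)
    return buf in cache.lru


print(survives(touch_on_relink=False), survives(touch_on_relink=True))
```

Without the update, the buffer used at time 290 is still evicted at time 310 because its age is measured from creation; with the update it stays cached.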
Heinz Mauelshagen	48cf06bc5f	dm raid: add discard support for RAID levels 4, 5 and 6 In case of RAID levels 4, 5 and 6 we have to verify each RAID member's ability to zero data on discards to avoid stripe data corruption -- if discard_zeroes_data is not set for each RAID member discard support must be disabled. But given the uncertainty of whether or not a RAID member properly supports zeroing data on discard we require the user to explicitly allow discard support on RAID levels 4, 5, and 6 by setting a dm-raid module parameter, e.g.: dm-raid.devices_handle_discard_safely=Y Otherwise, discards could cause data corruption on RAID4/5/6. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:36 -04:00
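The opt-in rule described in this message can be written as a small predicate. This is only a sketch of the stated policy, not the dm-raid implementation: the function name and its arguments are hypothetical, and it deliberately covers only RAID levels 4, 5 and 6, since that is all the message specifies.

```python
PARITY_LEVELS = (4, 5, 6)


def raid456_discard_enabled(raid_level, members_discard_zeroes_data,
                            devices_handle_discard_safely=False):
    """Return True if discards may be enabled for a parity RAID set.

    members_discard_zeroes_data: one boolean per RAID member, True when
    that member reports discard_zeroes_data.
    devices_handle_discard_safely: models the module parameter of the
    same name (off unless the user opts in).
    """
    if raid_level not in PARITY_LEVELS:
        raise ValueError("this sketch only models RAID levels 4, 5 and 6")
    # Both conditions are required: explicit opt-in by the user, and every
    # member claiming to return zeroes after a discard.
    return devices_handle_discard_safely and all(members_discard_zeroes_data)


print(raid456_discard_enabled(5, [True, True, True]))
print(raid456_discard_enabled(5, [True, True, True], devices_handle_discard_safely=True))
print(raid456_discard_enabled(6, [True, False, True, True], devices_handle_discard_safely=True))
```

Only the second call yields True: the first lacks the opt-in and the third has a member that does not report `discard_zeroes_data`.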
Heinz Mauelshagen	75b8e04bbf	dm raid: add discard support for RAID levels 1 and 10 Discard support is not enabled for RAID levels 4, 5, and 6 at this time due to concerns about unreliable discard_zeroes_data support on some hardware. Otherwise, discards could cause stripe data corruption (classic example of bad apples spoiling the bunch). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:36 -04:00
Benjamin Marzinski	86f1152b11	dm: allow active and inactive tables to share dm_devs Until this change, when loading a new DM table, DM core would re-open all of the devices in the DM table. Now, DM core will avoid redundant device opens (and closes when destroying the old table) if the old table already has a device open using the same mode. This is achieved by managing reference counts on the table_devices that DM core now stores in the mapped_device structure (rather than in the dm_table structure). So a mapped_device's active and inactive dm_tables' dm_dev lists now just point to the dm_devs stored in the mapped_device's table_devices list. This improvement in DM core's device reference counting has the side-effect of fixing a long-standing limitation of the multipath target: a DM multipath table couldn't include any paths that were unusable (failed). For example: if all paths have failed and you add a new, working, path to the table; you can't use it since the table load would fail due to it still containing failed paths. Now a re-load of a multipath table can include failed devices and when those devices become active again they can be used instantly. The device list code in dm.c isn't a straight copy/paste from the code in dm-table.c, but it's very close (aside from some variable renames). One subtle difference is that find_table_device for the tables_devices list will only match devices with the same name and mode. This is because we don't want to upgrade a device's mode in the active table when an inactive table is loaded. Access to the mapped_device structure's tables_devices list requires a mutex (tables_devices_lock), so that tables cannot be created and destroyed concurrently. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:35 -04:00
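The sharing scheme described above, one reference-counted entry per (name, mode) pair held by the mapped device, can be sketched as follows. This is a simplified model, not the code in dm.c: the class and method names are hypothetical, "opening" a device is reduced to incrementing a counter, and the mutex that serialises access to the list is omitted.

```python
class MappedDevice:
    def __init__(self):
        self.table_devices = {}   # (name, mode) -> reference count
        self.opens = 0            # how many real device opens took place

    def get_table_device(self, name, mode):
        key = (name, mode)        # match on name AND mode, never upgrade a mode
        if key not in self.table_devices:
            self.opens += 1       # stands in for actually opening the device
            self.table_devices[key] = 0
        self.table_devices[key] += 1
        return key

    def put_table_device(self, key):
        self.table_devices[key] -= 1
        if self.table_devices[key] == 0:
            del self.table_devices[key]   # last user gone: device is closed


md = MappedDevice()
active = md.get_table_device("sda", "rw")      # active table opens the device
inactive = md.get_table_device("sda", "rw")    # inactive table reuses the open
readonly = md.get_table_device("sda", "ro")    # different mode: separate entry
print(md.opens, md.table_devices)

md.put_table_device(active)                    # old table destroyed
print(md.table_devices)
```

Loading the second table does not reopen `sda`, which is why a table containing a currently failed path can still be loaded in this model as long as an existing table already holds the device open in the same mode.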
Benjamin Marzinski	1f27197247	dm mpath: stop queueing IO when no valid paths exist 'queue_io' is set so that IO is queued while paths are being initialized. Clear queue_io in __choose_pgpath if there are no valid paths, since there are obviously no paths that can be initialized. Otherwise IOs to the device will back up. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:35 -04:00
Junichi Nomura	3d8aab2d2c	dm: use bioset_create_nobvec() Since DM core uses bio_clone_fast() for both bio-based and request-based DM devices there is no need for DM's bioset to have a bvec mempool. With this patch, on arch with 4KB page for example, memory usage will be reduced by 64KB for each bio-based DM device and 1MB for each request-based DM device. For example, when you create 10,000 bio-based DM devices and 1,000 request-based DM devices, memory usage of biovec under no load is: # grep biovec /proc/slabinfo biovec-256 418068 418068 4096 ... biovec-128 0 0 2048 ... biovec-64 0 0 1024 ... biovec-16 0 0 256 ... With this patch series applied, the usage becomes: # grep biovec /proc/slabinfo biovec-256 116 116 4096 ... biovec-128 0 0 2048 ... biovec-64 0 0 1024 ... biovec-16 0 0 256 ... So 4096 * (418068 - 116) = 1.6GB of memory is saved in this example. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:34 -04:00
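The saving quoted in this message can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check of the figures in the commit message (4KB page size); the variable names are illustrative.

```python
# Figures taken from the commit message.
BIOVEC_256_OBJ_SIZE = 4096      # bytes per biovec-256 slab object
objs_before = 418068            # biovec-256 objects before the patch
objs_after = 116                # biovec-256 objects after the patch

saved_bytes = BIOVEC_256_OBJ_SIZE * (objs_before - objs_after)
saved_gib = saved_bytes / 2**30

# Cross-check against the per-device figures: 64KB per bio-based device
# and 1MB per request-based device, for 10,000 and 1,000 devices.
estimate_bytes = 10_000 * 64 * 1024 + 1_000 * 1024 * 1024
estimate_gib = estimate_bytes / 2**30

print(f"slabinfo difference: {saved_bytes} bytes ({saved_gib:.2f} GiB)")
print(f"per-device estimate: {estimate_bytes} bytes ({estimate_gib:.2f} GiB)")
```

Both figures come out at roughly 1.6 GiB, so the slabinfo measurement and the per-device estimate agree.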
Junichi Nomura	997782735c	dm: remove nr_iovecs parameter from alloc_tio() alloc_tio() uses bio_alloc_bioset() to allocate a clone-bio for a bio. alloc_tio() takes the number of bvecs to allocate for the clone-bio. However, with v3.14's immutable biovec changes DM now uses __bio_clone_fast() and no longer needs to allocate bvecs. In practice, the 'nr_iovecs' passed to alloc_tio() is always effectively 0. __clone_and_map_simple_bio() looked like it was passing non-zero nr_iovecs, but its value was always within the range of inline bvecs and no allocation actually happened. If allocation happened, the BUG_ON() in __bio_clone_fast() would've triggered. Remove the nr_iovecs parameter from alloc_tio() to prevent possible future bio_alloc_bioset() mis-use of a new bioset interface that will no longer allow bvecs to be allocated. Also fix extra whitespace before the __bio_clone_fast() call in __clone_and_map_simple_bio(). Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:34 -04:00
Mike Snitzer	b277da0a8a	block: disable entropy contributions for nonrot devices Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set QUEUE_FLAG_NONROT. Historically, all block devices have automatically made entropy contributions. But as previously stated in commit `e2e1a148` ("block: add sysfs knob for turning off disk entropy contributions"): - On SSD disks, the completion times aren't as random as they are for rotational drives. So it's questionable whether they should contribute to the random pool in the first place. - Calling add_disk_randomness() has a lot of overhead. There are more reliable sources for randomness than non-rotational block devices. From a security perspective it is better to err on the side of caution than to allow entropy contributions from unreliable "random" sources. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-04 10:55:32 -06:00
NeilBrown	8e0e99ba64	md/raid5: disable 'DISCARD' by default due to safety concerns. It has come to my attention (thanks Martin) that 'discard_zeroes_data' is only a hint. Some devices in some cases don't do what it says on the label. The use of DISCARD in RAID5 depends on reads from discarded regions being predictably zero. If a write to a previously discarded region performs a read-modify-write cycle it assumes that the parity block was consistent with the data blocks. If all were zero, this would be the case. If some are and some aren't this would not be the case. This could lead to data corruption after a device failure when data needs to be reconstructed from the parity. As we cannot trust 'discard_zeroes_data', ignore it by default and so disallow DISCARD on all raid4/5/6 arrays. As many devices are trustworthy, and as there are benefits to using DISCARD, add a module parameter to over-ride this caution and cause DISCARD to work if discard_zeroes_data is set. If a site want to enable DISCARD on some arrays but not on others they should select DISCARD support at the filesystem level, and set the raid456 module parameter. raid456.devices_handle_discard_safely=Y As this is a data-safety issue, I believe this patch is suitable for -stable. DISCARD support for RAID456 was added in 3.7 Cc: Shaohua Li <shli@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org (3.7+) Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Fixes: `620125f2bf` Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-02 13:45:00 +10:00
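The read-modify-write hazard described in this message can be worked through with single-byte "blocks" and XOR parity. This is an illustrative sketch only: it assumes a RAID5-style stripe with two data blocks and one parity block, hypothetical helper names, and a parity device that may or may not return zeroes after a discard.

```python
def rmw_parity(old_parity, old_data, new_data):
    # Read-modify-write: remove the old data from the parity, add the new data.
    return old_parity ^ old_data ^ new_data


def reconstruct(parity, surviving_data):
    # Rebuild the missing data block from parity and the surviving block.
    return parity ^ surviving_data


def write_then_lose_d0(parity_after_discard):
    # After the discard both data devices read back as zero.
    d0, d1 = 0x00, 0x00
    parity = parity_after_discard

    new_d0 = 0xAB
    parity = rmw_parity(parity, d0, new_d0)   # assumes parity was consistent
    d0 = new_d0

    # The device holding d0 now fails; d0 must be rebuilt from parity and d1.
    return reconstruct(parity, d1), new_d0


rebuilt_ok, written = write_then_lose_d0(parity_after_discard=0x00)
rebuilt_bad, _ = write_then_lose_d0(parity_after_discard=0x5C)
print(hex(written), hex(rebuilt_ok), hex(rebuilt_bad))
```

When the parity device really returns zero after the discard, the rebuilt block matches what was written. When it returns stale contents (0x5C in this sketch), the rebuilt block differs from the written data, which is the corruption the message warns about.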
Linus Torvalds	a90e41e228	Bugfixes for md/raid1 particularly, but not only, fixing new "resync" code. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIVAwUAVCIoAznsnt1WYoG5AQJDzRAAtciwFilYXxu8M7fPOQ/HZeoMtNLVX0dK cvL5yRhNxfoGLIG7TEeb5Wvd8cxHNR5t4x+jGmipJ7cTGE4S6Edgdpy2yhHDFBdo AyGgYreX441P07cefPUUa9nTVFlqx2TzJa+SR75CmBwbuZpx52kfHK9KMXWljY+Q Hm60k34tK4zzC5Tm2E7aeegFjUaIAwrpt3TOJlh8E/JiEQDsVz2+o+7RFwPXrXgm YnxfpaAcw5XcanlUj0q6r6O86hhItO54sBBcTtTNZtD7oZC82/OYj6SxlG0V3D2a wBFouI518Rf0TmdtG3XgPAfI0eCZyowZtYmpoYX+/8rkGSy2ZmJfxSY2NzmGBmX4 LtH0tYkp2qSu6WCXUMPOLmPRqQuT6iX4ho7KCNMr2n05kHMom/InNUajWUvqPFdE eBs27u9HngTVCTMpwdCfFV/qWXszEhpp9wyzAv5zRV7gyc3hZM3cQ1iV2GKor8Ka wSTeDT+gY9J2sCJgqx7li45jpsZPzayupwW+hBvieKeY6/fM1leur4Ji/mcRXytK YUci6fiy2kwxs1uzFq7Kra3Y5gqGq+S6HCspmZTtstzFxbKcMTmOC1B2ukKDPvGS HwXnQ6w+fXmF/+fXWD98++ET80rWj6utXBJhSGhkdQcyYRz5DU/2GsLsA4yvho4N Dbo2gIjTtD8= =gMsu -----END PGP SIGNATURE----- Merge tag 'md/3.17-more-fixes' of git://git.neil.brown.name/md Pull bugfixes for md/raid1 from Neil Brown: "It is amazing how much easier it is to find bugs when you know one is there. Two bug reports resulted in finding 7 bugs! All are tagged for -stable. Those that can't cause (rare) data corruption, cause lockups. Particularly, but not only, fixing new "resync" code" * tag 'md/3.17-more-fixes' of git://git.neil.brown.name/md: md/raid1: fix_read_error should act on all non-faulty devices. md/raid1: count resync requests in nr_pending. md/raid1: update next_resync under resync_lock. md/raid1: Don't use next_resync to determine how far resync has progressed md/raid1: make sure resync waits for conflicting writes to complete. md/raid1: clean up request counts properly in close_sync() md/raid1: be more cautious where we read-balance during resync. md/raid1: intialise start_next_window for READ case to avoid hang	2014-09-24 08:53:33 -07:00
NeilBrown	b8cb6b4c12	md/raid1: fix_read_error should act on all non-faulty devices. If a device is being recovered it is not InSync and is not Faulty. If a read error is experienced on that device, fix_read_error() will be called, but it ignores non-InSync devices. So it will neither fix the error nor fail the device. It is incorrect that fix_read_error() ignores non-InSync devices. It should only ignore Faulty devices. So fix it. This became a bug when we allowed reading from a device that was being recovered. It is suitable for any subsequent -stable kernel. Fixes: `da8840a747` Cc: stable@vger.kernel.org (v3.5+) Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 11:26:01 +10:00
NeilBrown	34e97f1701	md/raid1: count resync requests in nr_pending. Both normal IO and resync IO can be retried with reschedule_retry() and so be counted into ->nr_queued, but only normal IO gets counted in ->nr_pending. Before the recent improvement to RAID1 resync there could only possibly have been one or the other on the queue. When handling a read failure it could only be normal IO. So when handle_read_error() called freeze_array() the fact that freeze_array only compares ->nr_queued against ->nr_pending was safe. But now that these two types can interleave, we can have both normal and resync IO requests queued, so we need to count them both in nr_pending. This error can lead to freeze_array() hanging if there is a read error, so it is suitable for -stable. Fixes: `79ef3a8aa1` cc: stable@vger.kernel.org (v3.13+) Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 11:26:01 +10:00
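The hang described in this message follows from comparing two counters that do not count the same set of requests. The sketch below is a simplified model, not the raid1 code: the helper names are hypothetical, every outstanding request is assumed to be queued for retry, and the wait condition is assumed to be a plain equality test between the two counters.

```python
def counters(normal_queued, resync_queued, count_resync_in_pending):
    # nr_queued counts every request queued for retry, normal or resync.
    nr_queued = normal_queued + resync_queued
    # nr_pending counted only normal IO before the fix.
    nr_pending = normal_queued
    if count_resync_in_pending:
        nr_pending += resync_queued
    return nr_pending, nr_queued


def freeze_can_proceed(nr_pending, nr_queued):
    # Modelled wait condition: all pending requests are sitting on the queue.
    return nr_pending == nr_queued


before = freeze_can_proceed(*counters(1, 1, count_resync_in_pending=False))
after = freeze_can_proceed(*counters(1, 1, count_resync_in_pending=True))
print(before, after)
```

With one normal and one resync request queued, the pre-fix counters are 1 and 2, so the modelled condition can never become true; counting resync requests in `nr_pending` makes them equal again.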
NeilBrown	c2fd4c94de	md/raid1: update next_resync under resync_lock. raise_barrier() uses next_resync as part of its calculations, so it really should be updated first, instead of afterwards. next_resync is always used under resync_lock so update it under resync_lock too, just before it is used. That is safest. This could cause normal IO and resync IO to interact badly so it is suitable for -stable. Fixes: `79ef3a8aa1` cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 11:26:01 +10:00
NeilBrown	235549605e	md/raid1: Don't use next_resync to determine how far resync has progressed next_resync is (approximately) the location for the next resync request. However it does not reliably determine the earliest location at which resync might be happening. This is because resync requests can complete out of order, and we only limit the number of current requests, not the distance from the earliest pending request to the latest. mddev->curr_resync_completed is a reliable indicator of the earliest position at which resync could be happening. It is updated less frequently, but is actually reliable which is more important. So use it to determine if a write request is before the region being resynced and so safe from conflict. This error can allow resync IO to interfere with normal IO which could lead to data corruption. Hence: stable. Fixes: `79ef3a8aa1` cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 11:26:01 +10:00
NeilBrown	2f73d3c55d	md/raid1: make sure resync waits for conflicting writes to complete. The resync/recovery process for raid1 was recently changed so that writes could happen in parallel with resync providing they were in different regions of the device. There is a problem though: While a write request will always wait for conflicting resync to complete, a resync request will not always wait for conflicting writes to complete. Two changes are needed to fix this: 1/ raise_barrier (which waits until it is safe to do resync) must wait until current_window_requests is zero 2/ wait_barrier (which waits at the start of a new write request) must update current_window_requests if the request could possibly conflict with a concurrent resync. As concurrent writes and resync can lead to data loss, this patch is suitable for -stable. Fixes: `79ef3a8aa1` Cc: stable@vger.kernel.org (v3.13+) Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 11:26:01 +10:00
NeilBrown	669cc7ba77	md/raid1: clean up request counts properly in close_sync() If there are outstanding writes when close_sync is called, the change to ->start_next_window might cause them to decrement the wrong counter when they complete. Fix this by merging the two counters into the one that will be decremented. Having an incorrect value in a counter can cause raise_barrier() to hang, so this is suitable for -stable. Fixes: `79ef3a8aa1` cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 11:26:01 +10:00
NeilBrown	c6d119cf1b	md/raid1: be more cautious where we read-balance during resync. commit `79ef3a8aa1` made it possible for reads to happen concurrently with resync. This means that we need to be more careful where read_balancing is allowed during resync - we can no longer be sure that any resync that has already started will definitely finish. So keep read_balancing to before recovery_cp, which is conservative but safe. This bug makes it possible to read from a device that doesn't have up-to-date data, so it can cause data corruption. So it is suitable for any kernel since 3.11. Fixes: `79ef3a8aa1` cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 10:26:41 +10:00
NeilBrown	f0cc9a0571	md/raid1: intialise start_next_window for READ case to avoid hang r1_bio->start_next_window is not initialised in the READ case, so allow_barrier may incorrectly decrement conf->current_window_requests which can cause raise_barrier() to block forever. Fixes: `79ef3a8aa1` cc: stable@vger.kernel.org (v3.13+) Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-09-22 10:18:03 +10:00
Kirill Tkhai	f139caf2e8	sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule() schedule(), io_schedule() and schedule_timeout() always return with TASK_RUNNING state set, so one more setting is unnecessary. (All places in patch are visible good, only exception is kiblnd_scheduler() from: drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c Its schedule() is one line above standard 3 lines of unified diff) No places where set_current_state() is used for mb(). Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai Cc: Alasdair Kergon <agk@redhat.com> Cc: Anil Belur <askb23@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: David Airlie <airlied@linux.ie> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry Eremin <dmitry.eremin@intel.com> Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Isaac Huang <he.huang@intel.com> Cc: James E.J. Bottomley <JBottomley@parallels.com> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Liang Zhen <liang.zhen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Masaru Nomura <massa.nomura@gmail.com> Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleg Drokin <green@linuxhacker.ru> Cc: Peng Tao <bergwolf@gmail.com> Cc: Richard Weinberger <richard@nod.at> Cc: Robert Love <robert.w.love@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Ursula Braun <ursula.braun@de.ibm.com> Cc: Zi Shen Lim <zlim.lnx@gmail.com> Cc: devel@driverdev.osuosl.org Cc: dm-devel@redhat.com Cc: dri-devel@lists.freedesktop.org Cc: fcoe-devel@open-fcoe.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux390@de.ibm.com Cc: linux-afs@lists.infradead.org Cc: linux-cris-kernel@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-nfs@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: qla2xxx-upstream@qlogic.com Cc: user-mode-linux-devel@lists.sourceforge.net Cc: user-mode-linux-user@lists.sourceforge.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-19 12:35:17 +02:00
Anssi Hannula	40aa978ecc	dm cache: fix race causing dirty blocks to be marked as clean When a writeback or a promotion of a block is completed, the cell of that block is removed from the prison, the block is marked as clean, and the clear_dirty() callback of the cache policy is called. Unfortunately, performing those actions in this order allows an incoming new write bio for that block to come in before clearing the dirty status is completed and therefore possibly causing one of these two scenarios: Scenario A: Thread 1 Thread 2 cell_defer() . - cell removed from prison . - detained bios queued . . incoming write bio . remapped to cache . set_dirty() called, . but block already dirty . => it does nothing clear_dirty() . - block marked clean . - policy clear_dirty() called . Result: Block is marked clean even though it is actually dirty. No writeback will occur. Scenario B: Thread 1 Thread 2 cell_defer() . - cell removed from prison . - detained bios queued . clear_dirty() . - block marked clean . . incoming write bio . remapped to cache . set_dirty() called . - block marked dirty . - policy set_dirty() called - policy clear_dirty() called . Result: Block is properly marked as dirty, but policy thinks it is clean and therefore never asks us to writeback it. This case is visible in "dmsetup status" dirty block count (which normally decreases to 0 on a quiet device). Fix these issues by calling clear_dirty() before calling cell_defer(). Incoming bios for that block will then be detained in the cell and released only after clear_dirty() has completed, so the race will not occur. Found by inspecting the code after noticing spurious dirty counts (scenario B). Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-09-10 11:20:47 -04:00
Mikulas Patocka	d49ec52ff6	dm crypt: fix access beyond the end of allocated space The DM crypt target accesses memory beyond allocated space resulting in a crash on 32 bit x86 systems. This bug is very old (it dates back to 2.6.25 commit `3a7f6c990a` "dm crypt: use async crypto"). However, this bug was masked by the fact that kmalloc rounds the size up to the next power of two. This bug wasn't exposed until 3.17-rc1 commit `298a9fa08a` ("dm crypt: use per-bio data"). By switching to using per-bio data there was no longer any padding beyond the end of a dm-crypt allocated memory block. To minimize allocation overhead dm-crypt puts several structures into one block allocated with kmalloc. The block holds struct ablkcipher_request, cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))), struct dm_crypt_request and an initialization vector. The variable dmreq_start is set to offset of struct dm_crypt_request within this memory block. dm-crypt allocates the block with this size: cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size. When accessing the initialization vector, dm-crypt uses the function iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq + 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1). dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request structure. However, when dm-crypt accesses the initialization vector, it takes a pointer to the end of dm_crypt_request, aligns it, and then uses it as the initialization vector. If the end of dm_crypt_request is not aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the alignment causes the initialization vector to point beyond the allocated space. Fix this bug by calculating the variable iv_size_padding and adding it to the allocated size. Also correct the alignment of dm_crypt_request. struct dm_crypt_request is specific to dm-crypt (it isn't used by the crypto subsystem at all), so it is aligned on __alignof__(struct dm_crypt_request). 
Also align per_bio_data_size on ARCH_KMALLOC_MINALIGN, so that it is aligned as if the block was allocated with kmalloc. Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-28 14:24:09 -04:00
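The overrun described in the dm crypt entry above can be illustrated with plain offset arithmetic. The following is a minimal sketch, not the kernel code: it models the block as byte offsets from an aligned base, the function names and the example sizes are hypothetical, and the padding formula is one way to compute the missing bytes under that assumption.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def iv_end_offset(dmreq_start: int, dmreq_size: int, iv_size: int, alignmask: int) -> int:
    """Offset one past the last IV byte, mirroring the ALIGN(dmreq + 1, alignmask + 1) step."""
    iv_offset = align_up(dmreq_start + dmreq_size, alignmask + 1)
    return iv_offset + iv_size


def alloc_size_old(dmreq_start: int, dmreq_size: int, iv_size: int) -> int:
    """Size allocated before the fix: no room for alignment padding."""
    return dmreq_start + dmreq_size + iv_size


def iv_size_padding(dmreq_start: int, dmreq_size: int, alignmask: int) -> int:
    """Bytes needed to reach the next aligned offset after dm_crypt_request (assumed formula)."""
    return -(dmreq_start + dmreq_size) & alignmask


def alloc_size_fixed(dmreq_start: int, dmreq_size: int, iv_size: int, alignmask: int) -> int:
    """Size allocated once the padding is added."""
    return alloc_size_old(dmreq_start, dmreq_size, iv_size) + iv_size_padding(
        dmreq_start, dmreq_size, alignmask
    )


# Hypothetical sizes: the request ends at offset 136, which is not 16-byte aligned.
start, size, iv, mask = 100, 36, 16, 15
print(iv_end_offset(start, size, iv, mask))     # end of the IV as it is accessed
print(alloc_size_old(start, size, iv))          # end of the block as it was allocated
print(alloc_size_fixed(start, size, iv, mask))  # end of the block with padding
```

With these made-up numbers the aligned IV ends 8 bytes past the old allocation, and the padded allocation covers it exactly.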
Christoph Lameter	1f125e76f5	md: Replace __this_cpu_ptr with raw_cpu_ptr __this_cpu_ptr is being phased out. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-08-26 13:45:47 -04:00
NeilBrown	cb8b12b5d8	md/raid10: always initialise ->state on newly allocated r10_bio Most places which allocate an r10_bio zero the ->state, some don't. As the r10_bio comes from a mempool, and the allocation function uses kzalloc it is often zero anyway. But sometimes it isn't and it is best to be safe. I only noticed this because of the bug fixed by an earlier patch where the r10_bios allocated for a reshape were left around to be used by a subsequent resync. In that case the R10BIO_IsReshape flag caused problems. Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	e337aead3a	md/raid10: avoid memory leak on error path during reshape. If raid10 reshape fails to find somewhere to read a block from, it returns without freeing memory... Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	b39685526f	md/raid10: Fix memory leak when raid10 reshape completes. When a raid10 commences a resync/recovery/reshape it allocates some buffer space. When a resync/recovery completes the buffer space is freed. But not when the reshape completes. This can result in a small memory leak. There is a subtle side-effect of this bug. When a RAID10 is reshaped to a larger array (more devices), the reshape is immediately followed by a "resync" of the new space. This "resync" will use the buffer space which was allocated for "reshape". This can cause problems including a "BUG" in the SCSI layer. So this is suitable for -stable. Cc: stable@vger.kernel.org (v3.5+) Fixes: `3ea7daa5d7` Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	ce0b0a4695	md/raid10: fix memory leak when reshaping a RAID10. raid10 reshape clears unwanted bits from a bio->bi_flags using a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC was added. Since then it clears that bit but shouldn't. This results in a memory leak. So change to use the approved method of clearing unwanted bits. As this causes a memory leak which can consume all of memory the fix is suitable for -stable. Fixes: `a38352e0ac` Cc: stable@vger.kernel.org (v3.10+) Reported-by: mdraid.pkoch@dfgh.net (Peter Koch) Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	9c4bdf697c	md/raid6: avoid data corruption during recovery of double-degraded RAID6 During recovery of a double-degraded RAID6 it is possible for some blocks not to be recovered properly, leading to corruption. If a write happens to one block in a stripe that would be written to a missing device, and at the same time that stripe is recovering data to the other missing device, then that recovered data may not be written. This patch skips, in the double-degraded case, an optimisation that is only safe for single-degraded arrays. Bug was introduced in 2.6.32 and fix is suitable for any kernel since then. In an older kernel with separate handle_stripe5() and handle_stripe6() functions the patch must change handle_stripe6(). Cc: stable@vger.kernel.org (2.6.32+) Fixes: `6c0069c0ae` Cc: Yuri Tikhonov <yur@emcraft.com> Cc: Dan Williams <dan.j.williams@intel.com> Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in> Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423 Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Dan Williams <dan.j.williams@intel.com>	2014-08-18 14:49:46 +10:00
NeilBrown	a40687ff73	md/raid5: avoid livelock caused by non-aligned writes. If a stripe in a raid6 array received a write to each data block while the array is degraded, and if any of these writes to a missing device are not page-aligned, then a live-lock happens. In this case the P and Q blocks need to be read so that the part of the missing block which is not being updated by the write can be constructed. Due to a logic error, these blocks are not loaded, so the update cannot proceed and the stripe is 'handled' repeatedly in an infinite loop. This bug is unlikely as most writes are page aligned. However as it can lead to a livelock it is suitable for -stable. It was introduced in 3.16. Cc: stable@vger.kernel.org (v3.16) Fixed: `67f455486d` Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-18 14:49:41 +10:00
Linus Torvalds	ba368991f6	Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - Allow the thin target to be paired with any size external origin; also allow thin snapshots to be larger than the external origin. - Add support for quickly loading a repetitive pattern into the dm-switch target. - Use per-bio data in the dm-crypt target instead of always using a mempool for each allocation. Required switching to kmalloc alignment for the bio slab. - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag - Fix the dm-cache and dm-thin targets' export of the minimum_io_size to match the data block size -- this fixes an issue where mkfs.xfs would improperly infer raid striping was in place on the underlying storage. 
- Small cleanups in dm-io, dm-mpath and dm-cache * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: propagate QUEUE_FLAG_NO_SG_MERGE dm switch: efficiently support repetitive patterns dm switch: factor out switch_region_table_read dm cache: set minimum_io_size to cache's data block size dm thin: set minimum_io_size to pool's data block size dm crypt: use per-bio data block: use kmalloc alignment for bio slab dm table: make dm_table_supports_discards static dm cache metadata: use dm-space-map-metadata.h defined size limits dm cache: fail migrations in the do_worker error path dm cache: simplify deferred set reference count increments dm thin: relax external origin size constraints dm thin: switch to an atomic_t for tracking pending new block preparations dm mpath: eliminate pg_ready() wrapper dm io: simplify dec_count and sync_io	2014-08-14 09:17:56 -06:00
Linus Torvalds	d429a3639c	Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: "Nothing out of the ordinary here, this pull request contains: - A big round of fixes for bcache from Kent Overstreet, Slava Pestov, and Surbhi Palande. No new features, just a lot of fixes. - The usual round of drbd updates from Andreas Gruenbacher, Lars Ellenberg, and Philipp Reisner. - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei has taken it one step further and added support for actually using more than one queue. - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to complement the default behavior of adding to the tail of the queue. From Douglas Gilbert" * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits) bcache: Drop unneeded blk_sync_queue() calls bcache: add mutex lock for bch_is_open bcache: Correct printing of btree_gc_max_duration_ms bcache: try to set b->parent properly bcache: fix memory corruption in init error path bcache: fix crash with incomplete cache set bcache: Fix more early shutdown bugs bcache: fix use-after-free in btree_gc_coalesce() bcache: Fix an infinite loop in journal replay bcache: fix crash in bcache_btree_node_alloc_fail tracepoint bcache: bcache_write tracepoint was crashing bcache: fix typo in bch_bkey_equal_header bcache: Allocate bounce buffers with GFP_NOWAIT bcache: Make sure to pass GFP_WAIT to mempool_alloc() bcache: fix uninterruptible sleep in writeback thread bcache: wait for buckets when allocating new btree root bcache: fix crash on shutdown in passthrough mode bcache: fix lockdep warnings on shutdown bcache allocator: send discards with correct size bcache: Fix to remove the rcu_sched stalls. ...	2014-08-14 09:10:21 -06:00
Linus Torvalds	2213d7c29a	Merge tag 'md/3.17' of git://neil.brown.name/md Pull md updates from Neil Brown: "Most interesting is that md devices (major == 9) with minor numbers of 512 or more will no longer be created simply by opening a block device file. They can only be created by writing to /sys/module/md_mod/parameters/new_array The 'auto-create-on-open' semantic is cumbersome and we need to start moving away from it" * tag 'md/3.17' of git://neil.brown.name/md: md: don't allow bitmap file to be added to raid0/linear. md/raid0: check for bitmap compatability when changing raid levels. md: Recovery speed is wrong md: disable probing for md devices 512 and over. md/raid1,raid10: always abort recover on write error.	2014-08-11 07:02:35 -07:00
Jeff Moyer	200612ec33	dm table: propagate QUEUE_FLAG_NO_SG_MERGE Commit `05f1dd5` ("block: add queue flag for disabling SG merging") introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE. This gets set by default in blk_mq_init_queue for mq-enabled devices. The effect of the flag is to bypass the SG segment merging. Instead, the bio->bi_vcnt is used as the number of hardware segments. With a device mapper target on top of a device with QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments than a driver is prepared to handle. I ran into this when backporting the virtio_blk mq support. It triggered this BUG_ON, in virtio_queue_rq: BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems); The queue's max is set here: blk_queue_max_segments(q, vblk->sg_elems-2); Basically, what happens is that a bio is built up for the dm device (which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using bio_add_page. That path will call into __blk_recalc_rq_segments, so what you end up with is bi_phys_segments being much smaller than bi_vcnt (and bi_vcnt grows beyond the maximum sg elements). Then, when the bio is submitted, it gets cloned. When the cloned bio is submitted, it will end up in blk_recount_segments, here: if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags)) bio->bi_phys_segments = bio->bi_vcnt; and now we've set bio->bi_phys_segments to a number that is beyond what was registered as queue_max_segments by the driver. The right way to fix this is to propagate the queue flag up the stack. The rules for propagating the flag are simple: - if the flag is set for any underlying device, it must be set for the upper device - consequently, if the flag is not set for any underlying device, it should not be set for the upper device. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.16+	2014-08-10 20:54:49 -04:00
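The propagation rule stated in the dm table entry above reduces to a logical OR over the underlying devices. A minimal sketch of just that rule follows; the `Device` class and `upper_device_no_sg_merge` are hypothetical names for illustration and do not correspond to kernel structures.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for an underlying block device's queue state."""
    name: str
    no_sg_merge: bool


def upper_device_no_sg_merge(underlying: list[Device]) -> bool:
    """The stacked device gets the flag if any underlying device has it."""
    return any(dev.no_sg_merge for dev in underlying)


# One mq-enabled device with the flag set is enough to set it on the stacked device.
print(upper_device_no_sg_merge([Device("vda", True), Device("sdb", False)]))
# With no underlying device carrying the flag, the stacked device stays without it.
print(upper_device_no_sg_merge([Device("sda", False), Device("sdb", False)]))
```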
NeilBrown	d66b1b395a	md: don't allow bitmap file to be added to raid0/linear. An array can only accept a bitmap if it will call bitmap_daemon_work periodically, which means it needs a thread running. If there is no thread, don't allow a bitmap to be added. Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-08 15:43:20 +10:00
NeilBrown	a8461a61c2	md/raid0: check for bitmap compatability when changing raid levels. If an array has a bitmap, then it cannot be converted to raid0. Reported-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-08 15:33:17 +10:00
Xiao Ni	ac7e50a383	md: Recovery speed is wrong When we calculate the speed of recovery, the numerator contains the sectors for which recovery is done. The sectors that have not yet finished recovery need to be subtracted from it. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-08 12:11:25 +10:00
Linus Torvalds	98959948a7	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Move the nohz kick code out of the scheduler tick to a dedicated IPI, from Frederic Weisbecker. This necessitated quite some background infrastructure rework, including: * Clean up some irq-work internals * Implement remote irq-work * Implement nohz kick on top of remote irq-work * Move full dynticks timer enqueue notification to new kick * Move multi-task notification to new kick * Remove unnecessary barriers on multi-task notification - Remove proliferation of wait_on_bit() action functions and allow wait_on_bit_action() functions to support a timeout. (Neil Brown) - Another round of sched/numa improvements, cleanups and fixes. (Rik van Riel) - Implement fast idling of CPUs when the system is partially loaded, for better scalability. (Tim Chen) - Restructure and fix the CPU hotplug handling code that may leave cfs_rq and rt_rq's throttled when tasks are migrated away from a dead cpu. (Kirill Tkhai) - Robustify the sched topology setup code. (Peter Zijlstra) - Improve sched_feat() handling wrt. static_keys (Jason Baron) - Misc fixes. 
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) sched/fair: Fix 'make xmldocs' warning caused by missing description sched: Use macro for magic number of -1 for setparam sched: Robustify topology setup sched: Fix sched_setparam() policy == -1 logic sched: Allow wait_on_bit_action() functions to support a timeout sched: Remove proliferation of wait_on_bit() action functions sched/numa: Revert "Use effective_load() to balance NUMA loads" sched: Fix static_key race with sched_feat() sched: Remove extra static_key*() function indirection sched/rt: Fix replenish_dl_entity() comments to match the current upstream code sched: Transform resched_task() into resched_curr() sched/deadline: Kill task_struct->pi_top_task sched: Rework check_for_tasks() sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime() sched/fair: Disable runtime_enabled on dying rq sched/numa: Change scan period code to match intent sched/numa: Rework best node setting in task_numa_migrate() sched/numa: Examine a task move when examining a task swap sched/numa: Simplify task_numa_compare() sched/numa: Use effective_load() to balance NUMA loads ...	2014-08-04 16:23:30 -07:00
Kent Overstreet	0781c8748c	bcache: Drop unneeded blk_sync_queue() calls this is needed for the queue/block device we created (it's done by blk_cleanup_queue() which we do call) - but calling it for the block devices we only opened is pointless. Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2	2014-08-04 15:23:04 -07:00
Jianjian Huo	789d21dbd9	bcache: add mutex lock for bch_is_open Since bch_is_open will iterate linked list bch_cache_sets and uncached_devices, it needs bch_register_lock. Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>	2014-08-04 15:23:04 -07:00
Surbhi Palande	5b25abade2	bcache: Correct printing of btree_gc_max_duration_ms time_stats::btree_gc_max_duration_mc is not bit shifted by 8 Fixes BUG #138 Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a Signed-off-by: Surbhi Palande <sap@daterainc.com>	2014-08-04 15:23:04 -07:00
Slava Pestov	2452cc8906	bcache: try to set b->parent properly bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size before; now it passes. Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1	2014-08-04 15:23:04 -07:00
Slava Pestov	c9a78332b4	bcache: fix memory corruption in init error path If register_cache_set() failed, we would touch ca->set after it had already been freed. Also, fix an assertion to catch this. Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d	2014-08-04 15:23:04 -07:00
Slava Pestov	bf0c55c986	bcache: fix crash with incomplete cache set Change-Id: I6abde52afe917633480caaf4e2518f42a816d886	2014-08-04 15:23:04 -07:00
Kent Overstreet	d83353b319	bcache: Fix more early shutdown bugs Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:04 -07:00
Slava Pestov	400ffaa2ac	bcache: fix use-after-free in btree_gc_coalesce() If we goto out_nocoalesce after we free new_nodes[0], we end up freeing new_nodes[0] again. This was generating a lockdep warning. The fix is to set new_nodes[0] to NULL, since the out_nocoalesce path safely ignores NULL entries in the new_nodes array. This regression was introduced in 2d7f9531. Change-Id: I76564d7257800583214376b4bacf236cda90c89c	2014-08-04 15:23:04 -07:00
Kent Overstreet	6b708de64a	bcache: Fix an infinite loop in journal replay When running with multiple cache devices, if one of the devices has a completely empty journal but we'd already found some journal entries on a previous device we'd go into an infinite loop. Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:03 -07:00
Slava Pestov	913dc33fb2	bcache: fix crash in bcache_btree_node_alloc_fail tracepoint 'b' was NULL. Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0	2014-08-04 15:23:03 -07:00
Slava Pestov	60ae81eee8	bcache: bcache_write tracepoint was crashing Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:03 -07:00
Slava Pestov	8e09480806	bcache: fix typo in bch_bkey_equal_header Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:03 -07:00
Kent Overstreet	501d52a90c	bcache: Allocate bounce buffers with GFP_NOWAIT There's no point in blocking on these allocations, since our fallback paths will probably go faster than blocking. Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c	2014-08-04 15:23:03 -07:00
Kent Overstreet	bcf090e004	bcache: Make sure to pass GFP_WAIT to mempool_alloc() this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT. bcache uses GFP_NOWAIT in various other places where we have a fallback, circuits must've gotten crossed when writing this code or something. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:03 -07:00
Slava Pestov	9e5c353510	bcache: fix uninterruptible sleep in writeback thread There were two issues here: - writeback thread did not start until the device first became dirty - writeback thread used uninterruptible sleep once running Without this patch I see kernel warnings printed and a load average of 1.52 after booting my test VM. With this patch the warnings are gone and the load average is near 0.00 as expected. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:03 -07:00
Slava Pestov	c5aa4a3157	bcache: wait for buckets when allocating new btree root Tested: - sometimes bcache_tier test would hang on startup with a failure to allocate the btree root -- no longer seeing this Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:03 -07:00
Slava Pestov	a664d0f05a	bcache: fix crash on shutdown in passthrough mode We never started the writeback thread in this case, so don't stop it.	2014-08-04 15:23:03 -07:00
Slava Pestov	e5112201c1	bcache: fix lockdep warnings on shutdown	2014-08-04 15:23:03 -07:00
Slava Pestov	8b326d3a2a	bcache allocator: send discards with correct size	2014-08-04 15:23:03 -07:00
Surbhi Palande	dbd810ab67	bcache: Fix to remove the rcu_sched stalls. while loop was executing infinitely. This fix ends the while loop gracefully. Signed-off-by: Surbhi Palande <sap@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:02 -07:00
Kent Overstreet	9aa61a992a	bcache: Fix a journal replay bug Journal replay wasn't validating pointers with bch_extent_invalid() before dereferencing them; fixed. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:02 -07:00
Kent Overstreet	5b1016e62f	bcache: Fix a bug when detaching After detaching a backing device from a cache set, a bit wasn't getting reset meaning the second detach wouldn't work correctly. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-08-04 15:23:02 -07:00
Mikulas Patocka	56b1ebf2d9	dm switch: efficiently support repetitive patterns Add support for quickly loading a repetitive pattern into the dm-switch target. In the "set_region_mappings" message, the user may now use "Rn,m" as one of the arguments. "n" and "m" are hexadecimal numbers. The "Rn,m" argument repeats the last "n" arguments in the following "m" slots. For example: dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10 is equivalent to dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \ :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 Requested-by: Jay Wang <jwang@nimblestorage.com> Tested-by: Jay Wang <jwang@nimblestorage.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:37 -04:00
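The "Rn,m" expansion in the dm switch entry above can be sketched as a purely textual rewrite of the message arguments. This is an illustration only: `expand_region_mappings` is a hypothetical helper, it works on argument strings rather than on the kernel's region table, and it assumes each repeated slot is written as ":path" using the path of the slot "n" positions earlier.

```python
def expand_region_mappings(args: list[str]) -> list[str]:
    """Expand "Rn,m" arguments (n and m in hexadecimal) into explicit ":path" slots."""
    expanded: list[str] = []
    for arg in args:
        if arg.startswith("R"):
            n_text, m_text = arg[1:].split(",")
            n, m = int(n_text, 16), int(m_text, 16)
            if n == 0 or n > len(expanded):
                raise ValueError(f"cannot repeat the last {n} arguments in {arg!r}")
            for _ in range(m):
                # Each new slot copies the path of the slot n positions back,
                # so the last n paths cycle through the following m slots.
                path = expanded[-n].split(":")[1]
                expanded.append(":" + path)
        else:
            expanded.append(arg)
    return expanded


# The example from the commit message: 0x10 = 16 repeated slots after "1000:1 :2".
print(" ".join(expand_region_mappings(["1000:1", ":2", "R2,10"])))
```

Copying from the slot "n" positions back, rather than slicing the last "n" arguments once, keeps the loop correct when "m" is not a multiple of "n".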
Mikulas Patocka	99eb1908e6	dm switch: factor out switch_region_table_read Move code that reads the table to a switch_region_table_read. It will be needed for the next commit. No functional change. Tested-by: Jay Wang <jwang@nimblestorage.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:36 -04:00
Mike Snitzer	b02465308f	dm cache: set minimum_io_size to cache's data block size Before, if the block layer's limit stacking didn't establish an optimal_io_size that was compatible with the cache's data block size we'd set optimal_io_size to the data block size and minimum_io_size to 0 (which the block layer adjusts to be physical_block_size). Update cache_io_hints() to set both minimum_io_size and optimal_io_size to the cache's data block size. This fixes an issue where mkfs.xfs would create more XFS Allocation Groups on cache volumes than on a normal linear LV of comparable size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:36 -04:00
Mike Snitzer	fdfb4c8c1a	dm thin: set minimum_io_size to pool's data block size Before, if the block layer's limit stacking didn't establish an optimal_io_size that was compatible with the thin-pool's data block size we'd set optimal_io_size to the data block size and minimum_io_size to 0 (which the block layer adjusts to be physical_block_size). Update pool_io_hints() to set both minimum_io_size and optimal_io_size to the thin-pool's data block size. This fixes an issue reported where mkfs.xfs would create more XFS Allocation Groups on thinp volumes than on a normal linear LV of comparable size, see: https://bugzilla.redhat.com/show_bug.cgi?id=1003227 Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:35 -04:00
Mikulas Patocka	298a9fa08a	dm crypt: use per-bio data Change dm-crypt so that it uses auxiliary data allocated with the bio. Dm-crypt requires two allocations per request - struct dm_crypt_io and struct ablkcipher_request (with other data appended to it). It previously only used mempool allocations. Some requests may require more dm_crypt_ios and ablkcipher_requests, however most requests need just one of each of these two structures to complete. This patch changes it so that the first dm_crypt_io and ablkcipher_request are allocated with the bio (using target per_bio_data_size option). If the request needs additional values, they are allocated from the mempool. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:35 -04:00
Mikulas Patocka	a7ffb6a533	dm table: make dm_table_supports_discards static The function dm_table_supports_discards is only called from dm-table.c:dm_table_set_restrictions(). So move it above dm_table_set_restrictions and make it static. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:34 -04:00
Mike Snitzer	895b47d798	dm cache metadata: use dm-space-map-metadata.h defined size limits Commit `7d48935e` cleaned up the persistent-data's space-map-metadata limits by elevating them to dm-space-map-metadata.h. Update dm-cache-metadata to use these same limits. The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-08-01 12:30:33 -04:00
Joe Thornber	304affaa88	dm cache: fail migrations in the do_worker error path Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:33 -04:00
Joe Thornber	8c081b52c6	dm cache: simplify deferred set reference count increments Factor out inc_and_issue and inc_ds helpers to simplify deferred set reference count increments. Also cleanup cache_map to consistently call cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED. No functional change. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:32 -04:00
Joe Thornber	e5aea7b49f	dm thin: relax external origin size constraints Track the size of any external origin. Previously the external origin's size had to be a multiple of the thin-pool's block size, that is no longer a requirement. In addition, snapshots that are larger than the external origin are now supported. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:32 -04:00
Joe Thornber	50f3c3efdd	dm thin: switch to an atomic_t for tracking pending new block preparations Previously we used separate boolean values to track quiescing and copying actions. By switching to an atomic_t we can support blocks that need a partial copy and partial zero. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:31 -04:00
Mike Snitzer	6afbc01d75	dm mpath: eliminate pg_ready() wrapper pg_ready() is not comprehensive in its logic and only serves to obfuscate code. Replace pg_ready() with the appropriate logic in multipath_map(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:31 -04:00
Joe Thornber	97e7cdf12b	dm io: simplify dec_count and sync_io Remove the io struct off the stack in sync_io() and allocate it from the mempool like is done in async_io(). dec_count() now always calls a callback function and always frees the io struct back to the mempool (so sync_io and async_io share this pattern). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:30 -04:00
Anssi Hannula	44fa816bb7	dm cache: fix race affecting dirty block count nr_dirty is updated without locking, causing it to drift so that it is non-zero (either a small positive integer, or a very large one when an underflow occurs) even when there are no actual dirty blocks. This was due to a race between the workqueue and map function accessing nr_dirty in parallel without proper protection. People were seeing under runs due to a race on increment/decrement of nr_dirty, see: https://lkml.org/lkml/2014/6/3/648 Fix this by using an atomic_t for nr_dirty. Reported-by: roma1390@gmail.com Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-08-01 12:25:22 -04:00
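
The "Rn,m" repeat argument described in commit 56b1ebf2d9 can be illustrated with a small sketch. This is not the kernel's code: `expand_region_args` is a hypothetical helper that expands the arguments textually, assuming each argument is either "region:path" or ":path" and that the repeated arguments occupy consecutive slots. It does no validation beyond checking the repeat count.

```python
def expand_region_args(args):
    """Expand "Rn,m" repeat arguments into explicit ":path" arguments.

    Hypothetical illustration only; the dm-switch target works on its
    region table, not on argument strings.
    """
    expanded = []  # arguments with every repeat written out
    paths = []     # path text assigned by each expanded argument, in slot order
    for arg in args:
        if arg.startswith("R"):
            n_text, m_text = arg[1:].split(",")
            n, m = int(n_text, 16), int(m_text, 16)  # both are hexadecimal
            if n == 0 or n > len(paths):
                raise ValueError(f"{arg}: cannot repeat the last {n} arguments")
            for _ in range(m):
                # Each new slot copies the path used n slots earlier.
                path = paths[-n]
                paths.append(path)
                expanded.append(":" + path)
        else:
            # "region:path" or ":path"; the path is the text after the colon.
            paths.append(arg.split(":", 1)[1])
            expanded.append(arg)
    return expanded


if __name__ == "__main__":
    # The example from the commit message: R2,10 fills 0x10 = 16 slots.
    print(" ".join(expand_region_args(["1000:1", ":2", "R2,10"])))
```

Because "m" is hexadecimal, "R2,10" fills 16 slots, which matches the 16 explicit ":1 :2" arguments in the commit's equivalent command.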


3539 Commits