- DM core cleanups

- blk-mq request-based DM no longer uses any mempools now that partial
     completions are no longer handled as part of cloned requests
 
 - DM raid cleanups and support for MD raid0
 
 - DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy
   - smq is the new default dm-cache policy
 
 - DM thinp cleanups and much more efficient large discard support
 
 - DM statistics support for request-based DM and nanosecond resolution
   timestamps
 
 - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJViE/tAAoJEMUj8QotnQNag8wIAMhcmy46qGCy6SrOew6D8EQ0
 4Ielsr5ZI67Q9pZVI1x/Zdt5GfDUGXiSEy/y8xoIuJfmsRXwnZ/gkOUyWpSUW8dH
 mgPUUpGF0aN2D/66P3chgm39rVZEt3crDY+SQu02Fm86JKlMUomFbWKgMXpsM6Ga
 HHW80zLF9Ca6D5m8xMze/ic1KBJtA9zaXcO9xnMCBym8mSaORtgwoCzKAgy43H7U
 KIBwPKR/y+c7SaKWGxXwxxhTKZDyP4FbgtiJ2Dc9yBDoD0RzbyuY/EFMEUPEbEMv
 e9fFHNEzmKlsNVY2vYXkVH4TdldWrrjkmYj/JVtz8cifoupWOOnDTjb2aqjAIww=
 =DKCC
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM core cleanups:

     * blk-mq request-based DM no longer uses any mempools now that
       partial completions are no longer handled as part of cloned
       requests

 - DM raid cleanups and support for MD raid0

 - DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy

     * smq is the new default dm-cache policy

 - DM thinp cleanups and much more efficient large discard support

 - DM statistics support for request-based DM and nanosecond resolution
   timestamps

 - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt

* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
  dm stats: add support for request-based DM devices
  dm stats: collect and report histogram of IO latencies
  dm stats: support precise timestamps
  dm stats: fix divide by zero if 'number_of_areas' arg is zero
  dm cache: switch the "default" cache replacement policy from mq to smq
  dm space map metadata: fix occasional leak of a metadata block on resize
  dm thin metadata: fix a race when entering fail mode
  dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
  dm thin: range discard support
  dm thin metadata: add dm_thin_remove_range()
  dm thin metadata: add dm_thin_find_mapped_range()
  dm btree: add dm_btree_remove_leaves()
  dm stats: Use kvfree() in dm_kvfree()
  dm cache: age and write back cache entries even without active IO
  dm cache: prefix all DMERR and DMINFO messages with cache device name
  dm cache: add fail io mode and needs_check flag
  dm cache: wake the worker thread every time we free a migration object
  dm cache: add stochastic-multi-queue (smq) policy
  dm cache: boost promotion of blocks that will be overwritten
  dm cache: defer whole cells
  ...
This commit is contained in:
Linus Torvalds 2015-06-25 16:34:39 -07:00
commit 6597ac8a51
33 changed files with 4228 additions and 738 deletions

View File

@ -25,10 +25,10 @@ trying to see when the io scheduler has let the ios run.
Overview of supplied cache replacement policies
===============================================
multiqueue
----------
multiqueue (mq)
---------------
This policy is the default.
This policy has been deprecated in favor of the smq policy (see below).
The multiqueue policy has three sets of 16 queues: one set for entries
waiting for the cache and another two for those in the cache (a set for
@ -73,6 +73,67 @@ If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
Stochastic multiqueue (smq)
---------------------------
This policy is the default.
The stochastic multi-queue (smq) policy addresses some of the problems
with the multiqueue (mq) policy.
The smq policy (vs mq) offers the promise of less memory utilization,
improved performance and increased adaptability in the face of changing
workloads. SMQ also does not have any cumbersome tuning knobs.
Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target. Doing so will cause all of the
mq policy's hints to be dropped. Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
Memory usage:
The mq policy uses a lot of memory; 88 bytes per cache block on a 64
bit machine.
SMQ uses 28bit indexes to implement it's data structures rather than
pointers. It avoids storing an explicit hit count for each block. It
has a 'hotspot' queue rather than a pre cache which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).
All these mean smq uses ~25bytes per cache block. Still a lot of
memory, but a substantial improvement nontheless.
Level balancing:
MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)). This means the bottom
levels generally have the most entries, and the top ones have very
few. Having unbalanced levels like this reduces the efficacy of the
multiqueue.
SMQ does not maintain a hit count, instead it swaps hit entries with
the least recently used entry from the level above. The over all
ordering being a side effect of this stochastic process. With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
Adaptability:
The MQ policy maintains a hit count for each cache block. For a
different block to get promoted to the cache it's hit count has to
exceed the lowest currently in the cache. This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.
SMQ doesn't maintain hit counts, so a lot of this problem just goes
away. In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote. If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly.
Performance:
Testing SMQ shows substantially better performance than MQ.
cleaner
-------

View File

@ -221,6 +221,7 @@ Status
<#read hits> <#read misses> <#write hits> <#write misses>
<#demotions> <#promotions> <#dirty> <#features> <features>*
<#core args> <core args>* <policy name> <#policy args> <policy args>*
<cache metadata mode>
metadata block size : Fixed block size for each metadata block in
sectors
@ -251,8 +252,12 @@ core args : Key/value pairs for tuning the core
e.g. migration_threshold
policy name : Name of the policy
#policy args : Number of policy arguments to follow (must be even)
policy args : Key/value pairs
e.g. sequential_threshold
policy args : Key/value pairs e.g. sequential_threshold
cache metadata mode : ro if read-only, rw if read-write
In serious cases where even a read-only mode is deemed unsafe
no further I/O will be permitted and the status will just
contain the string 'Fail'. The userspace recovery tools
should then be used.
Messages
--------

View File

@ -224,3 +224,5 @@ Version History
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1 Add ability to restore transiently failed devices on resume.
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0 Add discard support (and devices_handle_discard_safely module param).
1.7.0 Add support for MD RAID0 mappings.

View File

@ -13,9 +13,14 @@ the range specified.
The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats (see:
Documentation/iostats.txt). But two extra counters (12 and 13) are
provided: total time spent reading and writing in milliseconds. All
these counters may be accessed by sending the @stats_print message to
the appropriate DM device via dmsetup.
provided: total time spent reading and writing. When the histogram
argument is used, the 14th parameter is reported that represents the
histogram of latencies. All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.
The reported times are in milliseconds and the granularity depends on
the kernel ticks. When the option precise_timestamps is used, the
reported times are in nanoseconds.
Each region has a corresponding unique identifier, which we call a
region_id, that is assigned when the region is created. The region_id
@ -33,7 +38,9 @@ memory is used by reading
Messages
========
@stats_create <range> <step> [<program_id> [<aux_data>]]
@stats_create <range> <step>
[<number_of_optional_arguments> <optional_arguments>...]
[<program_id> [<aux_data>]]
Create a new region and return the region_id.
@ -48,6 +55,29 @@ Messages
"/<number_of_areas>" - the range is subdivided into the specified
number of areas.
<number_of_optional_arguments>
The number of optional arguments
<optional_arguments>
The following optional arguments are supported
precise_timestamps - use precise timer with nanosecond resolution
instead of the "jiffies" variable. When this argument is
used, the resulting times are in nanoseconds instead of
milliseconds. Precise timestamps are a little bit slower
to obtain than jiffies-based timestamps.
histogram:n1,n2,n3,n4,... - collect histogram of latencies. The
numbers n1, n2, etc are times that represent the boundaries
of the histogram. If precise_timestamps is not used, the
times are in milliseconds, otherwise they are in
nanoseconds. For each range, the kernel will report the
number of requests that completed within this range. For
example, if we use "histogram:10,20,30", the kernel will
report four numbers a:b:c:d. a is the number of requests
that took 0-10 ms to complete, b is the number of requests
that took 10-20 ms to complete, c is the number of requests
that took 20-30 ms to complete and d is the number of
requests that took more than 30 ms to complete.
<program_id>
An optional parameter. A name that uniquely identifies
the userspace owner of the range. This groups ranges together
@ -55,6 +85,9 @@ Messages
created and ignore those created by others.
The kernel returns this string back in the output of
@stats_list message, but it doesn't use it for anything else.
If we omit the number of optional arguments, program id must not
be a number, otherwise it would be interpreted as the number of
optional arguments.
<aux_data>
An optional parameter. A word that provides auxiliary data

View File

@ -304,6 +304,18 @@ config DM_CACHE_MQ
This is meant to be a general purpose policy. It prioritises
reads over writes.
config DM_CACHE_SMQ
tristate "Stochastic MQ Cache Policy (EXPERIMENTAL)"
depends on DM_CACHE
default y
---help---
A cache policy that uses a multiqueue ordered by recent hits
to select which blocks should be promoted and demoted.
This is meant to be a general purpose policy. It prioritises
reads over writes. This SMQ policy (vs MQ) offers the promise
of less memory utilization, improved performance and increased
adaptability in the face of changing workloads.
config DM_CACHE_CLEANER
tristate "Cleaner Cache Policy (EXPERIMENTAL)"
depends on DM_CACHE

View File

@ -13,6 +13,7 @@ dm-log-userspace-y \
dm-thin-pool-y += dm-thin.o dm-thin-metadata.o
dm-cache-y += dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o
dm-cache-mq-y += dm-cache-policy-mq.o
dm-cache-smq-y += dm-cache-policy-smq.o
dm-cache-cleaner-y += dm-cache-policy-cleaner.o
dm-era-y += dm-era-target.o
md-mod-y += md.o bitmap.o
@ -54,6 +55,7 @@ obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
obj-$(CONFIG_DM_VERITY) += dm-verity.o
obj-$(CONFIG_DM_CACHE) += dm-cache.o
obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o
obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o
obj-$(CONFIG_DM_ERA) += dm-era.o
obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o

View File

@ -255,6 +255,32 @@ void dm_cell_visit_release(struct dm_bio_prison *prison,
}
EXPORT_SYMBOL_GPL(dm_cell_visit_release);
static int __promote_or_release(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell)
{
if (bio_list_empty(&cell->bios)) {
rb_erase(&cell->node, &prison->cells);
return 1;
}
cell->holder = bio_list_pop(&cell->bios);
return 0;
}
int dm_cell_promote_or_release(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell)
{
int r;
unsigned long flags;
spin_lock_irqsave(&prison->lock, flags);
r = __promote_or_release(prison, cell);
spin_unlock_irqrestore(&prison->lock, flags);
return r;
}
EXPORT_SYMBOL_GPL(dm_cell_promote_or_release);
/*----------------------------------------------------------------*/
#define DEFERRED_SET_SIZE 64

View File

@ -101,6 +101,19 @@ void dm_cell_visit_release(struct dm_bio_prison *prison,
void (*visit_fn)(void *, struct dm_bio_prison_cell *),
void *context, struct dm_bio_prison_cell *cell);
/*
* Rather than always releasing the prisoners in a cell, the client may
* want to promote one of them to be the new holder. There is a race here
* though between releasing an empty cell, and other threads adding new
* inmates. So this function makes the decision with its lock held.
*
* This function can have two outcomes:
* i) An inmate is promoted to be the holder of the cell (return value of 0).
* ii) The cell has no inmate for promotion and is released (return value of 1).
*/
int dm_cell_promote_or_release(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell);
/*----------------------------------------------------------------*/
/*

View File

@ -39,6 +39,8 @@
enum superblock_flag_bits {
/* for spotting crashes that would invalidate the dirty bitset */
CLEAN_SHUTDOWN,
/* metadata must be checked using the tools */
NEEDS_CHECK,
};
/*
@ -107,6 +109,7 @@ struct dm_cache_metadata {
struct dm_disk_bitset discard_info;
struct rw_semaphore root_lock;
unsigned long flags;
dm_block_t root;
dm_block_t hint_root;
dm_block_t discard_root;
@ -129,6 +132,14 @@ struct dm_cache_metadata {
* buffer before the superblock is locked and updated.
*/
__u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE];
/*
* Set if a transaction has to be aborted but the attempt to roll
* back to the previous (good) transaction failed. The only
* metadata operation permissible in this state is the closing of
* the device.
*/
bool fail_io:1;
};
/*-------------------------------------------------------------------
@ -527,6 +538,7 @@ static unsigned long clear_clean_shutdown(unsigned long flags)
static void read_superblock_fields(struct dm_cache_metadata *cmd,
struct cache_disk_superblock *disk_super)
{
cmd->flags = le32_to_cpu(disk_super->flags);
cmd->root = le64_to_cpu(disk_super->mapping_root);
cmd->hint_root = le64_to_cpu(disk_super->hint_root);
cmd->discard_root = le64_to_cpu(disk_super->discard_root);
@ -625,6 +637,7 @@ static int __commit_transaction(struct dm_cache_metadata *cmd,
if (mutator)
update_flags(disk_super, mutator);
disk_super->flags = cpu_to_le32(cmd->flags);
disk_super->mapping_root = cpu_to_le64(cmd->root);
disk_super->hint_root = cpu_to_le64(cmd->hint_root);
disk_super->discard_root = cpu_to_le64(cmd->discard_root);
@ -693,6 +706,7 @@ static struct dm_cache_metadata *metadata_open(struct block_device *bdev,
cmd->cache_blocks = 0;
cmd->policy_hint_size = policy_hint_size;
cmd->changed = true;
cmd->fail_io = false;
r = __create_persistent_data_objects(cmd, may_format_device);
if (r) {
@ -796,7 +810,8 @@ void dm_cache_metadata_close(struct dm_cache_metadata *cmd)
list_del(&cmd->list);
mutex_unlock(&table_lock);
__destroy_persistent_data_objects(cmd);
if (!cmd->fail_io)
__destroy_persistent_data_objects(cmd);
kfree(cmd);
}
}
@ -848,13 +863,26 @@ static int blocks_are_unmapped_or_clean(struct dm_cache_metadata *cmd,
return 0;
}
#define WRITE_LOCK(cmd) \
if (cmd->fail_io || dm_bm_is_read_only(cmd->bm)) \
return -EINVAL; \
down_write(&cmd->root_lock)
#define WRITE_LOCK_VOID(cmd) \
if (cmd->fail_io || dm_bm_is_read_only(cmd->bm)) \
return; \
down_write(&cmd->root_lock)
#define WRITE_UNLOCK(cmd) \
up_write(&cmd->root_lock)
int dm_cache_resize(struct dm_cache_metadata *cmd, dm_cblock_t new_cache_size)
{
int r;
bool clean;
__le64 null_mapping = pack_value(0, 0);
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
__dm_bless_for_disk(&null_mapping);
if (from_cblock(new_cache_size) < from_cblock(cmd->cache_blocks)) {
@ -880,7 +908,7 @@ int dm_cache_resize(struct dm_cache_metadata *cmd, dm_cblock_t new_cache_size)
cmd->changed = true;
out:
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -891,7 +919,7 @@ int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd,
{
int r;
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
r = dm_bitset_resize(&cmd->discard_info,
cmd->discard_root,
from_dblock(cmd->discard_nr_blocks),
@ -903,7 +931,7 @@ int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd,
}
cmd->changed = true;
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -946,9 +974,9 @@ int dm_cache_set_discard(struct dm_cache_metadata *cmd,
{
int r;
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
r = __discard(cmd, dblock, discard);
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -1020,9 +1048,9 @@ int dm_cache_remove_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock)
{
int r;
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
r = __remove(cmd, cblock);
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -1048,9 +1076,9 @@ int dm_cache_insert_mapping(struct dm_cache_metadata *cmd,
{
int r;
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
r = __insert(cmd, cblock, oblock);
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -1234,9 +1262,9 @@ int dm_cache_set_dirty(struct dm_cache_metadata *cmd,
{
int r;
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
r = __dirty(cmd, cblock, dirty);
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -1252,9 +1280,9 @@ void dm_cache_metadata_get_stats(struct dm_cache_metadata *cmd,
void dm_cache_metadata_set_stats(struct dm_cache_metadata *cmd,
struct dm_cache_statistics *stats)
{
down_write(&cmd->root_lock);
WRITE_LOCK_VOID(cmd);
cmd->stats = *stats;
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
}
int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown)
@ -1263,7 +1291,7 @@ int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown)
flags_mutator mutator = (clean_shutdown ? set_clean_shutdown :
clear_clean_shutdown);
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
r = __commit_transaction(cmd, mutator);
if (r)
goto out;
@ -1271,7 +1299,7 @@ int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown)
r = __begin_transaction(cmd);
out:
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -1376,9 +1404,9 @@ int dm_cache_write_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *
{
int r;
down_write(&cmd->root_lock);
WRITE_LOCK(cmd);
r = write_hints(cmd, policy);
up_write(&cmd->root_lock);
WRITE_UNLOCK(cmd);
return r;
}
@ -1387,3 +1415,70 @@ int dm_cache_metadata_all_clean(struct dm_cache_metadata *cmd, bool *result)
{
return blocks_are_unmapped_or_clean(cmd, 0, cmd->cache_blocks, result);
}
void dm_cache_metadata_set_read_only(struct dm_cache_metadata *cmd)
{
WRITE_LOCK_VOID(cmd);
dm_bm_set_read_only(cmd->bm);
WRITE_UNLOCK(cmd);
}
void dm_cache_metadata_set_read_write(struct dm_cache_metadata *cmd)
{
WRITE_LOCK_VOID(cmd);
dm_bm_set_read_write(cmd->bm);
WRITE_UNLOCK(cmd);
}
int dm_cache_metadata_set_needs_check(struct dm_cache_metadata *cmd)
{
int r;
struct dm_block *sblock;
struct cache_disk_superblock *disk_super;
/*
* We ignore fail_io for this function.
*/
down_write(&cmd->root_lock);
set_bit(NEEDS_CHECK, &cmd->flags);
r = superblock_lock(cmd, &sblock);
if (r) {
DMERR("couldn't read superblock");
goto out;
}
disk_super = dm_block_data(sblock);
disk_super->flags = cpu_to_le32(cmd->flags);
dm_bm_unlock(sblock);
out:
up_write(&cmd->root_lock);
return r;
}
bool dm_cache_metadata_needs_check(struct dm_cache_metadata *cmd)
{
bool needs_check;
down_read(&cmd->root_lock);
needs_check = !!test_bit(NEEDS_CHECK, &cmd->flags);
up_read(&cmd->root_lock);
return needs_check;
}
int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
{
int r;
WRITE_LOCK(cmd);
__destroy_persistent_data_objects(cmd);
r = __create_persistent_data_objects(cmd, false);
if (r)
cmd->fail_io = true;
WRITE_UNLOCK(cmd);
return r;
}

View File

@ -102,6 +102,10 @@ struct dm_cache_statistics {
void dm_cache_metadata_get_stats(struct dm_cache_metadata *cmd,
struct dm_cache_statistics *stats);
/*
* 'void' because it's no big deal if it fails.
*/
void dm_cache_metadata_set_stats(struct dm_cache_metadata *cmd,
struct dm_cache_statistics *stats);
@ -133,6 +137,12 @@ int dm_cache_write_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *
*/
int dm_cache_metadata_all_clean(struct dm_cache_metadata *cmd, bool *result);
bool dm_cache_metadata_needs_check(struct dm_cache_metadata *cmd);
int dm_cache_metadata_set_needs_check(struct dm_cache_metadata *cmd);
void dm_cache_metadata_set_read_only(struct dm_cache_metadata *cmd);
void dm_cache_metadata_set_read_write(struct dm_cache_metadata *cmd);
int dm_cache_metadata_abort(struct dm_cache_metadata *cmd);
/*----------------------------------------------------------------*/
#endif /* DM_CACHE_METADATA_H */

View File

@ -171,7 +171,8 @@ static void remove_cache_hash_entry(struct wb_cache_entry *e)
/* Public interface (see dm-cache-policy.h */
static int wb_map(struct dm_cache_policy *pe, dm_oblock_t oblock,
bool can_block, bool can_migrate, bool discarded_oblock,
struct bio *bio, struct policy_result *result)
struct bio *bio, struct policy_locker *locker,
struct policy_result *result)
{
struct policy *p = to_policy(pe);
struct wb_cache_entry *e;
@ -358,7 +359,8 @@ static struct wb_cache_entry *get_next_dirty_entry(struct policy *p)
static int wb_writeback_work(struct dm_cache_policy *pe,
dm_oblock_t *oblock,
dm_cblock_t *cblock)
dm_cblock_t *cblock,
bool critical_only)
{
int r = -ENOENT;
struct policy *p = to_policy(pe);

View File

@ -7,6 +7,7 @@
#ifndef DM_CACHE_POLICY_INTERNAL_H
#define DM_CACHE_POLICY_INTERNAL_H
#include <linux/vmalloc.h>
#include "dm-cache-policy.h"
/*----------------------------------------------------------------*/
@ -16,9 +17,10 @@
*/
static inline int policy_map(struct dm_cache_policy *p, dm_oblock_t oblock,
bool can_block, bool can_migrate, bool discarded_oblock,
struct bio *bio, struct policy_result *result)
struct bio *bio, struct policy_locker *locker,
struct policy_result *result)
{
return p->map(p, oblock, can_block, can_migrate, discarded_oblock, bio, result);
return p->map(p, oblock, can_block, can_migrate, discarded_oblock, bio, locker, result);
}
static inline int policy_lookup(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock)
@ -54,9 +56,10 @@ static inline int policy_walk_mappings(struct dm_cache_policy *p,
static inline int policy_writeback_work(struct dm_cache_policy *p,
dm_oblock_t *oblock,
dm_cblock_t *cblock)
dm_cblock_t *cblock,
bool critical_only)
{
return p->writeback_work ? p->writeback_work(p, oblock, cblock) : -ENOENT;
return p->writeback_work ? p->writeback_work(p, oblock, cblock, critical_only) : -ENOENT;
}
static inline void policy_remove_mapping(struct dm_cache_policy *p, dm_oblock_t oblock)
@ -80,19 +83,21 @@ static inline dm_cblock_t policy_residency(struct dm_cache_policy *p)
return p->residency(p);
}
static inline void policy_tick(struct dm_cache_policy *p)
static inline void policy_tick(struct dm_cache_policy *p, bool can_block)
{
if (p->tick)
return p->tick(p);
return p->tick(p, can_block);
}
static inline int policy_emit_config_values(struct dm_cache_policy *p, char *result, unsigned maxlen)
static inline int policy_emit_config_values(struct dm_cache_policy *p, char *result,
unsigned maxlen, ssize_t *sz_ptr)
{
ssize_t sz = 0;
ssize_t sz = *sz_ptr;
if (p->emit_config_values)
return p->emit_config_values(p, result, maxlen);
return p->emit_config_values(p, result, maxlen, sz_ptr);
DMEMIT("0");
DMEMIT("0 ");
*sz_ptr = sz;
return 0;
}
@ -104,6 +109,33 @@ static inline int policy_set_config_value(struct dm_cache_policy *p,
/*----------------------------------------------------------------*/
/*
* Some utility functions commonly used by policies and the core target.
*/
static inline size_t bitset_size_in_bytes(unsigned nr_entries)
{
return sizeof(unsigned long) * dm_div_up(nr_entries, BITS_PER_LONG);
}
static inline unsigned long *alloc_bitset(unsigned nr_entries)
{
size_t s = bitset_size_in_bytes(nr_entries);
return vzalloc(s);
}
static inline void clear_bitset(void *bitset, unsigned nr_entries)
{
size_t s = bitset_size_in_bytes(nr_entries);
memset(bitset, 0, s);
}
static inline void free_bitset(unsigned long *bits)
{
vfree(bits);
}
/*----------------------------------------------------------------*/
/*
* Creates a new cache policy given a policy name, a cache size, an origin size and the block size.
*/

View File

@ -693,9 +693,10 @@ static void requeue(struct mq_policy *mq, struct entry *e)
* - set the hit count to a hard coded value other than 1, eg, is it better
* if it goes in at level 2?
*/
static int demote_cblock(struct mq_policy *mq, dm_oblock_t *oblock)
static int demote_cblock(struct mq_policy *mq,
struct policy_locker *locker, dm_oblock_t *oblock)
{
struct entry *demoted = pop(mq, &mq->cache_clean);
struct entry *demoted = peek(&mq->cache_clean);
if (!demoted)
/*
@ -707,6 +708,13 @@ static int demote_cblock(struct mq_policy *mq, dm_oblock_t *oblock)
*/
return -ENOSPC;
if (locker->fn(locker, demoted->oblock))
/*
* We couldn't lock the demoted block.
*/
return -EBUSY;
del(mq, demoted);
*oblock = demoted->oblock;
free_entry(&mq->cache_pool, demoted);
@ -795,6 +803,7 @@ static int cache_entry_found(struct mq_policy *mq,
* finding which cache block to use.
*/
static int pre_cache_to_cache(struct mq_policy *mq, struct entry *e,
struct policy_locker *locker,
struct policy_result *result)
{
int r;
@ -803,11 +812,12 @@ static int pre_cache_to_cache(struct mq_policy *mq, struct entry *e,
/* Ensure there's a free cblock in the cache */
if (epool_empty(&mq->cache_pool)) {
result->op = POLICY_REPLACE;
r = demote_cblock(mq, &result->old_oblock);
r = demote_cblock(mq, locker, &result->old_oblock);
if (r) {
result->op = POLICY_MISS;
return 0;
}
} else
result->op = POLICY_NEW;
@ -829,7 +839,8 @@ static int pre_cache_to_cache(struct mq_policy *mq, struct entry *e,
static int pre_cache_entry_found(struct mq_policy *mq, struct entry *e,
bool can_migrate, bool discarded_oblock,
int data_dir, struct policy_result *result)
int data_dir, struct policy_locker *locker,
struct policy_result *result)
{
int r = 0;
@ -842,7 +853,7 @@ static int pre_cache_entry_found(struct mq_policy *mq, struct entry *e,
else {
requeue(mq, e);
r = pre_cache_to_cache(mq, e, result);
r = pre_cache_to_cache(mq, e, locker, result);
}
return r;
@ -872,6 +883,7 @@ static void insert_in_pre_cache(struct mq_policy *mq,
}
static void insert_in_cache(struct mq_policy *mq, dm_oblock_t oblock,
struct policy_locker *locker,
struct policy_result *result)
{
int r;
@ -879,7 +891,7 @@ static void insert_in_cache(struct mq_policy *mq, dm_oblock_t oblock,
if (epool_empty(&mq->cache_pool)) {
result->op = POLICY_REPLACE;
r = demote_cblock(mq, &result->old_oblock);
r = demote_cblock(mq, locker, &result->old_oblock);
if (unlikely(r)) {
result->op = POLICY_MISS;
insert_in_pre_cache(mq, oblock);
@ -907,11 +919,12 @@ static void insert_in_cache(struct mq_policy *mq, dm_oblock_t oblock,
static int no_entry_found(struct mq_policy *mq, dm_oblock_t oblock,
bool can_migrate, bool discarded_oblock,
int data_dir, struct policy_result *result)
int data_dir, struct policy_locker *locker,
struct policy_result *result)
{
if (adjusted_promote_threshold(mq, discarded_oblock, data_dir) <= 1) {
if (can_migrate)
insert_in_cache(mq, oblock, result);
insert_in_cache(mq, oblock, locker, result);
else
return -EWOULDBLOCK;
} else {
@ -928,7 +941,8 @@ static int no_entry_found(struct mq_policy *mq, dm_oblock_t oblock,
*/
static int map(struct mq_policy *mq, dm_oblock_t oblock,
bool can_migrate, bool discarded_oblock,
int data_dir, struct policy_result *result)
int data_dir, struct policy_locker *locker,
struct policy_result *result)
{
int r = 0;
struct entry *e = hash_lookup(mq, oblock);
@ -942,11 +956,11 @@ static int map(struct mq_policy *mq, dm_oblock_t oblock,
else if (e)
r = pre_cache_entry_found(mq, e, can_migrate, discarded_oblock,
data_dir, result);
data_dir, locker, result);
else
r = no_entry_found(mq, oblock, can_migrate, discarded_oblock,
data_dir, result);
data_dir, locker, result);
if (r == -EWOULDBLOCK)
result->op = POLICY_MISS;
@ -1012,7 +1026,8 @@ static void copy_tick(struct mq_policy *mq)
static int mq_map(struct dm_cache_policy *p, dm_oblock_t oblock,
bool can_block, bool can_migrate, bool discarded_oblock,
struct bio *bio, struct policy_result *result)
struct bio *bio, struct policy_locker *locker,
struct policy_result *result)
{
int r;
struct mq_policy *mq = to_mq_policy(p);
@ -1028,7 +1043,7 @@ static int mq_map(struct dm_cache_policy *p, dm_oblock_t oblock,
iot_examine_bio(&mq->tracker, bio);
r = map(mq, oblock, can_migrate, discarded_oblock,
bio_data_dir(bio), result);
bio_data_dir(bio), locker, result);
mutex_unlock(&mq->lock);
@ -1221,7 +1236,7 @@ static int __mq_writeback_work(struct mq_policy *mq, dm_oblock_t *oblock,
}
static int mq_writeback_work(struct dm_cache_policy *p, dm_oblock_t *oblock,
dm_cblock_t *cblock)
dm_cblock_t *cblock, bool critical_only)
{
int r;
struct mq_policy *mq = to_mq_policy(p);
@ -1268,7 +1283,7 @@ static dm_cblock_t mq_residency(struct dm_cache_policy *p)
return r;
}
static void mq_tick(struct dm_cache_policy *p)
static void mq_tick(struct dm_cache_policy *p, bool can_block)
{
struct mq_policy *mq = to_mq_policy(p);
unsigned long flags;
@ -1276,6 +1291,12 @@ static void mq_tick(struct dm_cache_policy *p)
spin_lock_irqsave(&mq->tick_lock, flags);
mq->tick_protected++;
spin_unlock_irqrestore(&mq->tick_lock, flags);
if (can_block) {
mutex_lock(&mq->lock);
copy_tick(mq);
mutex_unlock(&mq->lock);
}
}
static int mq_set_config_value(struct dm_cache_policy *p,
@ -1308,22 +1329,24 @@ static int mq_set_config_value(struct dm_cache_policy *p,
return 0;
}
static int mq_emit_config_values(struct dm_cache_policy *p, char *result, unsigned maxlen)
static int mq_emit_config_values(struct dm_cache_policy *p, char *result,
unsigned maxlen, ssize_t *sz_ptr)
{
ssize_t sz = 0;
ssize_t sz = *sz_ptr;
struct mq_policy *mq = to_mq_policy(p);
DMEMIT("10 random_threshold %u "
"sequential_threshold %u "
"discard_promote_adjustment %u "
"read_promote_adjustment %u "
"write_promote_adjustment %u",
"write_promote_adjustment %u ",
mq->tracker.thresholds[PATTERN_RANDOM],
mq->tracker.thresholds[PATTERN_SEQUENTIAL],
mq->discard_promote_adjustment,
mq->read_promote_adjustment,
mq->write_promote_adjustment);
*sz_ptr = sz;
return 0;
}
@ -1408,21 +1431,12 @@ static struct dm_cache_policy *mq_create(dm_cblock_t cache_size,
static struct dm_cache_policy_type mq_policy_type = {
.name = "mq",
.version = {1, 3, 0},
.version = {1, 4, 0},
.hint_size = 4,
.owner = THIS_MODULE,
.create = mq_create
};
static struct dm_cache_policy_type default_policy_type = {
.name = "default",
.version = {1, 3, 0},
.hint_size = 4,
.owner = THIS_MODULE,
.create = mq_create,
.real = &mq_policy_type
};
static int __init mq_init(void)
{
int r;
@ -1432,36 +1446,21 @@ static int __init mq_init(void)
__alignof__(struct entry),
0, NULL);
if (!mq_entry_cache)
goto bad;
return -ENOMEM;
r = dm_cache_policy_register(&mq_policy_type);
if (r) {
DMERR("register failed %d", r);
goto bad_register_mq;
kmem_cache_destroy(mq_entry_cache);
return -ENOMEM;
}
r = dm_cache_policy_register(&default_policy_type);
if (!r) {
DMINFO("version %u.%u.%u loaded",
mq_policy_type.version[0],
mq_policy_type.version[1],
mq_policy_type.version[2]);
return 0;
}
DMERR("register failed (as default) %d", r);
dm_cache_policy_unregister(&mq_policy_type);
bad_register_mq:
kmem_cache_destroy(mq_entry_cache);
bad:
return -ENOMEM;
return 0;
}
static void __exit mq_exit(void)
{
dm_cache_policy_unregister(&mq_policy_type);
dm_cache_policy_unregister(&default_policy_type);
kmem_cache_destroy(mq_entry_cache);
}

File diff suppressed because it is too large Load Diff

View File

@ -69,6 +69,18 @@ enum policy_operation {
POLICY_REPLACE
};
/*
* When issuing a POLICY_REPLACE the policy needs to make a callback to
* lock the block being demoted. This doesn't need to occur during a
* writeback operation since the block remains in the cache.
*/
struct policy_locker;
typedef int (*policy_lock_fn)(struct policy_locker *l, dm_oblock_t oblock);
struct policy_locker {
policy_lock_fn fn;
};
/*
* This is the instruction passed back to the core target.
*/
@ -122,7 +134,8 @@ struct dm_cache_policy {
*/
int (*map)(struct dm_cache_policy *p, dm_oblock_t oblock,
bool can_block, bool can_migrate, bool discarded_oblock,
struct bio *bio, struct policy_result *result);
struct bio *bio, struct policy_locker *locker,
struct policy_result *result);
/*
* Sometimes we want to see if a block is in the cache, without
@ -165,7 +178,9 @@ struct dm_cache_policy {
int (*remove_cblock)(struct dm_cache_policy *p, dm_cblock_t cblock);
/*
* Provide a dirty block to be written back by the core target.
* Provide a dirty block to be written back by the core target. If
* critical_only is set then the policy should only provide work if
* it urgently needs it.
*
* Returns:
*
@ -173,7 +188,8 @@ struct dm_cache_policy {
*
* -ENODATA: no dirty blocks available
*/
int (*writeback_work)(struct dm_cache_policy *p, dm_oblock_t *oblock, dm_cblock_t *cblock);
int (*writeback_work)(struct dm_cache_policy *p, dm_oblock_t *oblock, dm_cblock_t *cblock,
bool critical_only);
/*
* How full is the cache?
@ -184,16 +200,16 @@ struct dm_cache_policy {
* Because of where we sit in the block layer, we can be asked to
* map a lot of little bios that are all in the same block (no
* queue merging has occurred). To stop the policy being fooled by
* these the core target sends regular tick() calls to the policy.
* these, the core target sends regular tick() calls to the policy.
* The policy should only count an entry as hit once per tick.
*/
void (*tick)(struct dm_cache_policy *p);
void (*tick)(struct dm_cache_policy *p, bool can_block);
/*
* Configuration.
*/
int (*emit_config_values)(struct dm_cache_policy *p,
char *result, unsigned maxlen);
int (*emit_config_values)(struct dm_cache_policy *p, char *result,
unsigned maxlen, ssize_t *sz_ptr);
int (*set_config_value)(struct dm_cache_policy *p,
const char *key, const char *value);

File diff suppressed because it is too large Load Diff

View File

@ -1,7 +1,7 @@
/*
* Copyright (C) 2003 Jana Saout <jana@saout.de>
* Copyright (C) 2004 Clemens Fruhwirth <clemens@endorphin.org>
* Copyright (C) 2006-2009 Red Hat, Inc. All rights reserved.
* Copyright (C) 2006-2015 Red Hat, Inc. All rights reserved.
* Copyright (C) 2013 Milan Broz <gmazyland@gmail.com>
*
* This file is released under the GPL.
@ -891,6 +891,11 @@ static void crypt_alloc_req(struct crypt_config *cc,
ctx->req = mempool_alloc(cc->req_pool, GFP_NOIO);
ablkcipher_request_set_tfm(ctx->req, cc->tfms[key_index]);
/*
* Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
* requests if driver request queue is full.
*/
ablkcipher_request_set_callback(ctx->req,
CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
kcryptd_async_done, dmreq_of_req(cc, ctx->req));
@ -924,24 +929,32 @@ static int crypt_convert(struct crypt_config *cc,
r = crypt_convert_block(cc, ctx, ctx->req);
switch (r) {
/* async */
/*
* The request was queued by a crypto driver
* but the driver request queue is full, let's wait.
*/
case -EBUSY:
wait_for_completion(&ctx->restart);
reinit_completion(&ctx->restart);
/* fall through*/
/* fall through */
/*
* The request is queued and processed asynchronously,
* completion function kcryptd_async_done() will be called.
*/
case -EINPROGRESS:
ctx->req = NULL;
ctx->cc_sector++;
continue;
/* sync */
/*
* The request was already processed (synchronously).
*/
case 0:
atomic_dec(&ctx->cc_pending);
ctx->cc_sector++;
cond_resched();
continue;
/* error */
/* There was an error while processing the request. */
default:
atomic_dec(&ctx->cc_pending);
return r;
@ -1346,6 +1359,11 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
struct dm_crypt_io *io = container_of(ctx, struct dm_crypt_io, ctx);
struct crypt_config *cc = io->cc;
/*
* A request from crypto driver backlog is going to be processed now,
* finish the completion and continue in crypt_convert().
* (Callback will be called for the second time for this request.)
*/
if (error == -EINPROGRESS) {
complete(&ctx->restart);
return;

View File

@ -55,8 +55,8 @@
#define LOG_DISCARD_FLAG (1 << 2)
#define LOG_MARK_FLAG (1 << 3)
#define WRITE_LOG_VERSION 1
#define WRITE_LOG_MAGIC 0x6a736677736872
#define WRITE_LOG_VERSION 1ULL
#define WRITE_LOG_MAGIC 0x6a736677736872ULL
/*
* The disk format for this is braindead simple.

View File

@ -1,6 +1,6 @@
/*
* Copyright (C) 2010-2011 Neil Brown
* Copyright (C) 2010-2014 Red Hat, Inc. All rights reserved.
* Copyright (C) 2010-2015 Red Hat, Inc. All rights reserved.
*
* This file is released under the GPL.
*/
@ -17,6 +17,7 @@
#include <linux/device-mapper.h>
#define DM_MSG_PREFIX "raid"
#define MAX_RAID_DEVICES 253 /* raid4/5/6 limit */
static bool devices_handle_discard_safely = false;
@ -45,25 +46,25 @@ struct raid_dev {
};
/*
* Flags for rs->print_flags field.
* Flags for rs->ctr_flags field.
*/
#define DMPF_SYNC 0x1
#define DMPF_NOSYNC 0x2
#define DMPF_REBUILD 0x4
#define DMPF_DAEMON_SLEEP 0x8
#define DMPF_MIN_RECOVERY_RATE 0x10
#define DMPF_MAX_RECOVERY_RATE 0x20
#define DMPF_MAX_WRITE_BEHIND 0x40
#define DMPF_STRIPE_CACHE 0x80
#define DMPF_REGION_SIZE 0x100
#define DMPF_RAID10_COPIES 0x200
#define DMPF_RAID10_FORMAT 0x400
#define CTR_FLAG_SYNC 0x1
#define CTR_FLAG_NOSYNC 0x2
#define CTR_FLAG_REBUILD 0x4
#define CTR_FLAG_DAEMON_SLEEP 0x8
#define CTR_FLAG_MIN_RECOVERY_RATE 0x10
#define CTR_FLAG_MAX_RECOVERY_RATE 0x20
#define CTR_FLAG_MAX_WRITE_BEHIND 0x40
#define CTR_FLAG_STRIPE_CACHE 0x80
#define CTR_FLAG_REGION_SIZE 0x100
#define CTR_FLAG_RAID10_COPIES 0x200
#define CTR_FLAG_RAID10_FORMAT 0x400
struct raid_set {
struct dm_target *ti;
uint32_t bitmap_loaded;
uint32_t print_flags;
uint32_t ctr_flags;
struct mddev md;
struct raid_type *raid_type;
@ -81,6 +82,7 @@ static struct raid_type {
const unsigned level; /* RAID level. */
const unsigned algorithm; /* RAID algorithm. */
} raid_types[] = {
{"raid0", "RAID0 (striping)", 0, 2, 0, 0 /* NONE */},
{"raid1", "RAID1 (mirroring)", 0, 2, 1, 0 /* NONE */},
{"raid10", "RAID10 (striped mirrors)", 0, 2, 10, UINT_MAX /* Varies */},
{"raid4", "RAID4 (dedicated parity disk)", 1, 2, 5, ALGORITHM_PARITY_0},
@ -119,15 +121,15 @@ static int raid10_format_to_md_layout(char *format, unsigned copies)
{
unsigned n = 1, f = 1;
if (!strcmp("near", format))
if (!strcasecmp("near", format))
n = copies;
else
f = copies;
if (!strcmp("offset", format))
if (!strcasecmp("offset", format))
return 0x30000 | (f << 8) | n;
if (!strcmp("far", format))
if (!strcasecmp("far", format))
return 0x20000 | (f << 8) | n;
return (f << 8) | n;
@ -477,8 +479,6 @@ static int validate_raid_redundancy(struct raid_set *rs)
* will form the "stripe"
* [[no]sync] Force or prevent recovery of the
* entire array
* [devices_handle_discard_safely] Allow discards on RAID4/5/6; useful if RAID
* member device(s) properly support TRIM/UNMAP
* [rebuild <idx>] Rebuild the drive indicated by the index
* [daemon_sleep <ms>] Time between bitmap daemon work to
* clear bits
@ -555,12 +555,12 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
for (i = 0; i < num_raid_params; i++) {
if (!strcasecmp(argv[i], "nosync")) {
rs->md.recovery_cp = MaxSector;
rs->print_flags |= DMPF_NOSYNC;
rs->ctr_flags |= CTR_FLAG_NOSYNC;
continue;
}
if (!strcasecmp(argv[i], "sync")) {
rs->md.recovery_cp = 0;
rs->print_flags |= DMPF_SYNC;
rs->ctr_flags |= CTR_FLAG_SYNC;
continue;
}
@ -585,7 +585,7 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
return -EINVAL;
}
raid10_format = argv[i];
rs->print_flags |= DMPF_RAID10_FORMAT;
rs->ctr_flags |= CTR_FLAG_RAID10_FORMAT;
continue;
}
@ -602,7 +602,7 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
}
clear_bit(In_sync, &rs->dev[value].rdev.flags);
rs->dev[value].rdev.recovery_offset = 0;
rs->print_flags |= DMPF_REBUILD;
rs->ctr_flags |= CTR_FLAG_REBUILD;
} else if (!strcasecmp(key, "write_mostly")) {
if (rs->raid_type->level != 1) {
rs->ti->error = "write_mostly option is only valid for RAID1";
@ -618,7 +618,7 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
rs->ti->error = "max_write_behind option is only valid for RAID1";
return -EINVAL;
}
rs->print_flags |= DMPF_MAX_WRITE_BEHIND;
rs->ctr_flags |= CTR_FLAG_MAX_WRITE_BEHIND;
/*
* In device-mapper, we specify things in sectors, but
@ -631,14 +631,14 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
}
rs->md.bitmap_info.max_write_behind = value;
} else if (!strcasecmp(key, "daemon_sleep")) {
rs->print_flags |= DMPF_DAEMON_SLEEP;
rs->ctr_flags |= CTR_FLAG_DAEMON_SLEEP;
if (!value || (value > MAX_SCHEDULE_TIMEOUT)) {
rs->ti->error = "daemon sleep period out of range";
return -EINVAL;
}
rs->md.bitmap_info.daemon_sleep = value;
} else if (!strcasecmp(key, "stripe_cache")) {
rs->print_flags |= DMPF_STRIPE_CACHE;
rs->ctr_flags |= CTR_FLAG_STRIPE_CACHE;
/*
* In device-mapper, we specify things in sectors, but
@ -656,21 +656,21 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
return -EINVAL;
}
} else if (!strcasecmp(key, "min_recovery_rate")) {
rs->print_flags |= DMPF_MIN_RECOVERY_RATE;
rs->ctr_flags |= CTR_FLAG_MIN_RECOVERY_RATE;
if (value > INT_MAX) {
rs->ti->error = "min_recovery_rate out of range";
return -EINVAL;
}
rs->md.sync_speed_min = (int)value;
} else if (!strcasecmp(key, "max_recovery_rate")) {
rs->print_flags |= DMPF_MAX_RECOVERY_RATE;
rs->ctr_flags |= CTR_FLAG_MAX_RECOVERY_RATE;
if (value > INT_MAX) {
rs->ti->error = "max_recovery_rate out of range";
return -EINVAL;
}
rs->md.sync_speed_max = (int)value;
} else if (!strcasecmp(key, "region_size")) {
rs->print_flags |= DMPF_REGION_SIZE;
rs->ctr_flags |= CTR_FLAG_REGION_SIZE;
region_size = value;
} else if (!strcasecmp(key, "raid10_copies") &&
(rs->raid_type->level == 10)) {
@ -678,7 +678,7 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
rs->ti->error = "Bad value for 'raid10_copies'";
return -EINVAL;
}
rs->print_flags |= DMPF_RAID10_COPIES;
rs->ctr_flags |= CTR_FLAG_RAID10_COPIES;
raid10_copies = value;
} else {
DMERR("Unable to parse RAID parameter: %s", key);
@ -720,7 +720,7 @@ static int parse_raid_params(struct raid_set *rs, char **argv,
rs->md.layout = raid10_format_to_md_layout(raid10_format,
raid10_copies);
rs->md.new_layout = rs->md.layout;
} else if ((rs->raid_type->level > 1) &&
} else if ((!rs->raid_type->level || rs->raid_type->level > 1) &&
sector_div(sectors_per_dev,
(rs->md.raid_disks - rs->raid_type->parity_devs))) {
rs->ti->error = "Target length not divisible by number of data devices";
@ -947,7 +947,7 @@ static int super_init_validation(struct mddev *mddev, struct md_rdev *rdev)
return -EINVAL;
}
if (!(rs->print_flags & (DMPF_SYNC | DMPF_NOSYNC)))
if (!(rs->ctr_flags & (CTR_FLAG_SYNC | CTR_FLAG_NOSYNC)))
mddev->recovery_cp = le64_to_cpu(sb->array_resync_offset);
/*
@ -1026,8 +1026,9 @@ static int super_init_validation(struct mddev *mddev, struct md_rdev *rdev)
return 0;
}
static int super_validate(struct mddev *mddev, struct md_rdev *rdev)
static int super_validate(struct raid_set *rs, struct md_rdev *rdev)
{
struct mddev *mddev = &rs->md;
struct dm_raid_superblock *sb = page_address(rdev->sb_page);
/*
@ -1037,8 +1038,10 @@ static int super_validate(struct mddev *mddev, struct md_rdev *rdev)
if (!mddev->events && super_init_validation(mddev, rdev))
return -EINVAL;
mddev->bitmap_info.offset = 4096 >> 9; /* Enable bitmap creation */
rdev->mddev->bitmap_info.default_offset = 4096 >> 9;
/* Enable bitmap creation for RAID levels != 0 */
mddev->bitmap_info.offset = (rs->raid_type->level) ? to_sector(4096) : 0;
rdev->mddev->bitmap_info.default_offset = mddev->bitmap_info.offset;
if (!test_bit(FirstUse, &rdev->flags)) {
rdev->recovery_offset = le64_to_cpu(sb->disk_recovery_offset);
if (rdev->recovery_offset != MaxSector)
@ -1073,7 +1076,7 @@ static int analyse_superblocks(struct dm_target *ti, struct raid_set *rs)
freshest = NULL;
rdev_for_each_safe(rdev, tmp, mddev) {
/*
* Skipping super_load due to DMPF_SYNC will cause
* Skipping super_load due to CTR_FLAG_SYNC will cause
* the array to undergo initialization again as
* though it were new. This is the intended effect
* of the "sync" directive.
@ -1082,7 +1085,9 @@ static int analyse_superblocks(struct dm_target *ti, struct raid_set *rs)
* that the "sync" directive is disallowed during the
* reshape.
*/
if (rs->print_flags & DMPF_SYNC)
rdev->sectors = to_sector(i_size_read(rdev->bdev->bd_inode));
if (rs->ctr_flags & CTR_FLAG_SYNC)
continue;
if (!rdev->meta_bdev)
@ -1140,11 +1145,11 @@ static int analyse_superblocks(struct dm_target *ti, struct raid_set *rs)
* validation for the remaining devices.
*/
ti->error = "Unable to assemble array: Invalid superblocks";
if (super_validate(mddev, freshest))
if (super_validate(rs, freshest))
return -EINVAL;
rdev_for_each(rdev, mddev)
if ((rdev != freshest) && super_validate(mddev, rdev))
if ((rdev != freshest) && super_validate(rs, rdev))
return -EINVAL;
return 0;
@ -1243,7 +1248,7 @@ static int raid_ctr(struct dm_target *ti, unsigned argc, char **argv)
}
if ((kstrtoul(argv[num_raid_params], 10, &num_raid_devs) < 0) ||
(num_raid_devs >= INT_MAX)) {
(num_raid_devs > MAX_RAID_DEVICES)) {
ti->error = "Cannot understand number of raid devices";
return -EINVAL;
}
@ -1282,10 +1287,11 @@ static int raid_ctr(struct dm_target *ti, unsigned argc, char **argv)
*/
configure_discard_support(ti, rs);
mutex_lock(&rs->md.reconfig_mutex);
/* Has to be held on running the array */
mddev_lock_nointr(&rs->md);
ret = md_run(&rs->md);
rs->md.in_sync = 0; /* Assume already marked dirty */
mutex_unlock(&rs->md.reconfig_mutex);
mddev_unlock(&rs->md);
if (ret) {
ti->error = "Fail to run raid array";
@ -1368,34 +1374,40 @@ static void raid_status(struct dm_target *ti, status_type_t type,
case STATUSTYPE_INFO:
DMEMIT("%s %d ", rs->raid_type->name, rs->md.raid_disks);
if (test_bit(MD_RECOVERY_RUNNING, &rs->md.recovery))
sync = rs->md.curr_resync_completed;
else
sync = rs->md.recovery_cp;
if (rs->raid_type->level) {
if (test_bit(MD_RECOVERY_RUNNING, &rs->md.recovery))
sync = rs->md.curr_resync_completed;
else
sync = rs->md.recovery_cp;
if (sync >= rs->md.resync_max_sectors) {
/*
* Sync complete.
*/
if (sync >= rs->md.resync_max_sectors) {
/*
* Sync complete.
*/
array_in_sync = 1;
sync = rs->md.resync_max_sectors;
} else if (test_bit(MD_RECOVERY_REQUESTED, &rs->md.recovery)) {
/*
* If "check" or "repair" is occurring, the array has
* undergone and initial sync and the health characters
* should not be 'a' anymore.
*/
array_in_sync = 1;
} else {
/*
* The array may be doing an initial sync, or it may
* be rebuilding individual components. If all the
* devices are In_sync, then it is the array that is
* being initialized.
*/
for (i = 0; i < rs->md.raid_disks; i++)
if (!test_bit(In_sync, &rs->dev[i].rdev.flags))
array_in_sync = 1;
}
} else {
/* RAID0 */
array_in_sync = 1;
sync = rs->md.resync_max_sectors;
} else if (test_bit(MD_RECOVERY_REQUESTED, &rs->md.recovery)) {
/*
* If "check" or "repair" is occurring, the array has
* undergone and initial sync and the health characters
* should not be 'a' anymore.
*/
array_in_sync = 1;
} else {
/*
* The array may be doing an initial sync, or it may
* be rebuilding individual components. If all the
* devices are In_sync, then it is the array that is
* being initialized.
*/
for (i = 0; i < rs->md.raid_disks; i++)
if (!test_bit(In_sync, &rs->dev[i].rdev.flags))
array_in_sync = 1;
}
/*
@ -1446,7 +1458,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
case STATUSTYPE_TABLE:
/* The string you would use to construct this array */
for (i = 0; i < rs->md.raid_disks; i++) {
if ((rs->print_flags & DMPF_REBUILD) &&
if ((rs->ctr_flags & CTR_FLAG_REBUILD) &&
rs->dev[i].data_dev &&
!test_bit(In_sync, &rs->dev[i].rdev.flags))
raid_param_cnt += 2; /* for rebuilds */
@ -1455,33 +1467,33 @@ static void raid_status(struct dm_target *ti, status_type_t type,
raid_param_cnt += 2;
}
raid_param_cnt += (hweight32(rs->print_flags & ~DMPF_REBUILD) * 2);
if (rs->print_flags & (DMPF_SYNC | DMPF_NOSYNC))
raid_param_cnt += (hweight32(rs->ctr_flags & ~CTR_FLAG_REBUILD) * 2);
if (rs->ctr_flags & (CTR_FLAG_SYNC | CTR_FLAG_NOSYNC))
raid_param_cnt--;
DMEMIT("%s %u %u", rs->raid_type->name,
raid_param_cnt, rs->md.chunk_sectors);
if ((rs->print_flags & DMPF_SYNC) &&
if ((rs->ctr_flags & CTR_FLAG_SYNC) &&
(rs->md.recovery_cp == MaxSector))
DMEMIT(" sync");
if (rs->print_flags & DMPF_NOSYNC)
if (rs->ctr_flags & CTR_FLAG_NOSYNC)
DMEMIT(" nosync");
for (i = 0; i < rs->md.raid_disks; i++)
if ((rs->print_flags & DMPF_REBUILD) &&
if ((rs->ctr_flags & CTR_FLAG_REBUILD) &&
rs->dev[i].data_dev &&
!test_bit(In_sync, &rs->dev[i].rdev.flags))
DMEMIT(" rebuild %u", i);
if (rs->print_flags & DMPF_DAEMON_SLEEP)
if (rs->ctr_flags & CTR_FLAG_DAEMON_SLEEP)
DMEMIT(" daemon_sleep %lu",
rs->md.bitmap_info.daemon_sleep);
if (rs->print_flags & DMPF_MIN_RECOVERY_RATE)
if (rs->ctr_flags & CTR_FLAG_MIN_RECOVERY_RATE)
DMEMIT(" min_recovery_rate %d", rs->md.sync_speed_min);
if (rs->print_flags & DMPF_MAX_RECOVERY_RATE)
if (rs->ctr_flags & CTR_FLAG_MAX_RECOVERY_RATE)
DMEMIT(" max_recovery_rate %d", rs->md.sync_speed_max);
for (i = 0; i < rs->md.raid_disks; i++)
@ -1489,11 +1501,11 @@ static void raid_status(struct dm_target *ti, status_type_t type,
test_bit(WriteMostly, &rs->dev[i].rdev.flags))
DMEMIT(" write_mostly %u", i);
if (rs->print_flags & DMPF_MAX_WRITE_BEHIND)
if (rs->ctr_flags & CTR_FLAG_MAX_WRITE_BEHIND)
DMEMIT(" max_write_behind %lu",
rs->md.bitmap_info.max_write_behind);
if (rs->print_flags & DMPF_STRIPE_CACHE) {
if (rs->ctr_flags & CTR_FLAG_STRIPE_CACHE) {
struct r5conf *conf = rs->md.private;
/* convert from kiB to sectors */
@ -1501,15 +1513,15 @@ static void raid_status(struct dm_target *ti, status_type_t type,
conf ? conf->max_nr_stripes * 2 : 0);
}
if (rs->print_flags & DMPF_REGION_SIZE)
if (rs->ctr_flags & CTR_FLAG_REGION_SIZE)
DMEMIT(" region_size %lu",
rs->md.bitmap_info.chunksize >> 9);
if (rs->print_flags & DMPF_RAID10_COPIES)
if (rs->ctr_flags & CTR_FLAG_RAID10_COPIES)
DMEMIT(" raid10_copies %u",
raid10_md_layout_to_copies(rs->md.layout));
if (rs->print_flags & DMPF_RAID10_FORMAT)
if (rs->ctr_flags & CTR_FLAG_RAID10_FORMAT)
DMEMIT(" raid10_format %s",
raid10_md_layout_to_format(rs->md.layout));
@ -1684,26 +1696,48 @@ static void raid_resume(struct dm_target *ti)
{
struct raid_set *rs = ti->private;
set_bit(MD_CHANGE_DEVS, &rs->md.flags);
if (!rs->bitmap_loaded) {
bitmap_load(&rs->md);
rs->bitmap_loaded = 1;
} else {
/*
* A secondary resume while the device is active.
* Take this opportunity to check whether any failed
* devices are reachable again.
*/
attempt_restore_of_faulty_devices(rs);
if (rs->raid_type->level) {
set_bit(MD_CHANGE_DEVS, &rs->md.flags);
if (!rs->bitmap_loaded) {
bitmap_load(&rs->md);
rs->bitmap_loaded = 1;
} else {
/*
* A secondary resume while the device is active.
* Take this opportunity to check whether any failed
* devices are reachable again.
*/
attempt_restore_of_faulty_devices(rs);
}
clear_bit(MD_RECOVERY_FROZEN, &rs->md.recovery);
}
clear_bit(MD_RECOVERY_FROZEN, &rs->md.recovery);
mddev_resume(&rs->md);
}
static int raid_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
struct bio_vec *biovec, int max_size)
{
struct raid_set *rs = ti->private;
struct md_personality *pers = rs->md.pers;
if (pers && pers->mergeable_bvec)
return min(max_size, pers->mergeable_bvec(&rs->md, bvm, biovec));
/*
* In case we can't request the personality because
* the raid set is not running yet
*
* -> return safe minimum
*/
return rs->md.chunk_sectors;
}
static struct target_type raid_target = {
.name = "raid",
.version = {1, 6, 0},
.version = {1, 7, 0},
.module = THIS_MODULE,
.ctr = raid_ctr,
.dtr = raid_dtr,
@ -1715,6 +1749,7 @@ static struct target_type raid_target = {
.presuspend = raid_presuspend,
.postsuspend = raid_postsuspend,
.resume = raid_resume,
.merge = raid_merge,
};
static int __init dm_raid_init(void)

View File

@ -23,8 +23,10 @@
#define MAX_RECOVERY 1 /* Maximum number of regions recovered in parallel. */
#define DM_RAID1_HANDLE_ERRORS 0x01
#define DM_RAID1_HANDLE_ERRORS 0x01
#define DM_RAID1_KEEP_LOG 0x02
#define errors_handled(p) ((p)->features & DM_RAID1_HANDLE_ERRORS)
#define keep_log(p) ((p)->features & DM_RAID1_KEEP_LOG)
static DECLARE_WAIT_QUEUE_HEAD(_kmirrord_recovery_stopped);
@ -229,7 +231,7 @@ static void fail_mirror(struct mirror *m, enum dm_raid1_error error_type)
if (m != get_default_mirror(ms))
goto out;
if (!ms->in_sync) {
if (!ms->in_sync && !keep_log(ms)) {
/*
* Better to issue requests to same failing device
* than to risk returning corrupt data.
@ -370,6 +372,17 @@ static int recover(struct mirror_set *ms, struct dm_region *reg)
return r;
}
static void reset_ms_flags(struct mirror_set *ms)
{
unsigned int m;
ms->leg_failure = 0;
for (m = 0; m < ms->nr_mirrors; m++) {
atomic_set(&(ms->mirror[m].error_count), 0);
ms->mirror[m].error_type = 0;
}
}
static void do_recovery(struct mirror_set *ms)
{
struct dm_region *reg;
@ -398,6 +411,7 @@ static void do_recovery(struct mirror_set *ms)
/* the sync is complete */
dm_table_event(ms->ti->table);
ms->in_sync = 1;
reset_ms_flags(ms);
}
}
@ -759,7 +773,7 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes)
dm_rh_delay(ms->rh, bio);
while ((bio = bio_list_pop(&nosync))) {
if (unlikely(ms->leg_failure) && errors_handled(ms)) {
if (unlikely(ms->leg_failure) && errors_handled(ms) && !keep_log(ms)) {
spin_lock_irq(&ms->lock);
bio_list_add(&ms->failures, bio);
spin_unlock_irq(&ms->lock);
@ -803,15 +817,21 @@ static void do_failures(struct mirror_set *ms, struct bio_list *failures)
/*
* If all the legs are dead, fail the I/O.
* If we have been told to handle errors, hold the bio
* and wait for userspace to deal with the problem.
* If the device has failed and keep_log is enabled,
* fail the I/O.
*
* If we have been told to handle errors, and keep_log
* isn't enabled, hold the bio and wait for userspace to
* deal with the problem.
*
* Otherwise pretend that the I/O succeeded. (This would
* be wrong if the failed leg returned after reboot and
* got replicated back to the good legs.)
*/
if (!get_valid_mirror(ms))
if (unlikely(!get_valid_mirror(ms) || (keep_log(ms) && ms->log_failure)))
bio_endio(bio, -EIO);
else if (errors_handled(ms))
else if (errors_handled(ms) && !keep_log(ms))
hold_bio(ms, bio);
else
bio_endio(bio, 0);
@ -987,6 +1007,7 @@ static int parse_features(struct mirror_set *ms, unsigned argc, char **argv,
unsigned num_features;
struct dm_target *ti = ms->ti;
char dummy;
int i;
*args_used = 0;
@ -1007,15 +1028,25 @@ static int parse_features(struct mirror_set *ms, unsigned argc, char **argv,
return -EINVAL;
}
if (!strcmp("handle_errors", argv[0]))
ms->features |= DM_RAID1_HANDLE_ERRORS;
else {
ti->error = "Unrecognised feature requested";
for (i = 0; i < num_features; i++) {
if (!strcmp("handle_errors", argv[0]))
ms->features |= DM_RAID1_HANDLE_ERRORS;
else if (!strcmp("keep_log", argv[0]))
ms->features |= DM_RAID1_KEEP_LOG;
else {
ti->error = "Unrecognised feature requested";
return -EINVAL;
}
argc--;
argv++;
(*args_used)++;
}
if (!errors_handled(ms) && keep_log(ms)) {
ti->error = "keep_log feature requires the handle_errors feature";
return -EINVAL;
}
(*args_used)++;
return 0;
}
@ -1029,7 +1060,7 @@ static int parse_features(struct mirror_set *ms, unsigned argc, char **argv,
* log_type is "core" or "disk"
* #log_params is between 1 and 3
*
* If present, features must be "handle_errors".
* If present, supported features are "handle_errors" and "keep_log".
*/
static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
@ -1363,6 +1394,7 @@ static void mirror_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
unsigned int m, sz = 0;
int num_feature_args = 0;
struct mirror_set *ms = (struct mirror_set *) ti->private;
struct dm_dirty_log *log = dm_rh_dirty_log(ms->rh);
char buffer[ms->nr_mirrors + 1];
@ -1392,8 +1424,17 @@ static void mirror_status(struct dm_target *ti, status_type_t type,
DMEMIT(" %s %llu", ms->mirror[m].dev->name,
(unsigned long long)ms->mirror[m].offset);
if (ms->features & DM_RAID1_HANDLE_ERRORS)
DMEMIT(" 1 handle_errors");
num_feature_args += !!errors_handled(ms);
num_feature_args += !!keep_log(ms);
if (num_feature_args) {
DMEMIT(" %d", num_feature_args);
if (errors_handled(ms))
DMEMIT(" handle_errors");
if (keep_log(ms))
DMEMIT(" keep_log");
}
break;
}
}
@ -1413,7 +1454,7 @@ static int mirror_iterate_devices(struct dm_target *ti,
static struct target_type mirror_target = {
.name = "mirror",
.version = {1, 13, 2},
.version = {1, 14, 0},
.module = THIS_MODULE,
.ctr = mirror_ctr,
.dtr = mirror_dtr,

View File

@ -29,30 +29,37 @@ struct dm_stat_percpu {
unsigned long long io_ticks[2];
unsigned long long io_ticks_total;
unsigned long long time_in_queue;
unsigned long long *histogram;
};
struct dm_stat_shared {
atomic_t in_flight[2];
unsigned long stamp;
unsigned long long stamp;
struct dm_stat_percpu tmp;
};
struct dm_stat {
struct list_head list_entry;
int id;
unsigned stat_flags;
size_t n_entries;
sector_t start;
sector_t end;
sector_t step;
unsigned n_histogram_entries;
unsigned long long *histogram_boundaries;
const char *program_id;
const char *aux_data;
struct rcu_head rcu_head;
size_t shared_alloc_size;
size_t percpu_alloc_size;
size_t histogram_alloc_size;
struct dm_stat_percpu *stat_percpu[NR_CPUS];
struct dm_stat_shared stat_shared[0];
};
#define STAT_PRECISE_TIMESTAMPS 1
struct dm_stats_last_position {
sector_t last_sector;
unsigned last_rw;
@ -160,10 +167,7 @@ static void dm_kvfree(void *ptr, size_t alloc_size)
free_shared_memory(alloc_size);
if (is_vmalloc_addr(ptr))
vfree(ptr);
else
kfree(ptr);
kvfree(ptr);
}
static void dm_stat_free(struct rcu_head *head)
@ -173,8 +177,11 @@ static void dm_stat_free(struct rcu_head *head)
kfree(s->program_id);
kfree(s->aux_data);
for_each_possible_cpu(cpu)
for_each_possible_cpu(cpu) {
dm_kvfree(s->stat_percpu[cpu][0].histogram, s->histogram_alloc_size);
dm_kvfree(s->stat_percpu[cpu], s->percpu_alloc_size);
}
dm_kvfree(s->stat_shared[0].tmp.histogram, s->histogram_alloc_size);
dm_kvfree(s, s->shared_alloc_size);
}
@ -227,7 +234,10 @@ void dm_stats_cleanup(struct dm_stats *stats)
}
static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
sector_t step, const char *program_id, const char *aux_data,
sector_t step, unsigned stat_flags,
unsigned n_histogram_entries,
unsigned long long *histogram_boundaries,
const char *program_id, const char *aux_data,
void (*suspend_callback)(struct mapped_device *),
void (*resume_callback)(struct mapped_device *),
struct mapped_device *md)
@ -238,6 +248,7 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
size_t ni;
size_t shared_alloc_size;
size_t percpu_alloc_size;
size_t histogram_alloc_size;
struct dm_stat_percpu *p;
int cpu;
int ret_id;
@ -261,19 +272,34 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
if (percpu_alloc_size / sizeof(struct dm_stat_percpu) != n_entries)
return -EOVERFLOW;
if (!check_shared_memory(shared_alloc_size + num_possible_cpus() * percpu_alloc_size))
histogram_alloc_size = (n_histogram_entries + 1) * (size_t)n_entries * sizeof(unsigned long long);
if (histogram_alloc_size / (n_histogram_entries + 1) != (size_t)n_entries * sizeof(unsigned long long))
return -EOVERFLOW;
if (!check_shared_memory(shared_alloc_size + histogram_alloc_size +
num_possible_cpus() * (percpu_alloc_size + histogram_alloc_size)))
return -ENOMEM;
s = dm_kvzalloc(shared_alloc_size, NUMA_NO_NODE);
if (!s)
return -ENOMEM;
s->stat_flags = stat_flags;
s->n_entries = n_entries;
s->start = start;
s->end = end;
s->step = step;
s->shared_alloc_size = shared_alloc_size;
s->percpu_alloc_size = percpu_alloc_size;
s->histogram_alloc_size = histogram_alloc_size;
s->n_histogram_entries = n_histogram_entries;
s->histogram_boundaries = kmemdup(histogram_boundaries,
s->n_histogram_entries * sizeof(unsigned long long), GFP_KERNEL);
if (!s->histogram_boundaries) {
r = -ENOMEM;
goto out;
}
s->program_id = kstrdup(program_id, GFP_KERNEL);
if (!s->program_id) {
@ -291,6 +317,19 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
atomic_set(&s->stat_shared[ni].in_flight[WRITE], 0);
}
if (s->n_histogram_entries) {
unsigned long long *hi;
hi = dm_kvzalloc(s->histogram_alloc_size, NUMA_NO_NODE);
if (!hi) {
r = -ENOMEM;
goto out;
}
for (ni = 0; ni < n_entries; ni++) {
s->stat_shared[ni].tmp.histogram = hi;
hi += s->n_histogram_entries + 1;
}
}
for_each_possible_cpu(cpu) {
p = dm_kvzalloc(percpu_alloc_size, cpu_to_node(cpu));
if (!p) {
@ -298,6 +337,18 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
goto out;
}
s->stat_percpu[cpu] = p;
if (s->n_histogram_entries) {
unsigned long long *hi;
hi = dm_kvzalloc(s->histogram_alloc_size, cpu_to_node(cpu));
if (!hi) {
r = -ENOMEM;
goto out;
}
for (ni = 0; ni < n_entries; ni++) {
p[ni].histogram = hi;
hi += s->n_histogram_entries + 1;
}
}
}
/*
@ -375,9 +426,11 @@ static int dm_stats_delete(struct dm_stats *stats, int id)
* vfree can't be called from RCU callback
*/
for_each_possible_cpu(cpu)
if (is_vmalloc_addr(s->stat_percpu))
if (is_vmalloc_addr(s->stat_percpu) ||
is_vmalloc_addr(s->stat_percpu[cpu][0].histogram))
goto do_sync_free;
if (is_vmalloc_addr(s)) {
if (is_vmalloc_addr(s) ||
is_vmalloc_addr(s->stat_shared[0].tmp.histogram)) {
do_sync_free:
synchronize_rcu_expedited();
dm_stat_free(&s->rcu_head);
@ -417,18 +470,24 @@ static int dm_stats_list(struct dm_stats *stats, const char *program,
return 1;
}
static void dm_stat_round(struct dm_stat_shared *shared, struct dm_stat_percpu *p)
static void dm_stat_round(struct dm_stat *s, struct dm_stat_shared *shared,
struct dm_stat_percpu *p)
{
/*
* This is racy, but so is part_round_stats_single.
*/
unsigned long now = jiffies;
unsigned in_flight_read;
unsigned in_flight_write;
unsigned long difference = now - shared->stamp;
unsigned long long now, difference;
unsigned in_flight_read, in_flight_write;
if (likely(!(s->stat_flags & STAT_PRECISE_TIMESTAMPS)))
now = jiffies;
else
now = ktime_to_ns(ktime_get());
difference = now - shared->stamp;
if (!difference)
return;
in_flight_read = (unsigned)atomic_read(&shared->in_flight[READ]);
in_flight_write = (unsigned)atomic_read(&shared->in_flight[WRITE]);
if (in_flight_read)
@ -443,8 +502,9 @@ static void dm_stat_round(struct dm_stat_shared *shared, struct dm_stat_percpu *
}
static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
unsigned long bi_rw, sector_t len, bool merged,
bool end, unsigned long duration)
unsigned long bi_rw, sector_t len,
struct dm_stats_aux *stats_aux, bool end,
unsigned long duration_jiffies)
{
unsigned long idx = bi_rw & REQ_WRITE;
struct dm_stat_shared *shared = &s->stat_shared[entry];
@ -474,15 +534,35 @@ static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
p = &s->stat_percpu[smp_processor_id()][entry];
if (!end) {
dm_stat_round(shared, p);
dm_stat_round(s, shared, p);
atomic_inc(&shared->in_flight[idx]);
} else {
dm_stat_round(shared, p);
unsigned long long duration;
dm_stat_round(s, shared, p);
atomic_dec(&shared->in_flight[idx]);
p->sectors[idx] += len;
p->ios[idx] += 1;
p->merges[idx] += merged;
p->ticks[idx] += duration;
p->merges[idx] += stats_aux->merged;
if (!(s->stat_flags & STAT_PRECISE_TIMESTAMPS)) {
p->ticks[idx] += duration_jiffies;
duration = jiffies_to_msecs(duration_jiffies);
} else {
p->ticks[idx] += stats_aux->duration_ns;
duration = stats_aux->duration_ns;
}
if (s->n_histogram_entries) {
unsigned lo = 0, hi = s->n_histogram_entries + 1;
while (lo + 1 < hi) {
unsigned mid = (lo + hi) / 2;
if (s->histogram_boundaries[mid - 1] > duration) {
hi = mid;
} else {
lo = mid;
}
}
p->histogram[lo]++;
}
}
#if BITS_PER_LONG == 32
@ -494,7 +574,7 @@ static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
static void __dm_stat_bio(struct dm_stat *s, unsigned long bi_rw,
sector_t bi_sector, sector_t end_sector,
bool end, unsigned long duration,
bool end, unsigned long duration_jiffies,
struct dm_stats_aux *stats_aux)
{
sector_t rel_sector, offset, todo, fragment_len;
@ -523,7 +603,7 @@ static void __dm_stat_bio(struct dm_stat *s, unsigned long bi_rw,
if (fragment_len > s->step - offset)
fragment_len = s->step - offset;
dm_stat_for_entry(s, entry, bi_rw, fragment_len,
stats_aux->merged, end, duration);
stats_aux, end, duration_jiffies);
todo -= fragment_len;
entry++;
offset = 0;
@ -532,11 +612,13 @@ static void __dm_stat_bio(struct dm_stat *s, unsigned long bi_rw,
void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
sector_t bi_sector, unsigned bi_sectors, bool end,
unsigned long duration, struct dm_stats_aux *stats_aux)
unsigned long duration_jiffies,
struct dm_stats_aux *stats_aux)
{
struct dm_stat *s;
sector_t end_sector;
struct dm_stats_last_position *last;
bool got_precise_time;
if (unlikely(!bi_sectors))
return;
@ -560,8 +642,17 @@ void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
rcu_read_lock();
list_for_each_entry_rcu(s, &stats->list, list_entry)
__dm_stat_bio(s, bi_rw, bi_sector, end_sector, end, duration, stats_aux);
got_precise_time = false;
list_for_each_entry_rcu(s, &stats->list, list_entry) {
if (s->stat_flags & STAT_PRECISE_TIMESTAMPS && !got_precise_time) {
if (!end)
stats_aux->duration_ns = ktime_to_ns(ktime_get());
else
stats_aux->duration_ns = ktime_to_ns(ktime_get()) - stats_aux->duration_ns;
got_precise_time = true;
}
__dm_stat_bio(s, bi_rw, bi_sector, end_sector, end, duration_jiffies, stats_aux);
}
rcu_read_unlock();
}
@ -574,10 +665,25 @@ static void __dm_stat_init_temporary_percpu_totals(struct dm_stat_shared *shared
local_irq_disable();
p = &s->stat_percpu[smp_processor_id()][x];
dm_stat_round(shared, p);
dm_stat_round(s, shared, p);
local_irq_enable();
memset(&shared->tmp, 0, sizeof(shared->tmp));
shared->tmp.sectors[READ] = 0;
shared->tmp.sectors[WRITE] = 0;
shared->tmp.ios[READ] = 0;
shared->tmp.ios[WRITE] = 0;
shared->tmp.merges[READ] = 0;
shared->tmp.merges[WRITE] = 0;
shared->tmp.ticks[READ] = 0;
shared->tmp.ticks[WRITE] = 0;
shared->tmp.io_ticks[READ] = 0;
shared->tmp.io_ticks[WRITE] = 0;
shared->tmp.io_ticks_total = 0;
shared->tmp.time_in_queue = 0;
if (s->n_histogram_entries)
memset(shared->tmp.histogram, 0, (s->n_histogram_entries + 1) * sizeof(unsigned long long));
for_each_possible_cpu(cpu) {
p = &s->stat_percpu[cpu][x];
shared->tmp.sectors[READ] += ACCESS_ONCE(p->sectors[READ]);
@ -592,6 +698,11 @@ static void __dm_stat_init_temporary_percpu_totals(struct dm_stat_shared *shared
shared->tmp.io_ticks[WRITE] += ACCESS_ONCE(p->io_ticks[WRITE]);
shared->tmp.io_ticks_total += ACCESS_ONCE(p->io_ticks_total);
shared->tmp.time_in_queue += ACCESS_ONCE(p->time_in_queue);
if (s->n_histogram_entries) {
unsigned i;
for (i = 0; i < s->n_histogram_entries + 1; i++)
shared->tmp.histogram[i] += ACCESS_ONCE(p->histogram[i]);
}
}
}
@ -621,6 +732,15 @@ static void __dm_stat_clear(struct dm_stat *s, size_t idx_start, size_t idx_end,
p->io_ticks_total -= shared->tmp.io_ticks_total;
p->time_in_queue -= shared->tmp.time_in_queue;
local_irq_enable();
if (s->n_histogram_entries) {
unsigned i;
for (i = 0; i < s->n_histogram_entries + 1; i++) {
local_irq_disable();
p = &s->stat_percpu[smp_processor_id()][x];
p->histogram[i] -= shared->tmp.histogram[i];
local_irq_enable();
}
}
}
}
@ -646,11 +766,15 @@ static int dm_stats_clear(struct dm_stats *stats, int id)
/*
* This is like jiffies_to_msec, but works for 64-bit values.
*/
static unsigned long long dm_jiffies_to_msec64(unsigned long long j)
static unsigned long long dm_jiffies_to_msec64(struct dm_stat *s, unsigned long long j)
{
unsigned long long result = 0;
unsigned long long result;
unsigned mult;
if (s->stat_flags & STAT_PRECISE_TIMESTAMPS)
return j;
result = 0;
if (j)
result = jiffies_to_msecs(j & 0x3fffff);
if (j >= 1 << 22) {
@ -706,22 +830,29 @@ static int dm_stats_print(struct dm_stats *stats, int id,
__dm_stat_init_temporary_percpu_totals(shared, s, x);
DMEMIT("%llu+%llu %llu %llu %llu %llu %llu %llu %llu %llu %d %llu %llu %llu %llu\n",
DMEMIT("%llu+%llu %llu %llu %llu %llu %llu %llu %llu %llu %d %llu %llu %llu %llu",
(unsigned long long)start,
(unsigned long long)step,
shared->tmp.ios[READ],
shared->tmp.merges[READ],
shared->tmp.sectors[READ],
dm_jiffies_to_msec64(shared->tmp.ticks[READ]),
dm_jiffies_to_msec64(s, shared->tmp.ticks[READ]),
shared->tmp.ios[WRITE],
shared->tmp.merges[WRITE],
shared->tmp.sectors[WRITE],
dm_jiffies_to_msec64(shared->tmp.ticks[WRITE]),
dm_jiffies_to_msec64(s, shared->tmp.ticks[WRITE]),
dm_stat_in_flight(shared),
dm_jiffies_to_msec64(shared->tmp.io_ticks_total),
dm_jiffies_to_msec64(shared->tmp.time_in_queue),
dm_jiffies_to_msec64(shared->tmp.io_ticks[READ]),
dm_jiffies_to_msec64(shared->tmp.io_ticks[WRITE]));
dm_jiffies_to_msec64(s, shared->tmp.io_ticks_total),
dm_jiffies_to_msec64(s, shared->tmp.time_in_queue),
dm_jiffies_to_msec64(s, shared->tmp.io_ticks[READ]),
dm_jiffies_to_msec64(s, shared->tmp.io_ticks[WRITE]));
if (s->n_histogram_entries) {
unsigned i;
for (i = 0; i < s->n_histogram_entries + 1; i++) {
DMEMIT("%s%llu", !i ? " " : ":", shared->tmp.histogram[i]);
}
}
DMEMIT("\n");
if (unlikely(sz + 1 >= maxlen))
goto buffer_overflow;
@ -763,55 +894,134 @@ static int dm_stats_set_aux(struct dm_stats *stats, int id, const char *aux_data
return 0;
}
static int parse_histogram(const char *h, unsigned *n_histogram_entries,
unsigned long long **histogram_boundaries)
{
const char *q;
unsigned n;
unsigned long long last;
*n_histogram_entries = 1;
for (q = h; *q; q++)
if (*q == ',')
(*n_histogram_entries)++;
*histogram_boundaries = kmalloc(*n_histogram_entries * sizeof(unsigned long long), GFP_KERNEL);
if (!*histogram_boundaries)
return -ENOMEM;
n = 0;
last = 0;
while (1) {
unsigned long long hi;
int s;
char ch;
s = sscanf(h, "%llu%c", &hi, &ch);
if (!s || (s == 2 && ch != ','))
return -EINVAL;
if (hi <= last)
return -EINVAL;
last = hi;
(*histogram_boundaries)[n] = hi;
if (s == 1)
return 0;
h = strchr(h, ',') + 1;
n++;
}
}
static int message_stats_create(struct mapped_device *md,
unsigned argc, char **argv,
char *result, unsigned maxlen)
{
int r;
int id;
char dummy;
unsigned long long start, end, len, step;
unsigned divisor;
const char *program_id, *aux_data;
unsigned stat_flags = 0;
unsigned n_histogram_entries = 0;
unsigned long long *histogram_boundaries = NULL;
struct dm_arg_set as, as_backup;
const char *a;
unsigned feature_args;
/*
* Input format:
* <range> <step> [<program_id> [<aux_data>]]
* <range> <step> [<extra_parameters> <parameters>] [<program_id> [<aux_data>]]
*/
if (argc < 3 || argc > 5)
return -EINVAL;
if (argc < 3)
goto ret_einval;
if (!strcmp(argv[1], "-")) {
as.argc = argc;
as.argv = argv;
dm_consume_args(&as, 1);
a = dm_shift_arg(&as);
if (!strcmp(a, "-")) {
start = 0;
len = dm_get_size(md);
if (!len)
len = 1;
} else if (sscanf(argv[1], "%llu+%llu%c", &start, &len, &dummy) != 2 ||
} else if (sscanf(a, "%llu+%llu%c", &start, &len, &dummy) != 2 ||
start != (sector_t)start || len != (sector_t)len)
return -EINVAL;
goto ret_einval;
end = start + len;
if (start >= end)
return -EINVAL;
goto ret_einval;
if (sscanf(argv[2], "/%u%c", &divisor, &dummy) == 1) {
a = dm_shift_arg(&as);
if (sscanf(a, "/%u%c", &divisor, &dummy) == 1) {
if (!divisor)
return -EINVAL;
step = end - start;
if (do_div(step, divisor))
step++;
if (!step)
step = 1;
} else if (sscanf(argv[2], "%llu%c", &step, &dummy) != 1 ||
} else if (sscanf(a, "%llu%c", &step, &dummy) != 1 ||
step != (sector_t)step || !step)
return -EINVAL;
goto ret_einval;
as_backup = as;
a = dm_shift_arg(&as);
if (a && sscanf(a, "%u%c", &feature_args, &dummy) == 1) {
while (feature_args--) {
a = dm_shift_arg(&as);
if (!a)
goto ret_einval;
if (!strcasecmp(a, "precise_timestamps"))
stat_flags |= STAT_PRECISE_TIMESTAMPS;
else if (!strncasecmp(a, "histogram:", 10)) {
if (n_histogram_entries)
goto ret_einval;
if ((r = parse_histogram(a + 10, &n_histogram_entries, &histogram_boundaries)))
goto ret;
} else
goto ret_einval;
}
} else {
as = as_backup;
}
program_id = "-";
aux_data = "-";
if (argc > 3)
program_id = argv[3];
a = dm_shift_arg(&as);
if (a)
program_id = a;
if (argc > 4)
aux_data = argv[4];
a = dm_shift_arg(&as);
if (a)
aux_data = a;
if (as.argc)
goto ret_einval;
/*
* If a buffer overflow happens after we created the region,
@ -820,17 +1030,29 @@ static int message_stats_create(struct mapped_device *md,
* leaked). So we must detect buffer overflow in advance.
*/
snprintf(result, maxlen, "%d", INT_MAX);
if (dm_message_test_buffer_overflow(result, maxlen))
return 1;
if (dm_message_test_buffer_overflow(result, maxlen)) {
r = 1;
goto ret;
}
id = dm_stats_create(dm_get_stats(md), start, end, step, program_id, aux_data,
id = dm_stats_create(dm_get_stats(md), start, end, step, stat_flags,
n_histogram_entries, histogram_boundaries, program_id, aux_data,
dm_internal_suspend_fast, dm_internal_resume_fast, md);
if (id < 0)
return id;
if (id < 0) {
r = id;
goto ret;
}
snprintf(result, maxlen, "%d", id);
return 1;
r = 1;
goto ret;
ret_einval:
r = -EINVAL;
ret:
kfree(histogram_boundaries);
return r;
}
static int message_stats_delete(struct mapped_device *md,
@ -933,11 +1155,6 @@ int dm_stats_message(struct mapped_device *md, unsigned argc, char **argv,
{
int r;
if (dm_request_based(md)) {
DMWARN("Statistics are only supported for bio-based devices");
return -EOPNOTSUPP;
}
/* All messages here must start with '@' */
if (!strcasecmp(argv[0], "@stats_create"))
r = message_stats_create(md, argc, argv, result, maxlen);

View File

@ -18,6 +18,7 @@ struct dm_stats {
struct dm_stats_aux {
bool merged;
unsigned long long duration_ns;
};
void dm_stats_init(struct dm_stats *st);
@ -30,7 +31,8 @@ int dm_stats_message(struct mapped_device *md, unsigned argc, char **argv,
void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
sector_t bi_sector, unsigned bi_sectors, bool end,
unsigned long duration, struct dm_stats_aux *aux);
unsigned long duration_jiffies,
struct dm_stats_aux *aux);
static inline bool dm_stats_used(struct dm_stats *st)
{

View File

@ -451,10 +451,8 @@ int __init dm_stripe_init(void)
int r;
r = dm_register_target(&stripe_target);
if (r < 0) {
if (r < 0)
DMWARN("target registration failed");
return r;
}
return r;
}

View File

@ -964,8 +964,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
return -EINVAL;
}
if (!t->mempools)
return -ENOMEM;
if (IS_ERR(t->mempools))
return PTR_ERR(t->mempools);
return 0;
}

View File

@ -184,7 +184,6 @@ struct dm_pool_metadata {
uint64_t trans_id;
unsigned long flags;
sector_t data_block_size;
bool read_only:1;
/*
* Set if a transaction has to be aborted but the attempt to roll back
@ -836,7 +835,6 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
init_rwsem(&pmd->root_lock);
pmd->time = 0;
INIT_LIST_HEAD(&pmd->thin_devices);
pmd->read_only = false;
pmd->fail_io = false;
pmd->bdev = bdev;
pmd->data_block_size = data_block_size;
@ -880,7 +878,7 @@ int dm_pool_metadata_close(struct dm_pool_metadata *pmd)
return -EBUSY;
}
if (!pmd->read_only && !pmd->fail_io) {
if (!dm_bm_is_read_only(pmd->bm) && !pmd->fail_io) {
r = __commit_transaction(pmd);
if (r < 0)
DMWARN("%s: __commit_transaction() failed, error = %d",
@ -1392,10 +1390,11 @@ int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
dm_block_t keys[2] = { td->id, block };
struct dm_btree_info *info;
if (pmd->fail_io)
return -EINVAL;
down_read(&pmd->root_lock);
if (pmd->fail_io) {
up_read(&pmd->root_lock);
return -EINVAL;
}
if (can_issue_io) {
info = &pmd->info;
@ -1419,6 +1418,63 @@ int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
return r;
}
/* FIXME: write a more efficient one in btree */
int dm_thin_find_mapped_range(struct dm_thin_device *td,
dm_block_t begin, dm_block_t end,
dm_block_t *thin_begin, dm_block_t *thin_end,
dm_block_t *pool_begin, bool *maybe_shared)
{
int r;
dm_block_t pool_end;
struct dm_thin_lookup_result lookup;
if (end < begin)
return -ENODATA;
/*
* Find first mapped block.
*/
while (begin < end) {
r = dm_thin_find_block(td, begin, true, &lookup);
if (r) {
if (r != -ENODATA)
return r;
} else
break;
begin++;
}
if (begin == end)
return -ENODATA;
*thin_begin = begin;
*pool_begin = lookup.block;
*maybe_shared = lookup.shared;
begin++;
pool_end = *pool_begin + 1;
while (begin != end) {
r = dm_thin_find_block(td, begin, true, &lookup);
if (r) {
if (r == -ENODATA)
break;
else
return r;
}
if ((lookup.block != pool_end) ||
(lookup.shared != *maybe_shared))
break;
pool_end++;
begin++;
}
*thin_end = begin;
return 0;
}
static int __insert(struct dm_thin_device *td, dm_block_t block,
dm_block_t data_block)
{
@ -1471,6 +1527,47 @@ static int __remove(struct dm_thin_device *td, dm_block_t block)
return 0;
}
static int __remove_range(struct dm_thin_device *td, dm_block_t begin, dm_block_t end)
{
int r;
unsigned count;
struct dm_pool_metadata *pmd = td->pmd;
dm_block_t keys[1] = { td->id };
__le64 value;
dm_block_t mapping_root;
/*
* Find the mapping tree
*/
r = dm_btree_lookup(&pmd->tl_info, pmd->root, keys, &value);
if (r)
return r;
/*
* Remove from the mapping tree, taking care to inc the
* ref count so it doesn't get deleted.
*/
mapping_root = le64_to_cpu(value);
dm_tm_inc(pmd->tm, mapping_root);
r = dm_btree_remove(&pmd->tl_info, pmd->root, keys, &pmd->root);
if (r)
return r;
r = dm_btree_remove_leaves(&pmd->bl_info, mapping_root, &begin, end, &mapping_root, &count);
if (r)
return r;
td->mapped_blocks -= count;
td->changed = 1;
/*
* Reinsert the mapping tree.
*/
value = cpu_to_le64(mapping_root);
__dm_bless_for_disk(&value);
return dm_btree_insert(&pmd->tl_info, pmd->root, keys, &value, &pmd->root);
}
int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
{
int r = -EINVAL;
@ -1483,6 +1580,19 @@ int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
return r;
}
int dm_thin_remove_range(struct dm_thin_device *td,
dm_block_t begin, dm_block_t end)
{
int r = -EINVAL;
down_write(&td->pmd->root_lock);
if (!td->pmd->fail_io)
r = __remove_range(td, begin, end);
up_write(&td->pmd->root_lock);
return r;
}
int dm_pool_block_is_used(struct dm_pool_metadata *pmd, dm_block_t b, bool *result)
{
int r;
@ -1739,7 +1849,6 @@ int dm_pool_resize_metadata_dev(struct dm_pool_metadata *pmd, dm_block_t new_cou
void dm_pool_metadata_read_only(struct dm_pool_metadata *pmd)
{
down_write(&pmd->root_lock);
pmd->read_only = true;
dm_bm_set_read_only(pmd->bm);
up_write(&pmd->root_lock);
}
@ -1747,7 +1856,6 @@ void dm_pool_metadata_read_only(struct dm_pool_metadata *pmd)
void dm_pool_metadata_read_write(struct dm_pool_metadata *pmd)
{
down_write(&pmd->root_lock);
pmd->read_only = false;
dm_bm_set_read_write(pmd->bm);
up_write(&pmd->root_lock);
}

View File

@ -146,6 +146,15 @@ struct dm_thin_lookup_result {
int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
int can_issue_io, struct dm_thin_lookup_result *result);
/*
* Retrieve the next run of contiguously mapped blocks. Useful for working
* out where to break up IO. Returns 0 on success, < 0 on error.
*/
int dm_thin_find_mapped_range(struct dm_thin_device *td,
dm_block_t begin, dm_block_t end,
dm_block_t *thin_begin, dm_block_t *thin_end,
dm_block_t *pool_begin, bool *maybe_shared);
/*
* Obtain an unused block.
*/
@ -158,6 +167,8 @@ int dm_thin_insert_block(struct dm_thin_device *td, dm_block_t block,
dm_block_t data_block);
int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block);
int dm_thin_remove_range(struct dm_thin_device *td,
dm_block_t begin, dm_block_t end);
/*
* Queries.

View File

@ -111,22 +111,30 @@ DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
/*
* Key building.
*/
static void build_data_key(struct dm_thin_device *td,
dm_block_t b, struct dm_cell_key *key)
enum lock_space {
VIRTUAL,
PHYSICAL
};
static void build_key(struct dm_thin_device *td, enum lock_space ls,
dm_block_t b, dm_block_t e, struct dm_cell_key *key)
{
key->virtual = 0;
key->virtual = (ls == VIRTUAL);
key->dev = dm_thin_dev_id(td);
key->block_begin = b;
key->block_end = b + 1ULL;
key->block_end = e;
}
static void build_data_key(struct dm_thin_device *td, dm_block_t b,
struct dm_cell_key *key)
{
build_key(td, PHYSICAL, b, b + 1llu, key);
}
static void build_virtual_key(struct dm_thin_device *td, dm_block_t b,
struct dm_cell_key *key)
{
key->virtual = 1;
key->dev = dm_thin_dev_id(td);
key->block_begin = b;
key->block_end = b + 1ULL;
build_key(td, VIRTUAL, b, b + 1llu, key);
}
/*----------------------------------------------------------------*/
@ -312,6 +320,138 @@ struct thin_c {
/*----------------------------------------------------------------*/
/**
* __blkdev_issue_discard_async - queue a discard with async completion
* @bdev: blockdev to issue discard for
* @sector: start sector
* @nr_sects: number of sectors to discard
* @gfp_mask: memory allocation flags (for bio_alloc)
* @flags: BLKDEV_IFL_* flags to control behaviour
* @parent_bio: parent discard bio that all sub discards get chained to
*
* Description:
* Asynchronously issue a discard request for the sectors in question.
* NOTE: this variant of blk-core's blkdev_issue_discard() is a stop-gap
* that is being kept local to DM thinp until the block changes to allow
* late bio splitting land upstream.
*/
static int __blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
struct bio *parent_bio)
{
struct request_queue *q = bdev_get_queue(bdev);
int type = REQ_WRITE | REQ_DISCARD;
unsigned int max_discard_sectors, granularity;
int alignment;
struct bio *bio;
int ret = 0;
struct blk_plug plug;
if (!q)
return -ENXIO;
if (!blk_queue_discard(q))
return -EOPNOTSUPP;
/* Zero-sector (unknown) and one-sector granularities are the same. */
granularity = max(q->limits.discard_granularity >> 9, 1U);
alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
/*
* Ensure that max_discard_sectors is of the proper
* granularity, so that requests stay aligned after a split.
*/
max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
max_discard_sectors -= max_discard_sectors % granularity;
if (unlikely(!max_discard_sectors)) {
/* Avoid infinite loop below. Being cautious never hurts. */
return -EOPNOTSUPP;
}
if (flags & BLKDEV_DISCARD_SECURE) {
if (!blk_queue_secdiscard(q))
return -EOPNOTSUPP;
type |= REQ_SECURE;
}
blk_start_plug(&plug);
while (nr_sects) {
unsigned int req_sects;
sector_t end_sect, tmp;
/*
* Required bio_put occurs in bio_endio thanks to bio_chain below
*/
bio = bio_alloc(gfp_mask, 1);
if (!bio) {
ret = -ENOMEM;
break;
}
req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
/*
* If splitting a request, and the next starting sector would be
* misaligned, stop the discard at the previous aligned sector.
*/
end_sect = sector + req_sects;
tmp = end_sect;
if (req_sects < nr_sects &&
sector_div(tmp, granularity) != alignment) {
end_sect = end_sect - alignment;
sector_div(end_sect, granularity);
end_sect = end_sect * granularity + alignment;
req_sects = end_sect - sector;
}
bio_chain(bio, parent_bio);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
bio->bi_iter.bi_size = req_sects << 9;
nr_sects -= req_sects;
sector = end_sect;
submit_bio(type, bio);
/*
* We can loop for a long time in here, if someone does
* full device discards (like mkfs). Be nice and allow
* us to schedule out to avoid softlocking if preempt
* is disabled.
*/
cond_resched();
}
blk_finish_plug(&plug);
return ret;
}
static bool block_size_is_power_of_two(struct pool *pool)
{
return pool->sectors_per_block_shift >= 0;
}
static sector_t block_to_sectors(struct pool *pool, dm_block_t b)
{
return block_size_is_power_of_two(pool) ?
(b << pool->sectors_per_block_shift) :
(b * pool->sectors_per_block);
}
static int issue_discard(struct thin_c *tc, dm_block_t data_b, dm_block_t data_e,
struct bio *parent_bio)
{
sector_t s = block_to_sectors(tc->pool, data_b);
sector_t len = block_to_sectors(tc->pool, data_e - data_b);
return __blkdev_issue_discard_async(tc->pool_dev->bdev, s, len,
GFP_NOWAIT, 0, parent_bio);
}
/*----------------------------------------------------------------*/
/*
* wake_worker() is used when new work is queued and when pool_resume is
* ready to continue deferred IO processing.
@ -461,6 +601,7 @@ struct dm_thin_endio_hook {
struct dm_deferred_entry *all_io_entry;
struct dm_thin_new_mapping *overwrite_mapping;
struct rb_node rb_node;
struct dm_bio_prison_cell *cell;
};
static void __merge_bio_list(struct bio_list *bios, struct bio_list *master)
@ -541,11 +682,6 @@ static void error_retry_list(struct pool *pool)
* target.
*/
static bool block_size_is_power_of_two(struct pool *pool)
{
return pool->sectors_per_block_shift >= 0;
}
static dm_block_t get_bio_block(struct thin_c *tc, struct bio *bio)
{
struct pool *pool = tc->pool;
@ -559,6 +695,34 @@ static dm_block_t get_bio_block(struct thin_c *tc, struct bio *bio)
return block_nr;
}
/*
* Returns the _complete_ blocks that this bio covers.
*/
static void get_bio_block_range(struct thin_c *tc, struct bio *bio,
dm_block_t *begin, dm_block_t *end)
{
struct pool *pool = tc->pool;
sector_t b = bio->bi_iter.bi_sector;
sector_t e = b + (bio->bi_iter.bi_size >> SECTOR_SHIFT);
b += pool->sectors_per_block - 1ull; /* so we round up */
if (block_size_is_power_of_two(pool)) {
b >>= pool->sectors_per_block_shift;
e >>= pool->sectors_per_block_shift;
} else {
(void) sector_div(b, pool->sectors_per_block);
(void) sector_div(e, pool->sectors_per_block);
}
if (e < b)
/* Can happen if the bio is within a single block. */
e = b;
*begin = b;
*end = e;
}
static void remap(struct thin_c *tc, struct bio *bio, dm_block_t block)
{
struct pool *pool = tc->pool;
@ -647,7 +811,7 @@ struct dm_thin_new_mapping {
struct list_head list;
bool pass_discard:1;
bool definitely_not_shared:1;
bool maybe_shared:1;
/*
* Track quiescing, copying and zeroing preparation actions. When this
@ -658,9 +822,9 @@ struct dm_thin_new_mapping {
int err;
struct thin_c *tc;
dm_block_t virt_block;
dm_block_t virt_begin, virt_end;
dm_block_t data_block;
struct dm_bio_prison_cell *cell, *cell2;
struct dm_bio_prison_cell *cell;
/*
* If the bio covers the whole area of a block then we can avoid
@ -705,6 +869,8 @@ static void overwrite_endio(struct bio *bio, int err)
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
struct dm_thin_new_mapping *m = h->overwrite_mapping;
bio->bi_end_io = m->saved_bi_end_io;
m->err = err;
complete_mapping_preparation(m);
}
@ -793,9 +959,6 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
{
if (m->bio)
m->bio->bi_end_io = m->saved_bi_end_io;
cell_error(m->tc->pool, m->cell);
list_del(&m->list);
mempool_free(m, m->tc->pool->mapping_pool);
@ -805,13 +968,9 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
{
struct thin_c *tc = m->tc;
struct pool *pool = tc->pool;
struct bio *bio;
struct bio *bio = m->bio;
int r;
bio = m->bio;
if (bio)
bio->bi_end_io = m->saved_bi_end_io;
if (m->err) {
cell_error(pool, m->cell);
goto out;
@ -822,7 +981,7 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
* Any I/O for this block arriving after this point will get
* remapped to it directly.
*/
r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block);
r = dm_thin_insert_block(tc->td, m->virt_begin, m->data_block);
if (r) {
metadata_operation_failed(pool, "dm_thin_insert_block", r);
cell_error(pool, m->cell);
@ -849,50 +1008,112 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
mempool_free(m, pool->mapping_pool);
}
/*----------------------------------------------------------------*/
static void free_discard_mapping(struct dm_thin_new_mapping *m)
{
struct thin_c *tc = m->tc;
if (m->cell)
cell_defer_no_holder(tc, m->cell);
mempool_free(m, tc->pool->mapping_pool);
}
static void process_prepared_discard_fail(struct dm_thin_new_mapping *m)
{
struct thin_c *tc = m->tc;
bio_io_error(m->bio);
cell_defer_no_holder(tc, m->cell);
cell_defer_no_holder(tc, m->cell2);
mempool_free(m, tc->pool->mapping_pool);
free_discard_mapping(m);
}
static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
static void process_prepared_discard_success(struct dm_thin_new_mapping *m)
{
struct thin_c *tc = m->tc;
inc_all_io_entry(tc->pool, m->bio);
cell_defer_no_holder(tc, m->cell);
cell_defer_no_holder(tc, m->cell2);
if (m->pass_discard)
if (m->definitely_not_shared)
remap_and_issue(tc, m->bio, m->data_block);
else {
bool used = false;
if (dm_pool_block_is_used(tc->pool->pmd, m->data_block, &used) || used)
bio_endio(m->bio, 0);
else
remap_and_issue(tc, m->bio, m->data_block);
}
else
bio_endio(m->bio, 0);
mempool_free(m, tc->pool->mapping_pool);
bio_endio(m->bio, 0);
free_discard_mapping(m);
}
static void process_prepared_discard(struct dm_thin_new_mapping *m)
static void process_prepared_discard_no_passdown(struct dm_thin_new_mapping *m)
{
int r;
struct thin_c *tc = m->tc;
r = dm_thin_remove_block(tc->td, m->virt_block);
if (r)
DMERR_LIMIT("dm_thin_remove_block() failed");
r = dm_thin_remove_range(tc->td, m->cell->key.block_begin, m->cell->key.block_end);
if (r) {
metadata_operation_failed(tc->pool, "dm_thin_remove_range", r);
bio_io_error(m->bio);
} else
bio_endio(m->bio, 0);
process_prepared_discard_passdown(m);
cell_defer_no_holder(tc, m->cell);
mempool_free(m, tc->pool->mapping_pool);
}
static int passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
{
/*
* We've already unmapped this range of blocks, but before we
* passdown we have to check that these blocks are now unused.
*/
int r;
bool used = true;
struct thin_c *tc = m->tc;
struct pool *pool = tc->pool;
dm_block_t b = m->data_block, e, end = m->data_block + m->virt_end - m->virt_begin;
while (b != end) {
/* find start of unmapped run */
for (; b < end; b++) {
r = dm_pool_block_is_used(pool->pmd, b, &used);
if (r)
return r;
if (!used)
break;
}
if (b == end)
break;
/* find end of run */
for (e = b + 1; e != end; e++) {
r = dm_pool_block_is_used(pool->pmd, e, &used);
if (r)
return r;
if (used)
break;
}
r = issue_discard(tc, b, e, m->bio);
if (r)
return r;
b = e;
}
return 0;
}
static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
{
int r;
struct thin_c *tc = m->tc;
struct pool *pool = tc->pool;
r = dm_thin_remove_range(tc->td, m->virt_begin, m->virt_end);
if (r)
metadata_operation_failed(pool, "dm_thin_remove_range", r);
else if (m->maybe_shared)
r = passdown_double_checking_shared_status(m);
else
r = issue_discard(tc, m->data_block, m->data_block + (m->virt_end - m->virt_begin), m->bio);
/*
* Even if r is set, there could be sub discards in flight that we
* need to wait for.
*/
bio_endio(m->bio, r);
cell_defer_no_holder(tc, m->cell);
mempool_free(m, pool->mapping_pool);
}
static void process_prepared(struct pool *pool, struct list_head *head,
@ -976,7 +1197,7 @@ static void ll_zero(struct thin_c *tc, struct dm_thin_new_mapping *m,
}
static void remap_and_issue_overwrite(struct thin_c *tc, struct bio *bio,
dm_block_t data_block,
dm_block_t data_begin,
struct dm_thin_new_mapping *m)
{
struct pool *pool = tc->pool;
@ -986,7 +1207,7 @@ static void remap_and_issue_overwrite(struct thin_c *tc, struct bio *bio,
m->bio = bio;
save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
inc_all_io_entry(pool, bio);
remap_and_issue(tc, bio, data_block);
remap_and_issue(tc, bio, data_begin);
}
/*
@ -1003,7 +1224,8 @@ static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
struct dm_thin_new_mapping *m = get_next_mapping(pool);
m->tc = tc;
m->virt_block = virt_block;
m->virt_begin = virt_block;
m->virt_end = virt_block + 1u;
m->data_block = data_dest;
m->cell = cell;
@ -1082,7 +1304,8 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
atomic_set(&m->prepare_actions, 1); /* no need to quiesce */
m->tc = tc;
m->virt_block = virt_block;
m->virt_begin = virt_block;
m->virt_end = virt_block + 1u;
m->data_block = data_block;
m->cell = cell;
@ -1091,16 +1314,14 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
* zeroing pre-existing data, we can issue the bio immediately.
* Otherwise we use kcopyd to zero the data first.
*/
if (!pool->pf.zero_new_blocks)
if (pool->pf.zero_new_blocks) {
if (io_overwrites_block(pool, bio))
remap_and_issue_overwrite(tc, bio, data_block, m);
else
ll_zero(tc, m, data_block * pool->sectors_per_block,
(data_block + 1) * pool->sectors_per_block);
} else
process_prepared_mapping(m);
else if (io_overwrites_block(pool, bio))
remap_and_issue_overwrite(tc, bio, data_block, m);
else
ll_zero(tc, m,
data_block * pool->sectors_per_block,
(data_block + 1) * pool->sectors_per_block);
}
static void schedule_external_copy(struct thin_c *tc, dm_block_t virt_block,
@ -1291,99 +1512,149 @@ static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *c
retry_on_resume(bio);
}
static void process_discard_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
static void process_discard_cell_no_passdown(struct thin_c *tc,
struct dm_bio_prison_cell *virt_cell)
{
int r;
struct bio *bio = cell->holder;
struct pool *pool = tc->pool;
struct dm_bio_prison_cell *cell2;
struct dm_cell_key key2;
dm_block_t block = get_bio_block(tc, bio);
struct dm_thin_lookup_result lookup_result;
struct dm_thin_new_mapping *m = get_next_mapping(pool);
/*
* We don't need to lock the data blocks, since there's no
* passdown. We only lock data blocks for allocation and breaking sharing.
*/
m->tc = tc;
m->virt_begin = virt_cell->key.block_begin;
m->virt_end = virt_cell->key.block_end;
m->cell = virt_cell;
m->bio = virt_cell->holder;
if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
pool->process_prepared_discard(m);
}
/*
* FIXME: DM local hack to defer parent bios's end_io until we
* _know_ all chained sub range discard bios have completed.
* Will go away once late bio splitting lands upstream!
*/
static inline void __bio_inc_remaining(struct bio *bio)
{
bio->bi_flags |= (1 << BIO_CHAIN);
smp_mb__before_atomic();
atomic_inc(&bio->__bi_remaining);
}
static void break_up_discard_bio(struct thin_c *tc, dm_block_t begin, dm_block_t end,
struct bio *bio)
{
struct pool *pool = tc->pool;
int r;
bool maybe_shared;
struct dm_cell_key data_key;
struct dm_bio_prison_cell *data_cell;
struct dm_thin_new_mapping *m;
dm_block_t virt_begin, virt_end, data_begin;
if (tc->requeue_mode) {
cell_requeue(pool, cell);
return;
}
while (begin != end) {
r = ensure_next_mapping(pool);
if (r)
/* we did our best */
return;
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
/*
* Check nobody is fiddling with this pool block. This can
* happen if someone's in the process of breaking sharing
* on this block.
*/
build_data_key(tc->td, lookup_result.block, &key2);
if (bio_detain(tc->pool, &key2, bio, &cell2)) {
cell_defer_no_holder(tc, cell);
r = dm_thin_find_mapped_range(tc->td, begin, end, &virt_begin, &virt_end,
&data_begin, &maybe_shared);
if (r)
/*
* Silently fail, letting any mappings we've
* created complete.
*/
break;
build_key(tc->td, PHYSICAL, data_begin, data_begin + (virt_end - virt_begin), &data_key);
if (bio_detain(tc->pool, &data_key, NULL, &data_cell)) {
/* contention, we'll give up with this range */
begin = virt_end;
continue;
}
if (io_overlaps_block(pool, bio)) {
/*
* IO may still be going to the destination block. We must
* quiesce before we can do the removal.
*/
m = get_next_mapping(pool);
m->tc = tc;
m->pass_discard = pool->pf.discard_passdown;
m->definitely_not_shared = !lookup_result.shared;
m->virt_block = block;
m->data_block = lookup_result.block;
m->cell = cell;
m->cell2 = cell2;
m->bio = bio;
if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
pool->process_prepared_discard(m);
} else {
inc_all_io_entry(pool, bio);
cell_defer_no_holder(tc, cell);
cell_defer_no_holder(tc, cell2);
/*
* The DM core makes sure that the discard doesn't span
* a block boundary. So we submit the discard of a
* partial block appropriately.
*/
if ((!lookup_result.shared) && pool->pf.discard_passdown)
remap_and_issue(tc, bio, lookup_result.block);
else
bio_endio(bio, 0);
}
break;
case -ENODATA:
/*
* It isn't provisioned, just forget it.
* IO may still be going to the destination block. We must
* quiesce before we can do the removal.
*/
cell_defer_no_holder(tc, cell);
bio_endio(bio, 0);
break;
m = get_next_mapping(pool);
m->tc = tc;
m->maybe_shared = maybe_shared;
m->virt_begin = virt_begin;
m->virt_end = virt_end;
m->data_block = data_begin;
m->cell = data_cell;
m->bio = bio;
default:
DMERR_LIMIT("%s: dm_thin_find_block() failed: error = %d",
__func__, r);
cell_defer_no_holder(tc, cell);
bio_io_error(bio);
break;
/*
* The parent bio must not complete before sub discard bios are
* chained to it (see __blkdev_issue_discard_async's bio_chain)!
*
* This per-mapping bi_remaining increment is paired with
* the implicit decrement that occurs via bio_endio() in
* process_prepared_discard_{passdown,no_passdown}.
*/
__bio_inc_remaining(bio);
if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
pool->process_prepared_discard(m);
begin = virt_end;
}
}
static void process_discard_cell_passdown(struct thin_c *tc, struct dm_bio_prison_cell *virt_cell)
{
struct bio *bio = virt_cell->holder;
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
/*
* The virt_cell will only get freed once the origin bio completes.
* This means it will remain locked while all the individual
* passdown bios are in flight.
*/
h->cell = virt_cell;
break_up_discard_bio(tc, virt_cell->key.block_begin, virt_cell->key.block_end, bio);
/*
* We complete the bio now, knowing that the bi_remaining field
* will prevent completion until the sub range discards have
* completed.
*/
bio_endio(bio, 0);
}
static void process_discard_bio(struct thin_c *tc, struct bio *bio)
{
struct dm_bio_prison_cell *cell;
struct dm_cell_key key;
dm_block_t block = get_bio_block(tc, bio);
dm_block_t begin, end;
struct dm_cell_key virt_key;
struct dm_bio_prison_cell *virt_cell;
build_virtual_key(tc->td, block, &key);
if (bio_detain(tc->pool, &key, bio, &cell))
get_bio_block_range(tc, bio, &begin, &end);
if (begin == end) {
/*
* The discard covers less than a block.
*/
bio_endio(bio, 0);
return;
}
build_key(tc->td, VIRTUAL, begin, end, &virt_key);
if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
/*
* Potential starvation issue: We're relying on the
* fs/application being well behaved, and not trying to
* send IO to a region at the same time as discarding it.
* If they do this persistently then it's possible this
* cell will never be granted.
*/
return;
process_discard_cell(tc, cell);
tc->pool->process_discard_cell(tc, virt_cell);
}
static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
@ -2099,6 +2370,24 @@ static void notify_of_pool_mode_change(struct pool *pool, const char *new_mode)
dm_device_name(pool->pool_md), new_mode);
}
static bool passdown_enabled(struct pool_c *pt)
{
return pt->adjusted_pf.discard_passdown;
}
static void set_discard_callbacks(struct pool *pool)
{
struct pool_c *pt = pool->ti->private;
if (passdown_enabled(pt)) {
pool->process_discard_cell = process_discard_cell_passdown;
pool->process_prepared_discard = process_prepared_discard_passdown;
} else {
pool->process_discard_cell = process_discard_cell_no_passdown;
pool->process_prepared_discard = process_prepared_discard_no_passdown;
}
}
static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
{
struct pool_c *pt = pool->ti->private;
@ -2150,7 +2439,7 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
pool->process_cell = process_cell_read_only;
pool->process_discard_cell = process_cell_success;
pool->process_prepared_mapping = process_prepared_mapping_fail;
pool->process_prepared_discard = process_prepared_discard_passdown;
pool->process_prepared_discard = process_prepared_discard_success;
error_retry_list(pool);
break;
@ -2169,9 +2458,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
pool->process_bio = process_bio_read_only;
pool->process_discard = process_discard_bio;
pool->process_cell = process_cell_read_only;
pool->process_discard_cell = process_discard_cell;
pool->process_prepared_mapping = process_prepared_mapping;
pool->process_prepared_discard = process_prepared_discard;
set_discard_callbacks(pool);
if (!pool->pf.error_if_no_space && no_space_timeout)
queue_delayed_work(pool->wq, &pool->no_space_timeout, no_space_timeout);
@ -2184,9 +2472,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
pool->process_bio = process_bio;
pool->process_discard = process_discard_bio;
pool->process_cell = process_cell;
pool->process_discard_cell = process_discard_cell;
pool->process_prepared_mapping = process_prepared_mapping;
pool->process_prepared_discard = process_prepared_discard;
set_discard_callbacks(pool);
break;
}
@ -2275,6 +2562,7 @@ static void thin_hook_bio(struct thin_c *tc, struct bio *bio)
h->shared_read_entry = NULL;
h->all_io_entry = NULL;
h->overwrite_mapping = NULL;
h->cell = NULL;
}
/*
@ -2422,7 +2710,6 @@ static void disable_passdown_if_not_supported(struct pool_c *pt)
struct pool *pool = pt->pool;
struct block_device *data_bdev = pt->data_dev->bdev;
struct queue_limits *data_limits = &bdev_get_queue(data_bdev)->limits;
sector_t block_size = pool->sectors_per_block << SECTOR_SHIFT;
const char *reason = NULL;
char buf[BDEVNAME_SIZE];
@ -2435,12 +2722,6 @@ static void disable_passdown_if_not_supported(struct pool_c *pt)
else if (data_limits->max_discard_sectors < pool->sectors_per_block)
reason = "max discard sectors smaller than a block";
else if (data_limits->discard_granularity > block_size)
reason = "discard granularity larger than a block";
else if (!is_factor(block_size, data_limits->discard_granularity))
reason = "discard granularity not a factor of block size";
if (reason) {
DMWARN("Data device (%s) %s: Disabling discard passdown.", bdevname(data_bdev, buf), reason);
pt->adjusted_pf.discard_passdown = false;
@ -3375,7 +3656,7 @@ static int pool_message(struct dm_target *ti, unsigned argc, char **argv)
if (get_pool_mode(pool) >= PM_READ_ONLY) {
DMERR("%s: unable to service pool target messages in READ_ONLY or FAIL mode",
dm_device_name(pool->pool_md));
return -EINVAL;
return -EOPNOTSUPP;
}
if (!strcasecmp(argv[0], "create_thin"))
@ -3573,24 +3854,6 @@ static int pool_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}
static void set_discard_limits(struct pool_c *pt, struct queue_limits *limits)
{
struct pool *pool = pt->pool;
struct queue_limits *data_limits;
limits->max_discard_sectors = pool->sectors_per_block;
/*
* discard_granularity is just a hint, and not enforced.
*/
if (pt->adjusted_pf.discard_passdown) {
data_limits = &bdev_get_queue(pt->data_dev->bdev)->limits;
limits->discard_granularity = max(data_limits->discard_granularity,
pool->sectors_per_block << SECTOR_SHIFT);
} else
limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
}
static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct pool_c *pt = ti->private;
@ -3645,14 +3908,17 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
disable_passdown_if_not_supported(pt);
set_discard_limits(pt, limits);
/*
* The pool uses the same discard limits as the underlying data
* device. DM core has already set this up.
*/
}
static struct target_type pool_target = {
.name = "thin-pool",
.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
DM_TARGET_IMMUTABLE,
.version = {1, 14, 0},
.version = {1, 15, 0},
.module = THIS_MODULE,
.ctr = pool_ctr,
.dtr = pool_dtr,
@ -3811,8 +4077,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
if (tc->pool->pf.discard_enabled) {
ti->discards_supported = true;
ti->num_discard_bios = 1;
/* Discard bios must be split on a block boundary */
ti->split_discard_bios = true;
ti->split_discard_bios = false;
}
mutex_unlock(&dm_thin_pool_table.mutex);
@ -3899,6 +4164,9 @@ static int thin_endio(struct dm_target *ti, struct bio *bio, int err)
}
}
if (h->cell)
cell_defer_no_holder(h->tc, h->cell);
return 0;
}
@ -4026,9 +4294,18 @@ static int thin_iterate_devices(struct dm_target *ti,
return 0;
}
static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct thin_c *tc = ti->private;
struct pool *pool = tc->pool;
limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
}
static struct target_type thin_target = {
.name = "thin",
.version = {1, 14, 0},
.version = {1, 15, 0},
.module = THIS_MODULE,
.ctr = thin_ctr,
.dtr = thin_dtr,
@ -4040,6 +4317,7 @@ static struct target_type thin_target = {
.status = thin_status,
.merge = thin_merge,
.iterate_devices = thin_iterate_devices,
.io_hints = thin_io_hints,
};
/*----------------------------------------------------------------*/

View File

@ -86,6 +86,9 @@ struct dm_rq_target_io {
struct kthread_work work;
int error;
union map_info info;
struct dm_stats_aux stats_aux;
unsigned long duration_jiffies;
unsigned n_sectors;
};
/*
@ -995,6 +998,17 @@ static struct dm_rq_target_io *tio_from_request(struct request *rq)
return (rq->q->mq_ops ? blk_mq_rq_to_pdu(rq) : rq->special);
}
static void rq_end_stats(struct mapped_device *md, struct request *orig)
{
if (unlikely(dm_stats_used(&md->stats))) {
struct dm_rq_target_io *tio = tio_from_request(orig);
tio->duration_jiffies = jiffies - tio->duration_jiffies;
dm_stats_account_io(&md->stats, orig->cmd_flags, blk_rq_pos(orig),
tio->n_sectors, true, tio->duration_jiffies,
&tio->stats_aux);
}
}
/*
* Don't touch any member of the md after calling this function because
* the md may be freed in dm_put() at the end of this function.
@ -1078,6 +1092,7 @@ static void dm_end_request(struct request *clone, int error)
}
free_rq_clone(clone);
rq_end_stats(md, rq);
if (!rq->q->mq_ops)
blk_end_request_all(rq, error);
else
@ -1113,13 +1128,14 @@ static void old_requeue_request(struct request *rq)
spin_unlock_irqrestore(q->queue_lock, flags);
}
static void dm_requeue_unmapped_original_request(struct mapped_device *md,
struct request *rq)
static void dm_requeue_original_request(struct mapped_device *md,
struct request *rq)
{
int rw = rq_data_dir(rq);
dm_unprep_request(rq);
rq_end_stats(md, rq);
if (!rq->q->mq_ops)
old_requeue_request(rq);
else {
@ -1130,13 +1146,6 @@ static void dm_requeue_unmapped_original_request(struct mapped_device *md,
rq_completed(md, rw, false);
}
static void dm_requeue_unmapped_request(struct request *clone)
{
struct dm_rq_target_io *tio = clone->end_io_data;
dm_requeue_unmapped_original_request(tio->md, tio->orig);
}
static void old_stop_queue(struct request_queue *q)
{
unsigned long flags;
@ -1200,7 +1209,7 @@ static void dm_done(struct request *clone, int error, bool mapped)
return;
else if (r == DM_ENDIO_REQUEUE)
/* The target wants to requeue the I/O */
dm_requeue_unmapped_request(clone);
dm_requeue_original_request(tio->md, tio->orig);
else {
DMWARN("unimplemented target endio return value: %d", r);
BUG();
@ -1218,6 +1227,7 @@ static void dm_softirq_done(struct request *rq)
int rw;
if (!clone) {
rq_end_stats(tio->md, rq);
rw = rq_data_dir(rq);
if (!rq->q->mq_ops) {
blk_end_request_all(rq, tio->error);
@ -1910,7 +1920,7 @@ static int map_request(struct dm_rq_target_io *tio, struct request *rq,
break;
case DM_MAPIO_REQUEUE:
/* The target wants to requeue the I/O */
dm_requeue_unmapped_request(clone);
dm_requeue_original_request(md, tio->orig);
break;
default:
if (r > 0) {
@ -1933,7 +1943,7 @@ static void map_tio_request(struct kthread_work *work)
struct mapped_device *md = tio->md;
if (map_request(tio, rq, md) == DM_MAPIO_REQUEUE)
dm_requeue_unmapped_original_request(md, rq);
dm_requeue_original_request(md, rq);
}
static void dm_start_request(struct mapped_device *md, struct request *orig)
@ -1950,6 +1960,14 @@ static void dm_start_request(struct mapped_device *md, struct request *orig)
md->last_rq_start_time = ktime_get();
}
if (unlikely(dm_stats_used(&md->stats))) {
struct dm_rq_target_io *tio = tio_from_request(orig);
tio->duration_jiffies = jiffies;
tio->n_sectors = blk_rq_sectors(orig);
dm_stats_account_io(&md->stats, orig->cmd_flags, blk_rq_pos(orig),
tio->n_sectors, false, 0, &tio->stats_aux);
}
/*
* Hold the md reference here for the in-flight I/O.
* We can't rely on the reference count by device opener,
@ -2173,6 +2191,40 @@ static void dm_init_old_md_queue(struct mapped_device *md)
blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
}
static void cleanup_mapped_device(struct mapped_device *md)
{
cleanup_srcu_struct(&md->io_barrier);
if (md->wq)
destroy_workqueue(md->wq);
if (md->kworker_task)
kthread_stop(md->kworker_task);
if (md->io_pool)
mempool_destroy(md->io_pool);
if (md->rq_pool)
mempool_destroy(md->rq_pool);
if (md->bs)
bioset_free(md->bs);
if (md->disk) {
spin_lock(&_minor_lock);
md->disk->private_data = NULL;
spin_unlock(&_minor_lock);
if (blk_get_integrity(md->disk))
blk_integrity_unregister(md->disk);
del_gendisk(md->disk);
put_disk(md->disk);
}
if (md->queue)
blk_cleanup_queue(md->queue);
if (md->bdev) {
bdput(md->bdev);
md->bdev = NULL;
}
}
/*
* Allocate and initialise a blank device with a given minor.
*/
@ -2218,13 +2270,13 @@ static struct mapped_device *alloc_dev(int minor)
md->queue = blk_alloc_queue(GFP_KERNEL);
if (!md->queue)
goto bad_queue;
goto bad;
dm_init_md_queue(md);
md->disk = alloc_disk(1);
if (!md->disk)
goto bad_disk;
goto bad;
atomic_set(&md->pending[0], 0);
atomic_set(&md->pending[1], 0);
@ -2245,11 +2297,11 @@ static struct mapped_device *alloc_dev(int minor)
md->wq = alloc_workqueue("kdmflush", WQ_MEM_RECLAIM, 0);
if (!md->wq)
goto bad_thread;
goto bad;
md->bdev = bdget_disk(md->disk, 0);
if (!md->bdev)
goto bad_bdev;
goto bad;
bio_init(&md->flush_bio);
md->flush_bio.bi_bdev = md->bdev;
@ -2266,15 +2318,8 @@ static struct mapped_device *alloc_dev(int minor)
return md;
bad_bdev:
destroy_workqueue(md->wq);
bad_thread:
del_gendisk(md->disk);
put_disk(md->disk);
bad_disk:
blk_cleanup_queue(md->queue);
bad_queue:
cleanup_srcu_struct(&md->io_barrier);
bad:
cleanup_mapped_device(md);
bad_io_barrier:
free_minor(minor);
bad_minor:
@ -2291,71 +2336,65 @@ static void free_dev(struct mapped_device *md)
int minor = MINOR(disk_devt(md->disk));
unlock_fs(md);
destroy_workqueue(md->wq);
if (md->kworker_task)
kthread_stop(md->kworker_task);
if (md->io_pool)
mempool_destroy(md->io_pool);
if (md->rq_pool)
mempool_destroy(md->rq_pool);
if (md->bs)
bioset_free(md->bs);
cleanup_srcu_struct(&md->io_barrier);
free_table_devices(&md->table_devices);
dm_stats_cleanup(&md->stats);
spin_lock(&_minor_lock);
md->disk->private_data = NULL;
spin_unlock(&_minor_lock);
if (blk_get_integrity(md->disk))
blk_integrity_unregister(md->disk);
del_gendisk(md->disk);
put_disk(md->disk);
blk_cleanup_queue(md->queue);
cleanup_mapped_device(md);
if (md->use_blk_mq)
blk_mq_free_tag_set(&md->tag_set);
bdput(md->bdev);
free_table_devices(&md->table_devices);
dm_stats_cleanup(&md->stats);
free_minor(minor);
module_put(THIS_MODULE);
kfree(md);
}
static unsigned filter_md_type(unsigned type, struct mapped_device *md)
{
if (type == DM_TYPE_BIO_BASED)
return type;
return !md->use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
}
static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
{
struct dm_md_mempools *p = dm_table_get_md_mempools(t);
if (md->bs) {
/* The md already has necessary mempools. */
if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) {
switch (filter_md_type(dm_table_get_type(t), md)) {
case DM_TYPE_BIO_BASED:
if (md->bs && md->io_pool) {
/*
* This bio-based md already has necessary mempools.
* Reload bioset because front_pad may have changed
* because a different table was loaded.
*/
bioset_free(md->bs);
md->bs = p->bs;
p->bs = NULL;
goto out;
}
/*
* There's no need to reload with request-based dm
* because the size of front_pad doesn't change.
* Note for future: If you are to reload bioset,
* prep-ed requests in the queue may refer
* to bio from the old bioset, so you must walk
* through the queue to unprep.
*/
goto out;
break;
case DM_TYPE_REQUEST_BASED:
if (md->rq_pool && md->io_pool)
/*
* This request-based md already has necessary mempools.
*/
goto out;
break;
case DM_TYPE_MQ_REQUEST_BASED:
BUG_ON(p); /* No mempools needed */
return;
}
BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
md->io_pool = p->io_pool;
p->io_pool = NULL;
md->rq_pool = p->rq_pool;
p->rq_pool = NULL;
md->bs = p->bs;
p->bs = NULL;
out:
/* mempool bind completed, no longer need any mempools in the table */
dm_table_free_md_mempools(t);
@ -2675,6 +2714,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Direct call is fine since .queue_rq allows allocations */
if (map_request(tio, rq, md) == DM_MAPIO_REQUEUE) {
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
return BLK_MQ_RQ_QUEUE_BUSY;
}
@ -2734,14 +2774,6 @@ static int dm_init_request_based_blk_mq_queue(struct mapped_device *md)
return err;
}
static unsigned filter_md_type(unsigned type, struct mapped_device *md)
{
if (type == DM_TYPE_BIO_BASED)
return type;
return !md->use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
}
/*
* Setup the DM device's queue based on md's type
*/
@ -3463,7 +3495,7 @@ struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
pools = kzalloc(sizeof(*pools), GFP_KERNEL);
if (!pools)
return NULL;
return ERR_PTR(-ENOMEM);
front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) +
offsetof(struct dm_target_io, clone);
@ -3482,24 +3514,26 @@ struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
return pools;
out:
dm_free_md_mempools(pools);
return NULL;
return ERR_PTR(-ENOMEM);
}
struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
unsigned type)
{
unsigned int pool_size = dm_get_reserved_rq_based_ios();
unsigned int pool_size;
struct dm_md_mempools *pools;
if (filter_md_type(type, md) == DM_TYPE_MQ_REQUEST_BASED)
return NULL; /* No mempools needed */
pool_size = dm_get_reserved_rq_based_ios();
pools = kzalloc(sizeof(*pools), GFP_KERNEL);
if (!pools)
return NULL;
return ERR_PTR(-ENOMEM);
if (filter_md_type(type, md) == DM_TYPE_REQUEST_BASED) {
pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
if (!pools->rq_pool)
goto out;
}
pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
if (!pools->rq_pool)
goto out;
pools->io_pool = mempool_create_slab_pool(pool_size, _rq_tio_cache);
if (!pools->io_pool)
@ -3508,7 +3542,7 @@ struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
return pools;
out:
dm_free_md_mempools(pools);
return NULL;
return ERR_PTR(-ENOMEM);
}
void dm_free_md_mempools(struct dm_md_mempools *pools)

View File

@ -609,6 +609,12 @@ void dm_bm_prefetch(struct dm_block_manager *bm, dm_block_t b)
dm_bufio_prefetch(bm->bufio, b, 1);
}
bool dm_bm_is_read_only(struct dm_block_manager *bm)
{
return bm->read_only;
}
EXPORT_SYMBOL_GPL(dm_bm_is_read_only);
void dm_bm_set_read_only(struct dm_block_manager *bm)
{
bm->read_only = true;

View File

@ -123,6 +123,7 @@ void dm_bm_prefetch(struct dm_block_manager *bm, dm_block_t b);
* Additionally you should not use dm_bm_unlock_move, however no error will
* be returned if you do.
*/
bool dm_bm_is_read_only(struct dm_block_manager *bm);
void dm_bm_set_read_only(struct dm_block_manager *bm);
void dm_bm_set_read_write(struct dm_block_manager *bm);

View File

@ -590,3 +590,130 @@ int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
return r;
}
EXPORT_SYMBOL_GPL(dm_btree_remove);
/*----------------------------------------------------------------*/
static int remove_nearest(struct shadow_spine *s, struct dm_btree_info *info,
struct dm_btree_value_type *vt, dm_block_t root,
uint64_t key, int *index)
{
int i = *index, r;
struct btree_node *n;
for (;;) {
r = shadow_step(s, root, vt);
if (r < 0)
break;
/*
* We have to patch up the parent node, ugly, but I don't
* see a way to do this automatically as part of the spine
* op.
*/
if (shadow_has_parent(s)) {
__le64 location = cpu_to_le64(dm_block_location(shadow_current(s)));
memcpy(value_ptr(dm_block_data(shadow_parent(s)), i),
&location, sizeof(__le64));
}
n = dm_block_data(shadow_current(s));
if (le32_to_cpu(n->header.flags) & LEAF_NODE) {
*index = lower_bound(n, key);
return 0;
}
r = rebalance_children(s, info, vt, key);
if (r)
break;
n = dm_block_data(shadow_current(s));
if (le32_to_cpu(n->header.flags) & LEAF_NODE) {
*index = lower_bound(n, key);
return 0;
}
i = lower_bound(n, key);
/*
* We know the key is present, or else
* rebalance_children would have returned
* -ENODATA
*/
root = value64(n, i);
}
return r;
}
static int remove_one(struct dm_btree_info *info, dm_block_t root,
uint64_t *keys, uint64_t end_key,
dm_block_t *new_root, unsigned *nr_removed)
{
unsigned level, last_level = info->levels - 1;
int index = 0, r = 0;
struct shadow_spine spine;
struct btree_node *n;
uint64_t k;
init_shadow_spine(&spine, info);
for (level = 0; level < last_level; level++) {
r = remove_raw(&spine, info, &le64_type,
root, keys[level], (unsigned *) &index);
if (r < 0)
goto out;
n = dm_block_data(shadow_current(&spine));
root = value64(n, index);
}
r = remove_nearest(&spine, info, &info->value_type,
root, keys[last_level], &index);
if (r < 0)
goto out;
n = dm_block_data(shadow_current(&spine));
if (index < 0)
index = 0;
if (index >= le32_to_cpu(n->header.nr_entries)) {
r = -ENODATA;
goto out;
}
k = le64_to_cpu(n->keys[index]);
if (k >= keys[last_level] && k < end_key) {
if (info->value_type.dec)
info->value_type.dec(info->value_type.context,
value_ptr(n, index));
delete_at(n, index);
} else
r = -ENODATA;
out:
*new_root = shadow_root(&spine);
exit_shadow_spine(&spine);
return r;
}
int dm_btree_remove_leaves(struct dm_btree_info *info, dm_block_t root,
uint64_t *first_key, uint64_t end_key,
dm_block_t *new_root, unsigned *nr_removed)
{
int r;
*nr_removed = 0;
do {
r = remove_one(info, root, first_key, end_key, &root, nr_removed);
if (!r)
(*nr_removed)++;
} while (!r);
*new_root = root;
return r == -ENODATA ? 0 : r;
}
EXPORT_SYMBOL_GPL(dm_btree_remove_leaves);

View File

@ -134,6 +134,15 @@ int dm_btree_insert_notify(struct dm_btree_info *info, dm_block_t root,
int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
uint64_t *keys, dm_block_t *new_root);
/*
* Removes values between 'keys' and keys2, where keys2 is keys with the
* final key replaced with 'end_key'. 'end_key' is the one-past-the-end
* value. 'keys' may be altered.
*/
int dm_btree_remove_leaves(struct dm_btree_info *info, dm_block_t root,
uint64_t *keys, uint64_t end_key,
dm_block_t *new_root, unsigned *nr_removed);
/*
* Returns < 0 on failure. Otherwise the number of key entries that have
* been filled out. Remember trees can have zero entries, and as such have

View File

@ -204,6 +204,27 @@ static void in(struct sm_metadata *smm)
smm->recursion_count++;
}
static int apply_bops(struct sm_metadata *smm)
{
int r = 0;
while (!brb_empty(&smm->uncommitted)) {
struct block_op bop;
r = brb_pop(&smm->uncommitted, &bop);
if (r) {
DMERR("bug in bop ring buffer");
break;
}
r = commit_bop(smm, &bop);
if (r)
break;
}
return r;
}
static int out(struct sm_metadata *smm)
{
int r = 0;
@ -216,21 +237,8 @@ static int out(struct sm_metadata *smm)
return -ENOMEM;
}
if (smm->recursion_count == 1) {
while (!brb_empty(&smm->uncommitted)) {
struct block_op bop;
r = brb_pop(&smm->uncommitted, &bop);
if (r) {
DMERR("bug in bop ring buffer");
break;
}
r = commit_bop(smm, &bop);
if (r)
break;
}
}
if (smm->recursion_count == 1)
apply_bops(smm);
smm->recursion_count--;
@ -704,6 +712,12 @@ static int sm_metadata_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
}
old_len = smm->begin;
r = apply_bops(smm);
if (r) {
DMERR("%s: apply_bops failed", __func__);
goto out;
}
r = sm_ll_commit(&smm->ll);
if (r)
goto out;
@ -773,6 +787,12 @@ int dm_sm_metadata_create(struct dm_space_map *sm,
if (r)
return r;
r = apply_bops(smm);
if (r) {
DMERR("%s: apply_bops failed", __func__);
return r;
}
return sm_metadata_commit(sm);
}