Commit Graph

174 Commits

Author SHA1 Message Date
tang.junhui
54cd640d20 dm mpath: use hw_handler_params if attached hw_handler is same as requested
Let the requested m->hw_handler_params be used if the attached hardware
handler is the same handler as requested with m->hw_handler_name.

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:10 -05:00
Bart Van Assche
6599c84e4c dm mpath: do not modify *__clone if blk_mq_alloc_request() fails
Purely cleanup, avoids potential for strange coding bugs.  But in
reality if __multipath_map() fails the caller has no business looking at
*__clone.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21 09:52:10 -05:00
Bart Van Assche
4813577f93 dm mpath: change return type of pg_init_all_paths() from int to void
None of the callers of pg_init_all_paths() check its return value.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21 09:52:09 -05:00
tang.junhui
cc5bd925f1 dm mpath: add checks for priority group count to avoid invalid memory access
This avoids the potential for invalid memory access, if/when there are
no priority groups, in response to invalid arguments being sent by the
user via DM message (e.g. "switch_group", "disable_group" or
"enable_group").

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21 09:52:08 -05:00
tang.junhui
f97dc42128 dm mpath: add m->hw_handler_name NULL pointer check in parse_hw_handler()
Avoids false positive of no hardware handler being specified (which is
implied by a NULL m->hw_handler_name).

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21 09:52:07 -05:00
Hannes Reinecke
8ff232c1a8 dm mpath: always return reservation conflict without failing over
If dm-mpath encounters an reservation conflict it should not fail the
path (as communication with the target is not affected) but should
rather retry on another path.  However, in doing so we might be inducing
a ping-pong between paths, with no guarantee of any forward progress.
And arguably a reservation conflict is an unexpected error, so we should
be passing it upwards to allow the application to take appropriate
steps.

This change resolves a show-stopper problem seen with the pNFS SCSI
layout because it is trivial to hit reservation conflict based failover
loops without it.

Doubts were raised about the implications of this change relative to
products like IBM's SVC.  But there is little point withholding a fix
for Linux because a proprietary product may or may not have some issues
in its implementation of how it interfaces with Linux.  In the future,
if there is glaring evidence that this change is certainly problematic
we can revisit it.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
2016-09-29 10:57:07 -04:00
Mike Snitzer
b88efd43f9 dm mpath: delay the requeue of blk-mq requests while all paths down
Return DM_MAPIO_DELAY_REQUEUE from .clone_and_map_rq.  Also, return
false from .busy, if all paths are down, so that blk-mq requests get
mapped via .clone_and_map_rq -- which results in DM_MAPIO_DELAY_REQUEUE
being returned to dm-rq.

This change allows for a noticeable reduction in cpu utilization
(reduced kworker load) while all paths are down, e.g.:

system CPU idleness (as measured by fio's --idle-prof=system):
before: system: 86.58%
after:  system: 98.60%

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-09-15 11:16:17 -04:00
Mike Snitzer
7e48c768f4 dm mpath: use dm_mq_kick_requeue_list()
When reinstating a path the blk-mq request_queue's requeue_list should
get kicked.  It makes sense to kick the requeue_list as part of the
existing hook (previously only used by bio-based support).

Rename process_queued_bios_list to process_queued_io_list.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-09-15 11:16:11 -04:00
Bart Van Assche
9f4c3f874a dm: convert wait loops to use autoremove_wake_function()
Use autoremove_wake_function() instead of default_wake_function()
to make the dm wait loops more similar to other wait loops in the
kernel.  This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-14 13:56:38 -04:00
Mike Snitzer
f10e06b744 dm mpath: check if path's request_queue is dying in activate_path()
If pg_init_retries is set and a request is queued against a multipath
device with all underlying block device request_queues in the "dying"
state then an infinite loop is triggered because activate_path() never
succeeds and hence never calls pg_init_done().

This change avoids that device removal triggers an infinite loop by
failing the activate_path() which causes the "dying" path to be failed.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-09-14 13:56:38 -04:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Mike Snitzer
1814f2e3fb dm mpath: add locking to multipath_resume and must_push_back
Multiple flags were being tested without locking.  Protect against
non-atomic bit changes in m->flags by holding m->lock (while testing or
setting the queue_if_no_path related flags).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-08-02 16:21:34 -04:00
Mike Snitzer
e83068a5fa dm mpath: add optional "queue_mode" feature
Allow a user to specify an optional feature 'queue_mode <mode>' where
<mode> may be "bio", "rq" or "mq" -- which corresponds to bio-based,
request_fn rq-based, and blk-mq rq-based respectively.

If the queue_mode feature isn't specified the default for the
"multipath" target is still "rq" but if dm_mod.use_blk_mq is set to Y
it'll default to mode "mq".

This new queue_mode feature introduces the ability for each multipath
device to have its own queue_mode (whereas before this feature all
multipath devices effectively had to have the same queue_mode).

This commit also goes a long way to eliminate the awkward (ab)use of
DM_TYPE_*, the associated filter_md_type() and other relatively fragile
and difficult to maintain code.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-10 15:16:02 -04:00
Mike Snitzer
bf661be1fc dm mpath: remove bio-based bloat from struct dm_mpath_io
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-10 15:15:48 -04:00
Mike Snitzer
76e33fe4e2 dm mpath: reinstate bio-based support
Add "multipath-bio" target that offers a bio-based multipath target as
an alternative to the request-based "multipath" target -- but in a
following commit "multipath-bio" will immediately be replaced by a new
"queue_mode" feature for the "multipath" target which will allow
bio-based mode to be selected.

When DM multipath was originally converted from bio-based to
request-based the motivation for the change was better dynamic load
balancing (by leveraging block core's request-based IO schedulers, for
merging and sorting, _before_ DM multipath would make the decision on
where to steer the IO -- based on path load and/or availability).

More background is available in this "Request-based Device-mapper
multipath and Dynamic load balancing" paper:
https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf

But we've now come full circle where significantly faster storage
devices no longer need IOs to be made larger to drive optimal IO
performance.  And even if they do there have been changes to the block
and filesystem layers that help ensure upper layers are constructing
larger IOs.  In addition, SCSI's differentiated IO errors will propagate
through to bio-based IO completion hooks -- so that eliminates another
historic justiciation for request-based DM multipath.  Lastly, the block
layer's immutable biovec changes have made bio cloning cheaper than it
has ever been; whereas request cloning is still relatively expensive
(both on a CPU usage and memory footprint level).

As such, bio-based DM multipath offers the promise of a more efficient
IO path for high IOPs devices that are, or will be, emerging.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-10 15:15:47 -04:00
Mike Snitzer
4cc96131af dm: move request-based code out to dm-rq.[hc]
Add some seperation between bio-based and request-based DM core code.

'struct mapped_device' and other DM core only structures and functions
have been moved to dm-core.h and all relevant DM core .c files have been
updated to include dm-core.h rather than dm.h

DM targets should _never_ include dm-core.h!

[block core merge conflict resolution from Stephen Rothwell]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
2016-06-10 15:15:44 -04:00
Mike Snitzer
2da1610ae2 dm mpath: eliminate use of spinlock in IO fast-paths
The primary motivation of this commit is to improve the scalability of
DM multipath on large NUMA systems where m->lock spinlock contention has
been proven to be a serious bottleneck on really fast storage.

The ability to atomically read a pointer, using lockless_dereference(),
is leveraged in this commit.  But all pointer writes are still protected
by the m->lock spinlock (which is fine since these all now occur in the
slow-path).

The following functions no longer require the m->lock spinlock in their
fast-path: multipath_busy(), __multipath_map(), and do_end_io()

And choose_pgpath() is modified to _not_ update m->current_pgpath unless
it also switches the path-group.  This is done to avoid needing to take
the m->lock everytime __multipath_map() calls choose_pgpath().
But m->current_pgpath will be reset if it is failed via fail_path().

Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:52 -04:00
Mike Snitzer
20800cb345 dm mpath: move trigger_event member to the end of 'struct multipath'
Allows the 'work_mutex' member to no longer cross a cacheline.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:52 -04:00
Mike Snitzer
91e968aa60 dm mpath: use atomic_t for counting members of 'struct multipath'
The use of atomic_t for nr_valid_paths, pg_init_in_progress and
pg_init_count will allow relaxing the use of the m->lock spinlock.

Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:51 -04:00
Mike Snitzer
518257b132 dm mpath: switch to using bitops for state flags
Mechanical change that doesn't make any real effort to reduce the use of
m->lock; that will come later (once atomics are used for counters, etc).

Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:50 -04:00
Mike Snitzer
ec31f3f78a dm mpath: cleanup reinstate_path() et al based on code review
fail_path() will print a "Failing path ..." message but reinstate_path()
doesn't print a "Reinstating path ...".  Add that message to
reinstate_path() to add symmetry and aid system debugging.

Remove reinstate_path()'s check for the path_selector providing
.reinstate_path hook.  All path selectors provide this and any future
ones must too.

activate_path() calls pg_init_done() with SCSI_DH_DEV_OFFLINED but
pg_init_done() doesn't expicitly handle it in its swicth statement.  Add
SCSI_DH_DEV_OFFLINED to the default case.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:04 -05:00
Mike Snitzer
9f54cec553 dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:44 -05:00
Mike Snitzer
be7d31cca8 dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:43 -05:00
Mike Snitzer
90a4323ccf dm path selector: remove 'repeat_count' return from .select_path hook
If a path selector has any use for a repeat_count it should be handled
locally and not depend on the dm-mpath core to be concerned with it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:42 -05:00
Mike Snitzer
21136f89d7 dm mpath: remove repeat_count support from multipath core
Preparation for making __multipath_map() avoid taking the m->lock
spinlock -- in favor of using RCU locking.

repeat_count was primarily for bio-based DM multipath's benefit.  There
is really no need for it anymore now that DM multipath is request-based.
As such, repeat_count > 1 is no longer honored and a warning is
displayed if the user attempts to use a value > 1.  This is a temporary
change for the round-robin path-selector (as a later commit will restore
its support for repeat_count > 1).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:40 -05:00
Mike Snitzer
7943bd6dd3 dm mpath: remove unnecessary casts in front of ti->private
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:40 -05:00
Mike Snitzer
78ce23b518 dm mpath: use blk_mq_alloc_request() and blk_mq_free_request() directly
There isn't any need to support both old .request_fn and blk-mq paths
in the blk-mq specific portion of __multipath_map().  Call
blk_mq_alloc_request() directly rather than use blk_get_request().

Similarly, call blk_mq_free_request(), rather than blk_put_request(), in
multipath_release_clone().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:39 -05:00
Mike Snitzer
2eff1924e1 dm mpath: cleanup 'struct dm_mpath_io' management code
Refactor and rename existing interfaces to be more specific and
self-documenting.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:39 -05:00
Mike Snitzer
8637a6bf14 dm mpath: use blk-mq pdu for per-request 'struct dm_mpath_io'
Allow the multipath target to avoid making small allocations for each
'struct dm_mpath_io' that is needed for each request.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:38 -05:00
Mike Snitzer
eca7ee6dc0 dm: distinquish old .request_fn (dm-old) vs dm-mq request-based DM
Rename various methods to have either a "dm_old" or "dm_mq" prefix.
Improve code comments to assist with understanding the duality of code
that handles both "dm_old" and "dm_mq" cases.

It is no much easier to quickly look at the code and _know_ that a given
method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:33 -05:00
Mike Snitzer
c5248f79f3 dm: remove support for stacking dm-mq on .request_fn device(s)
Remove all fiddley code that propped up this support for a blk-mq
request-queue ontop of all .request_fn devices.

Testing has proven this niche request-based dm-mq mode to be buggy, when
testing fault tolerance with DM multipath, and there is no point trying
to preserve it.

Should help improve efficiency of pure dm-mq code and make code
maintenance less delicate.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:33:46 -05:00
Mike Snitzer
16f122661d dm: optimize dm_mq_queue_rq()
DM multipath is the only dm-mq target.  But that aside, request-based DM
only supports tables with a single target that is immutable.  Leverage
this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in
the mapped_device when the table was made active.  This saves the need
to even take the read-side of the SRCU via dm_{get,put}_live_table.

If the active DM table does not have an immutable target (e.g. "error"
target was swapped in) then fallback to the slow-path where the target
is looked up from the live table.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:22 -05:00
Junichi Nomura
43e43c9ea6 dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path
In multipath_prepare_ioctl(),
  - pgpath is a path selected from available paths
  - m->queue_io is true if we cannot send a request immediately to
    paths, either because:
      * there is no available path
      * the path group needs activation (pg_init)
          - pg_init is not started
          - pg_init is still running
  - m->queue_if_no_path is true if the device is configured to queue
    I/O if there are no available paths

If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
However in the course of refactoring the condition check has broken
and returns success in that case.  Since bdev points to the dm device
itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and
recurses until crash.

You could reproduce the problem like this:

  # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
  # sg_inq /dev/mapper/mp
  <crash>
  [  172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
  [  172.662843] PGD 19dd067 PUD 0
  [  172.666269] Thread overran stack, or stack corrupted
  [  172.671808] Oops: 0000 [#1] SMP
  ...

Fix the condition check with some clarifications.

Fixes: e56f81e0b0 ("dm: refactor ioctl handling")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-17 14:19:00 -05:00
Junichi Nomura
5bbbfdf685 dm: fix ioctl retry termination with signal
dm-mpath retries ioctl, when no path is readily available and the device
is configured to queue I/O in such a case. If you want to stop the retry
before multipathd decides to turn off queueing mode, you could send
signal for the process to exit from the loop.

However the check of fatal signal has not carried along when commit
6c182cd88d ("dm mpath: fix ioctl deadlock when no paths") moved the
loop from dm-mpath to dm core. As a result, we can't terminate such
a process in the retry loop.

Easy reproducer of the situation is:

  # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
  # dmsetup message mp 0 'queue_if_no_path'
  # sg_inq /dev/mapper/mp

then you should be able to terminate sg_inq by pressing Ctrl+C.

Fixes: 6c182cd88d ("dm mpath: fix ioctl deadlock when no paths")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-11-17 14:04:32 -05:00
Christoph Hellwig
71cdb6978a dm: add support for passing through persistent reservations
This adds support to pass through persistent reservation requests
similar to the existing ioctl handling, and with the same limitations,
e.g. devices may only have a single target attached.

This is mostly intended for multipathing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:05:59 -04:00
Christoph Hellwig
e56f81e0b0 dm: refactor ioctl handling
This moves the call to blkdev_ioctl and the argument checking to DM core
code, and only leaves a callout to find the block device to operate on
in the targets.  This simplifies the code and allows us to pass through
ioctl-like command using other methods in the next patch.

Also split out a helper around calling the prepare_ioctl method that
will be reused for persistent reservation handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:05:59 -04:00
Mauricio Faria de Oliveira
47796938c4 Revert "dm mpath: fix stalls when handling invalid ioctls"
This reverts commit a1989b3300.

That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.

More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).

Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).

1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)

    $ cat sg_simple0.c
    ... see [3] ...
    $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
    $ gcc sgio_inquiry.c -o sgio_inquiry

2) The ioctl() works fine with active paths present.

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=active
    | |- 8:0:11:0  sdz  65:144  active undef running
    | `- 9:0:9:0   sdbf 67:144  active undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  active undef running
      `- 9:0:12:0  sdbo 68:32   active undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    Some of the INQUIRY command's response:
        IBM       2145              0000
    INQUIRY duration=0 millisecs, resid=0

3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
   for unprivileged users (rather than blocking due to queue_if_no_path).

    # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
          do multipathd -k"fail path $path"; done

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- 8:0:11:0  sdz  65:144  failed undef running
    | `- 9:0:9:0   sdbf 67:144  failed undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  failed undef running
      `- 9:0:12:0  sdbo 68:32   failed undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
   it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().

    $ dmesg
    <...>
    [] device-mapper: multipath: Failing path 65:144.
    [] device-mapper: multipath: Failing path 67:144.
    [] device-mapper: multipath: Failing path 65:224.
    [] device-mapper: multipath: Failing path 68:32.
    [] sgio_inquiry: sending ioctl 2285 to a partition!

5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
   (then queueing happens -- in this example, queue_if_no_path is set);
   this is due to a conditional check in scsi_verify_blk_ioctl().

    # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

    # ./sgio_inquiry /dev/mapper/85ag56 &
    [1] 72830

    # cat /proc/72830/stack
    [<c00000171c0df700>] 0xc00000171c0df700
    [<c000000000015934>] __switch_to+0x204/0x350
    [<c000000000152d4c>] msleep+0x5c/0x80
    [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
    [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
    [<c0000000003128e4>] block_ioctl+0x64/0xd0
    [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
    [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
    [<c000000000009358>] system_call+0x38/0xd0

6) This is the function call chain exercised in this analysis:

SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
    -> do_vfs_ioctl()
        -> vfs_ioctl()
            ...
            error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
            ...
                -> dm_blk_ioctl() @ drivers/md/dm.c
                    -> multipath_ioctl() @ drivers/md/dm-mpath.c
                        ...
                        (bdev = NULL, due to no active paths)
                        ...
                        if (!bdev || <...>) {
                            int err = scsi_verify_blk_ioctl(NULL, cmd);
                            if (err)
                                r = err;
                        }
                        ...
                            -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
                                ...
                                if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
                                    return 0;
                                ...
                                if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
                                    return 0;
                                ...
                                printk_ratelimited(KERN_WARNING
                                           "%s: sending ioctl %x to a partition!\n" <...>);

                                return -ENOIOCTLCMD;
                            <-
                        ...
                        return r ? : <...>
                    <-
            ...
            if (error == -ENOIOCTLCMD)
                error = -ENOTTY;
             out:
                return error;
            ...

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-31 18:53:51 -04:00
Christoph Hellwig
566079c849 dm-mpath, scsi_dh: request scsi_dh modules in scsi_dh, not dm-mpath
This way we can reused the same code any attachment method, not just those
requested from dm-mpath.

[jejb: fixup checkpatch error]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-08-28 13:14:55 -07:00
Christoph Hellwig
1bab0de027 dm-mpath, scsi_dh: don't let dm detach device handlers
While allowing dm-mpath to attach device handlers is a functionality we need
for backwards compatibility reason there is no reason to reference count
them and detach them if dm-mpath stops using the device for some reason.

If the device handler works for the given device it can just stay attached,
and we can take the retain_hw_handler codepath.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@Suse.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-08-28 13:14:54 -07:00
Mike Snitzer
4c6dd53dd3 dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path
Otherwise kmemleak reported:

unreferenced object 0xffff88009b14e2b0 (size 16):
  comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s)
  hex dump (first 16 bytes):
    40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00  @...............
  backtrace:
    [<ffffffff81600029>] kmemleak_alloc+0x49/0xb0
    [<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160
    [<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20
    [<ffffffff8111cb37>] mempool_alloc+0x57/0x150
    [<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath]
    [<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath]
    [<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod]
    [<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod]
    [<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380
    [<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100
    [<ffffffff812ce350>] blk_sq_make_request+0x240/0x300
    [<ffffffff812c0f30>] generic_make_request+0xc0/0x110
    [<ffffffff812c0ff2>] submit_bio+0x72/0x150
    [<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0
    [<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40
    [<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390

Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
2015-05-27 17:37:22 -04:00
Mike Snitzer
022333427a dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq
dm_mq_queue_rq() is in atomic context so care must be taken to not
sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations
and dm-mpath's call to blk_get_request().  In the future the bioset
allocations will hopefully go away (by removing support for partial
completions of bios in a cloned request).

Also prepare for supporting DM blk-mq ontop of old-style request_fn
device(s) if a new dm-mod 'use_blk_mq' parameter is set.  The kthread
will still be used to queue work if blk-mq is used ontop of old-style
request_fn device(s).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:17 -04:00
Mike Snitzer
bfebd1cdb4 dm: add full blk-mq support to request-based DM
Commit e5863d9ad ("dm: allocate requests in target when stacking on
blk-mq devices") served as the first step toward fully utilizing blk-mq
in request-based DM -- it enabled stacking an old-style (request_fn)
request_queue ontop of the underlying blk-mq device(s).  That first step
didn't improve performance of DM multipath ontop of fast blk-mq devices
(e.g. NVMe) because the top-level old-style request_queue was severely
limited by the queue_lock.

The second step offered here enables stacking a blk-mq request_queue
ontop of the underlying blk-mq device(s).  This unlocks significant
performance gains on fast blk-mq devices, Keith Busch tested on his NVMe
testbed and offered this really positive news:

 "Just providing a performance update. All my fio tests are getting
  roughly equal performance whether accessed through the raw block
  device or the multipath device mapper (~470k IOPS). I could only push
  ~20% of the raw iops through dm before this conversion, so this latest
  tree is looking really solid from a performance standpoint."

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Keith Busch <keith.busch@intel.com>
2015-04-15 12:10:16 -04:00
Mike Snitzer
52b09914af dm: remove unnecessary wrapper around blk_lld_busy
There is no need for DM to export a wrapper around the already exported
blk_lld_busy().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:49 -04:00
Johannes Thumshirn
ff658e9c1a dm mpath: simplify failure path of dm_multipath_init()
Currently the cleanup of all error cases are open-coded.  Introduce a
common exit path and labels.

Signed-off-by: Johannes Thumshirn <morbidrsa@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:49 -05:00
Mike Snitzer
e5863d9ad7 dm: allocate requests in target when stacking on blk-mq devices
For blk-mq request-based DM the responsibility of allocating a cloned
request is transfered from DM core to the target type.  Doing so
enables the cloned request to be allocated from the appropriate
blk-mq request_queue's pool (only the DM target, e.g. multipath, can
know which block device to send a given cloned request to).

Care was taken to preserve compatibility with old-style block request
completion that requires request-based DM _not_ acquire the clone
request's queue lock in the completion path.  As such, there are now 2
different request-based DM target_type interfaces:
1) the original .map_rq() interface will continue to be used for
   non-blk-mq devices -- the preallocated clone request is passed in
   from DM core.
2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
   blk-mq devices -- blk_get_request() and blk_put_request() are used
   respectively from these hooks.

dm_table_set_type() was updated to detect if the request-based target is
being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
DM core disallows switching the DM table's type after it is set.  This
means that there is no mixing of non-blk-mq and blk-mq devices within
the same request-based DM table.

[This patch was started by Keith and later heavily modified by Mike]

Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:47 -05:00
Keith Busch
2eb6e1e3aa dm: submit stacked requests in irq enabled context
Switch to having request-based DM enqueue all prep'ed requests into work
processed by another thread.  This allows request-based DM to invoke
block APIs that assume interrupt enabled context (e.g. blk_get_request)
and is a prerequisite for adding blk-mq support to request-based DM.

The new kernel thread is only initialized for request-based DM devices.

multipath_map() is now always in irq enabled context so change multipath
spinlock (m->lock) locking to always disable interrupts.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:47 -05:00
Benjamin Marzinski
1f27197247 dm mpath: stop queueing IO when no valid paths exist
'queue_io' is set so that IO is queued while paths are being
initialized.  Clear queue_io in __choose_pgpath if there are no valid
paths, since there are obviously no paths that can be initialized.
Otherwise IOs to the device will back up.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05 20:03:35 -04:00
Mike Snitzer
6afbc01d75 dm mpath: eliminate pg_ready() wrapper
pg_ready() is not comprehensive in its logic and only serves to
obfuscate code.  Replace pg_ready() with the appropriate logic in
multipath_map().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:31 -04:00
Jun'ichi Nomura
7a7a3b45fe dm mpath: fix IO hang due to logic bug in multipath_busy
Commit e80991773 ("dm mpath: push back requests instead of queueing")
modified multipath_busy() to return true if !pg_ready().  pg_ready()
checks the current state of the multipath device and may return false
even if a new IO is needed to change the state.

Bart Van Assche reported that he had multipath IO lockup when he was
performing cable pull tests.  Analysis showed that the multipath
device had a single path group with both paths active, but that the
path group itself was not active.  During the multipath device state
transitions 'queue_io' got set but nothing could clear it.  Clearing
'queue_io' only happens in __choose_pgpath(), but it won't be called
if multipath_busy() returns true due to pg_ready() returning false
when 'queue_io' is set.

As such the !pg_ready() check in multipath_busy() is wrong because new
IO will not be sent to multipath target and the multipath state change
won't happen.  That results in multipath IO lockup.

The intent of multipath_busy() is to avoid unnecessary cycles of
dequeue + request_fn + requeue if it is known that the multipath
device will requeue.

Such "busy" situations would be:
  - path group is being activated
  - there is no path and the multipath is setup to requeue if no path

Fix multipath_busy() to return "busy" early only for these specific
situations.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.15
2014-07-10 16:44:15 -04:00
Mike Snitzer
7eee4ae2db dm: disable WRITE SAME if it fails
Add DM core support for disabling WRITE SAME on first failure to both
request-based and bio-based targets.  The need to disable WRITE SAME
stems from SCSI enabling it by default but then disabling it when it
fails.  When SCSI does this it returns "permanent target failure, do
not retry" using -EREMOTEIO.  Update DM core to only disable WRITE SAME
on failure if the returned error is -EREMOTEIO.

Commit f84cb8a4 ("dm mpath: disable WRITE SAME if it fails")
implemented multipath specific disabling of WRITE SAME if it fails.
However, as that commit detailed, the multipath-only solution doesn't go
far enough if bio-based DM targets are stacked ontop of the
request-based dm-multipath target (as is commonly done using dm-linear
to support partitions on multipath devices, via kpartx).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Alex Chen <alex.chen@huawei.com>
2014-06-04 09:45:52 -04:00