Commit Graph

241 Commits

Author SHA1 Message Date
Mikulas Patocka
708e929513 dm: skip second flush on bio unsupported error
When processing barriers, skip the second flush if processing the bio
failed with -EOPNOTSUPP.  This can happen with discard+barrier requests.
If the device doesn't support discard, there would be two useless
SYNCHRONIZE CACHE commands.  The first dm_flush cannot be so easily
optimized out, so we leave it there.

Previously, -EOPNOTSUPP could be received in dec_pending only with empty
barriers and we ignored that error, assuming the device not supporting
cache flushes has cache always consistent.  With the addition of discard
barriers, this -EOPNOTSUPP can also be generated by discards and we
must record it in md->barrier_error for process_barrier.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:14:00 +01:00
Kiyoshi Ueda
3f77316de0 dm: separate device deletion from dm_put
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.

By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
    http://marc.info/?l=dm-devel&m=126699981019735&w=2

Without this patch, I confirmed there is a case to crash the system:
    dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())

Some more backgrounds and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
  CPU0                                          CPU1
  =================================================================
  <<INTERRUPT>>
  blk_end_request_all(clone_rq)
    blk_update_request(clone_rq)
      bio_endio(clone_bio) == end_clone_bio
        blk_update_request(orig_rq)
          bio_endio(orig_bio)
                                                <<I/O completed>>
                                                dm_blk_close()
                                                dev_remove()
                                                  dm_put(md)
                                                    <<Free md>>
   blk_finish_request(clone_rq)
     ....
     dm_end_request(clone_rq)
       free_rq_clone(clone_rq)
       blk_end_request_all(orig_rq)
       rq_completed(md)

So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code which
must not be run in interrupt context and may cause kernel panic.

To solve the problem, this patch moves the device deletion code,
dm_destroy(), to predetermined places that is actually deleting
the mapped_device in ioctl (process) context, and changes dm_put()
just to decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
    dm_create():  create a mapped_device
    dm_destroy(): destroy a mapped_device
    dm_get():     increment the reference count of a mapped_device
    dm_put():     decrement the reference count of a mapped_device

dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.

dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't performance-critical task.
And since at this point, nobody opens the mapped_device and no new
reference will be taken, the pending counts are just for racing
completing activity and will eventually decrease to zero.

For the unlikely case of the forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all().  Otherwise, "rmmod -f"
may be stuck and never return.
And now, because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer references.

Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:56 +01:00
Kiyoshi Ueda
abdc568b05 dm: prevent access to md being deleted
This patch prevents access to mapped_device which is being deleted.

Currently, even after a mapped_device has been removed from the hash,
it could be accessed through idr_find() using minor number.
That could cause a race and NULL pointer reference below:
  CPU0                          CPU1
  ------------------------------------------------------------------
  dev_remove(param)
    down_write(_hash_lock)
    dm_lock_for_deletion(md)
      spin_lock(_minor_lock)
      set_bit(DMF_DELETING)
      spin_unlock(_minor_lock)
    __hash_remove(hc)
    up_write(_hash_lock)
                                dev_status(param)
                                  md = find_device(param)
                                         down_read(_hash_lock)
                                         __find_device_hash_cell(param)
                                           dm_get_md(param->dev)
                                             md = dm_find_md(dev)
                                                    spin_lock(_minor_lock)
                                                    md = idr_find(MINOR(dev))
                                                    spin_unlock(_minor_lock)
    dm_put(md)
      free_dev(md)
                                             dm_get(md)
                                         up_read(_hash_lock)
                                  __dev_status(md, param)
                                  dm_put(md)

This patch fixes such problems.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:54 +01:00
Arnd Bergmann
6e9624b8ca block: push down BKL into .open and .release
The open and release block_device_operations are currently
called with the BKL held. In order to change that, we must
first make sure that all drivers that currently rely
on this have no regressions.

This blindly pushes the BKL into all .open and .release
operations for all block drivers to prepare for the
next step. The drivers can subsequently replace the BKL
with their own locks or remove it completely when it can
be shown that it is not needed.

The functions blkdev_get and blkdev_put are the only
remaining users of the big kernel lock in the block
layer, besides a few uses in the ioctl code, none
of which need to serialize with blkdev_{get,put}.

Most of these two functions is also under the protection
of bdev->bd_mutex, including the actual calls to
->open and ->release, and the common code does not
access any global data structures that need the BKL.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:25:34 +02:00
FUJITA Tomonori
00fff26539 block: remove q->prepare_flush_fn completely
This removes q->prepare_flush_fn completely (changes the
blk_queue_ordered API).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:24:15 +02:00
FUJITA Tomonori
144d6ed551 dm: stop using q->prepare_flush_fn
use REQ_FLUSH flag instead.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:24:14 +02:00
Christoph Hellwig
7b6d91daee block: unify flags for struct bio and struct request
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver.  There were two flags in the bio that were
missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:20:39 +02:00
Christoph Hellwig
33659ebbae block: remove wrappers for request type/flags
Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests.  This allows much easier grepping for different request
types instead of unwinding through macros.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:17:56 +02:00
Peter Rajnoha
3abf85b5b5 dm ioctl: introduce flag indicating uevent was generated
Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to
indicate that a uevent was actually generated.  This tells the userspace
caller that it may need to wait for the event to be processed.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-03-06 02:32:31 +00:00
Mikulas Patocka
a97f925a32 dm: free dm_io before bio_endio not after
Free the dm_io structure before calling bio_endio() instead of after it,
to ensure that the io_pool containing it is not referenced after it is
freed.

This partially fixes a problem described here
  https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html

thread 1:
bio_endio(bio, io_error);
/* scheduling happens */
					thread 2:
					close the device
					remove the device
thread 1:
free_io(md, io);

Thread 2, when removing the device, sees non-empty md->io_pool (because the
io hasn't been freed by thread 1 yet) and may crash with BUG in mempool_free.
Thread 1 may also crash, when freeing into a nonexisting mempool.

To fix this we must make sure that bio_endio() is the last call and
the md structure is not accessed afterwards.

There is another bio_endio in process_barrier, but it is called from the thread
and the thread is destroyed prior to freeing the mempools, so this call is
not affected by the bug.

A similar bug exists with module unloads - the module may be unloaded
immediately after bio_endio - but that is more difficult to fix.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-03-06 02:32:29 +00:00
Kiyoshi Ueda
ecdb2e257a dm table: remove dm_get from dm_table_get_md
Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could
be called from presuspend/postsuspend, which are called while
mapped_device is in DMF_FREEING state, where dm_get() is not allowed.

Justification for that is the lifetime of both objects: As far as the
current dm design/implementation, mapped_device is never freed while
targets are doing something, because dm core waits for targets to become
quiet in dm_put() using presuspend/postsuspend.  So targets should be
able to touch mapped_device without holding reference count of the
mapped_device, and we should allow targets to touch mapped_device even
if it is in DMF_FREEING state.

Backgrounds:
I'm trying to remove the multipath internal queue, since dm core now has
a generic queue for request-based dm.  In the patch-set, the multipath
target wants to request dm core to start/stop queue.  One of such
start/stop requests can happen during postsuspend() while the target
waits for pg-init to complete, because the target stops queue when
starting pg-init and tries to restart it when completing pg-init.  Since
queue belongs to mapped_device, it involves calling dm_table_get_md()
and dm_put().  On the other hand, postsuspend() is called in dm_put()
for mapped_device which is in DMF_FREEING state, and that triggers
BUG_ON(DMF_FREEING) in the 2nd dm_put().

I had tried to solve this problem by changing only multipath not to
touch mapped_device which is in DMF_FREEING state, but I couldn't and I
came up with a question why we need dm_get() in dm_table_get_md().

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-03-06 02:29:52 +00:00
Kiyoshi Ueda
9eef87da2a dm mpath: fix stall when requeueing io
This patch fixes the problem that system may stall if target's ->map_rq
returns DM_MAPIO_REQUEUE in map_request().
E.g. stall happens on 1 CPU box when a dm-mpath device with queue_if_no_path
     bounces between all-paths-down and paths-up on I/O load.

When target's ->map_rq returns DM_MAPIO_REQUEUE, map_request() requeues
the request and returns to dm_request_fn().  Then, dm_request_fn()
doesn't exit the I/O dispatching loop and continues processing
the requeued request again.
This map and requeue loop can be done with interrupt disabled,
so 1 CPU system can be stalled if this situation happens.

For example, commands below can stall my 1 CPU box within 1 minute or so:
  # dmsetup table mp
  mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1
  # while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done &
  # while true; do \
  > dmsetup message mp 0 "fail_path 8:144" \
  > dmsetup suspend --noflush mp \
  > dmsetup resume mp \
  > dmsetup message mp 0 "reinstate_path 8:144" \
  > done

To fix the problem above, this patch changes dm_request_fn() to exit
the I/O dispatching loop once if a request is requeued in map_request().

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:43:01 +00:00
Kiyoshi Ueda
64dbce580d dm: export suspended state to targets
This patch adds the exported dm_suspended() function so that targets
can check whether or not they are suspended.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:27 +00:00
Kiyoshi Ueda
4f186f8bbf dm: rename dm_suspended to dm_suspended_md
This patch renames dm_suspended() to dm_suspended_md() and
keeps it internal to dm.
No functional change.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:26 +00:00
Kiyoshi Ueda
4d4471cb5c dm: swap target postsuspend call and setting suspended flag
This patch moves DMF_SUSPENDED flag set before postsuspend.
No one should care about the ordering, because the flag set and
the postsuspend are protected by a single lock, md->suspend_lock,
and all strict flag-checkers take the lock.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:26 +00:00
Jun'ichi Nomura
6db4ccd635 dm: trace request based remapping
This patch adds a remapping trace to request-based dm.
BIO-based dm already has the equivalent tracepoint.

For example, under this dm stack (linear LV on multipath):
  # dmsetup ls --tree -o ascii
  vg-lv0 (253:1)
   `-mpath0 (253:0)
      |- (8:160)
      |- (66:80)
      |- (65:176)
      `- (65:160)

Trace of 'dd of=/dev/vg/lv0 bs=128k count=1 oflag=direct' looks like this:

without the patch:
  dd-6674  [000]   539.727384: block_bio_queue: 253,1 WS 0 + 256 [dd]
  dd-6674  [000]   539.727392: block_remap: 253,0 WS 384 + 256 <- (253,1) 0
  dd-6674  [000]   539.727394: block_bio_queue: 253,0 WS 384 + 256 [dd]
  dd-6674  [000]   539.727405: block_getrq: 253,0 WS 384 + 256 [dd]
  dd-6674  [000]   539.727409: block_plug: [dd]
  dd-6674  [000]   539.727410: block_rq_insert: 253,0 W 0 () 384 + 256 [dd]
  dd-6674  [000]   539.727416: block_rq_issue: 253,0 W 0 () 384 + 256 [dd]
  dd-6674  [000]   539.727426: block_rq_insert: 65,176 W 0 () 384 + 256 [dd]
  dd-6674  [000]   539.727427: block_rq_issue: 65,176 W 0 () 384 + 256 [dd]
  ...

and with the patch: (the line with '**' is the trace added by this patch)
  dd-6617  [002]   162.914301: block_bio_queue: 253,1 WS 0 + 256 [dd]
  dd-6617  [002]   162.914314: block_remap: 253,0 WS 384 + 256 <- (253,1) 0
  dd-6617  [002]   162.914316: block_bio_queue: 253,0 WS 384 + 256 [dd]
  dd-6617  [002]   162.914331: block_getrq: 253,0 WS 384 + 256 [dd]
  dd-6617  [002]   162.914335: block_plug: [dd]
  dd-6617  [002]   162.914337: block_rq_insert: 253,0 W 0 () 384 + 256 [dd]
  dd-6617  [002]   162.914347: block_rq_issue: 253,0 W 0 () 384 + 256 [dd]
**dd-6617  [002]   162.914356: block_rq_remap: 65,176 W 384 + 256 <- (253,0) 384
  dd-6617  [002]   162.914358: block_rq_insert: 65,176 W 0 () 384 + 256 [dd]
  dd-6617  [002]   162.914359: block_rq_issue: 65,176 W 0 () 384 + 256 [dd]
  ...

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:25 +00:00
Alasdair G Kergon
042d2a9bcd dm: keep old table until after resume succeeded
When swapping a new table into place, retain the old table until
its replacement is in place.

An old check for an empty table is removed because this is enforced
in populate_table().

__unbind() becomes redundant when followed by __bind().

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:24 +00:00
Alasdair G Kergon
a794015597 dm: bind new table before destroying old
When replacing a mapped device's table during a 'resume', delay the
destruction of the old table until the new one is successfully in place.

This will make it easier for a later patch to transfer internal state
information from the old table to the new one (something we do not currently
support) while giving us more options for reversion if a later part
of the operation fails.

Devices are always in the suspended state during dm_swap_table().
This patch reinforces the requirement that all I/O must have been
flushed from the table targets while in this state (including any in
workqueues).  In the case of 'noflush' suspending, unprocessed
I/O should have been 'pushed back' to the dm core prior to this point,
for resubmission after the new table is in place.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:23 +00:00
Mike Anderson
432a212c0d dm: add dm_deleting_md function
Add dm_deleting_md to check whether or not a given mapped
device is currently being deleted.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:20 +00:00
Alasdair G Kergon
7c6664114b dm: rename dm_get_table to dm_get_live_table
Rename dm_get_table to dm_get_live_table.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:19 +00:00
Kiyoshi Ueda
d0bcb87865 dm: add request based barrier support
This patch adds barrier support for request-based dm.

CORE DESIGN

The design is basically same as bio-based dm, which emulates barrier
by mapping empty barrier bios before/after a barrier I/O.
But request-based dm has been using struct request_queue for I/O
queueing, so the block-layer's barrier mechanism can be used.

o Summary of the block-layer's behavior (which is depended by dm-core)
  Request-based dm uses QUEUE_ORDERED_DRAIN_FLUSH ordered mode for
  I/O barrier.  It means that when an I/O requiring barrier is found
  in the request_queue, the block-layer makes pre-flush request and
  post-flush request just before and just after the I/O respectively.

  After the ordered sequence starts, the block-layer waits for all
  in-flight I/Os to complete, then gives drivers the pre-flush request,
  the barrier I/O and the post-flush request one by one.
  It means that the request_queue is stopped automatically by
  the block-layer until drivers complete each sequence.

o dm-core
  For the barrier I/O, treats it as a normal I/O, so no additional
  code is needed.

  For the pre/post-flush request, flushes caches by the followings:
    1. Make the number of empty barrier requests required by target's
       num_flush_requests, and map them (dm_rq_barrier()).
    2. Waits for the mapped barriers to complete (dm_rq_barrier()).
       If error has occurred, save the error value to md->barrier_error
       (dm_end_request()).
       (*) Basically, the first reported error is taken.
           But -EOPNOTSUPP supersedes any error and DM_ENDIO_REQUEUE
           follows.
    3. Requeue the pre/post-flush request if the error value is
       DM_ENDIO_REQUEUE.  Otherwise, completes with the error value
       (dm_rq_barrier_work()).
  The pre/post-flush work above is done in the kernel thread (kdmflush)
  context, since memory allocation which might sleep is needed in
  dm_rq_barrier() but sleep is not allowed in dm_request_fn(), which is
  an irq-disabled context.
  Also, clones of the pre/post-flush request share an original, so
  such clones can't be completed using the softirq context.
  Instead, complete them in the context of underlying device drivers.
  It should be safe since there is no I/O dispatching during
  the completion of such clones.

  For suspend, the workqueue of kdmflush needs to be flushed after
  the request_queue has been stopped.  Otherwise, the next flush work
  can be kicked even after the suspend completes.

TARGET INTERFACE

No new interface is added.
Just use the existing num_flush_requests in struct target_type
as same as bio-based dm.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:18 +00:00
Kiyoshi Ueda
980691e5f3 dm: move dm_end_request
This patch moves dm_end_request() to make the next patch more readable.
No functional change.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:17 +00:00
Kiyoshi Ueda
11a68244e1 dm: refactor request based completion functions
This patch factors out the clone completion code, dm_done(),
from dm_softirq_done() in preparation for a subsequent patch.
No functional change.

dm_done() will be used in barrier completion, which can't use and
doesn't need softirq.  The softirq_done callback needs to get a clone
from an original request but it can't in the case of barrier, where
an original request is shared by multiple clones.  On the other hand,
the completion of barrier clones doesn't involve re-submitting requests,
which was the primary reason of the need for softirq.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:17 +00:00
Kiyoshi Ueda
b4324feeae dm: use md pending for in flight IO counting
This patch changes the counter for the number of in_flight I/Os
to md->pending from q->in_flight in preparation for a later patch.
No functional change.

Request-based dm used q->in_flight to count the number of in-flight
clones assuming the counter is always incremented for an in-flight
original request and original:clone is 1:1 relationship.
However, it this no longer true for barrier requests.
So use md->pending to count the number of in-flight clones.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:16 +00:00
Kiyoshi Ueda
9f518b27cf dm: simplify request based suspend
The semantics of bio-based dm were changed recently in the case of
suspend with "--nolockfs" but without "--noflush".
Before 2.6.30, I/Os submitted before the suspend invocation were always
flushed.  From 2.6.30 onwards, I/Os submitted before the suspend
invocation might not be flushed.  (For details, see
http://marc.info/?t=123994433400003&r=1&w=2)

This patch brings the behaviour of request-based dm into line with
bio-based dm, simplifying the code and preparing for a subsequent patch
that will wait for all in_flight I/Os to complete without stopping
request_queue and use dm_wait_for_completion() for it.

This change in semantics simplifies the suspend code as follows:
  o Suspend is implemented as stopping request_queue
    in request-based dm, and all I/Os are queued in the request_queue
    even after suspend is invoked.
  o In the old semantics, we had to track whether I/Os were
    queued before or after the suspend invocation, so a special
    barrier-like request called 'suspend marker' was introduced.
  o With the new semantics, we don't need to flush any I/O
    so we can remove the marker and the code related to the marker
    handling and I/O flushing.

After removing this codes, the suspend sequence is now:
  1. Flush all I/Os by lock_fs() if needed.
  2. Stop dispatching any I/O by stopping the request_queue.
  3. Wait for all in-flight I/Os to be completed or requeued.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:16 +00:00
Kiyoshi Ueda
6facdaff22 dm: abstract clone_rq
This patch factors out the request cloning code in dm_prep_fn()
as clone_rq().  No functional change.

This patch is a preparation for a later patch in this series which needs to
make clones from an original barrier request.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:15 +00:00
Kiyoshi Ueda
0888564393 dm: pass gfp_mask to alloc_rq_tio
This patch adds the gfp_mask argument to alloc_rq_tio().
No functional change.

This patch is a preparation for a later patch in this series which needs to
allocate tio (for barrier I/O) with different allocation flag (GFP_NOIO) from
the one in the normal I/O code path.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:15 +00:00
Kiyoshi Ueda
598de40947 dm: use clone in map_request function
This patch changes the argument of map_request() to clone request
from original request.  No functional change.

This patch is a preparation for PATCH 9, which needs to use
map_request() for clones sharing an original barrier request.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:14 +00:00
Kiyoshi Ueda
90abb8c4ce dm: abstract dm_in_flight function
This patch adds md_in_flight() to get the number of in_flight I/Os.
No functional change.

This patch is a preparation for a later patch in this series, which
changes I/O counter to md->pending from q->in_flight in request-based dm.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:13 +00:00
Mikulas Patocka
952b355760 dm io: use slab for struct io
Allocate "struct io" from a slab.

This patch changes dm-io, so that "struct io" is allocated from a slab cache.
It used to be allocated with kmalloc. Allocating from a slab will be needed
for the next patch, because it requires a special alignment of "struct io"
and kmalloc cannot meet this alignment.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:51:57 +00:00
Kiyoshi Ueda
f88fb98118 dm: dec_pending needs locking to save error value
Multiple instances of dec_pending() can run concurrently so a lock is
needed when it saves the first error code.

I have never experienced actual problem without locking and just found
this during code inspection while implementing the barrier support
patch for request-based dm.

This patch adds the locking.
I've done compile, boot and basic I/O testings.

Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16 23:18:15 +01:00
Zdenek Kabelac
03022c54b9 dm: add missing del_gendisk to alloc_dev error path
Add missing del_gendisk() to error path when creation of workqueue fails.
Otherwice there is a resource leak and following warning is shown:

WARNING: at fs/sysfs/dir.c:487 sysfs_add_one+0xc5/0x160()
sysfs: cannot create duplicate filename '/devices/virtual/block/dm-0'

Cc: stable@kernel.org
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16 23:18:15 +01:00
Nikanth Karthikesan
316d315bff block: Seperate read and write statistics of in_flight requests v2
Commit a9327cac44 added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.

But  Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.

So this was reverted by commit 0f78ab9899

The problem was in part_round_stats_single(), I missed the following:
        if (now == part->stamp)
                return;

-       if (part->in_flight) {
+       if (part_in_flight(part)) {
                __part_stat_add(cpu, part, time_in_queue,
                                part_in_flight(part) * (now - part->stamp));
                __part_stat_add(cpu, part, io_ticks, (now - part->stamp));

With this chunk included, the reported regression gets fixed.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-06 20:16:55 +02:00
Jens Axboe
0f78ab9899 Revert "Seperate read and write statistics of in_flight requests"
This reverts commit a9327cac44.

Corrado Zoccolo <czoccolo@gmail.com> reports:

"with 2.6.32-rc1 I started getting the following strange output from
"iostat -kx 2":
Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10,70    0,00    3,16   15,75    0,00   70,38

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda              18,22     0,00    0,67    0,01    14,77     0,02
43,94     0,01   10,53 39043915,03 2629219,87
sdb              60,89     9,68   50,79    3,04  1724,43    50,52
65,95     0,70   13,06 488437,47 2629219,87

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,72    0,00    0,74    0,00    0,00   96,53

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6,68    0,00    0,99    0,00    0,00   92,33

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4,40    0,00    0,73    1,47    0,00   93,40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     4,00    0,00    3,00     0,00    28,00
18,67     0,06   19,50 333,33 100,00

Global values for service time and utilization are garbage. For
interval values, utilization is always 100%, and service time is
higher than normal.

I bisected it down to:
[a9327cac44] Seperate read and write
statistics of in_flight requests
and verified that reverting just that commit indeed solves the issue
on 2.6.32-rc1."

So until this is debugged, revert the bad commit.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-04 21:04:38 +02:00
Alexey Dobriyan
83d5cde47d const: make block_device_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:25 -07:00
Nikanth Karthikesan
a9327cac44 Seperate read and write statistics of in_flight requests
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:52 +02:00
Jens Axboe
1f98a13f62 bio: first step in sanitizing the bio->bi_rw flag testing
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
Kiyoshi Ueda
a77e28c7e1 dm multipath: fix oops when request based io fails when no paths
The patch posted at http://marc.info/?l=dm-devel&m=124539787228784&w=2
which was merged into cec47e3d4a ("dm:
prepare for request based option") introduced a regression in
request-based dm.

If map_request() calls dm_kill_unmapped_request() to complete a cloned
bio without dispatching it, clone->bio is still set when
dm_end_request() is called and the BUG_ON(clone->bio) is incorrect.

The patch fixes this bug by freeing bio in dm_end_request() if the clone
has bio.  I've redone my tests to cover all I/O paths and confirmed
there's no other regression.

Here is the oops I hit in request-based dm when I do I/O to a multipath
device which doesn't have any active path nor queue_if_no_path setting:

------------[ cut here ]------------
kernel BUG at /root/2.6.31-rc4.rqdm/drivers/md/dm.c:828!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 1
Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac sg sr_mod e1000e button cdrom serio_raw rtc_cmos rtc_core rtc_lib piix lpfc scsi_transport_fc ata_piix libata megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 7, comm: ksoftirqd/1 Not tainted 2.6.31-rc4.rqdm #1 Express5800/120Lj [N8100-1417]
RIP: 0010:[<ffffffffa023629d>]  [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod]
RSP: 0018:ffff8800280a1f08  EFLAGS: 00010282
RAX: ffffffffa02544e0 RBX: ffff8802aa1111d0 RCX: ffff8802aa1111e0
RDX: ffff8802ab913e70 RSI: 0000000000000000 RDI: ffff8802ab913e70
RBP: ffff8800280a1f28 R08: ffffc90005457040 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffb
R13: ffff8802ab913e88 R14: ffff8802ab9c1438 R15: 0000000000000100
FS:  0000000000000000(0000) GS:ffff88002809e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003d54a98640 CR3: 000000029f0a1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksoftirqd/1 (pid: 7, threadinfo ffff8802ae50e000, task ffff8802ae4f8040)
Stack:
 ffff8800280a1f38 0000000000000020 ffffffff814f30a0 0000000000000004
<0> ffff8800280a1f58 ffffffff8116b245 ffff8800280a1f38 ffff8800280a1f38
<0> ffff8800280a1f58 0000000000000001 ffff8800280a1fa8 ffffffff810477bc
Call Trace:
 <IRQ>
 [<ffffffff8116b245>] blk_done_softirq+0x75/0x90
 [<ffffffff810477bc>] __do_softirq+0xcc/0x210
 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110
 [<ffffffff8100ce7c>] call_softirq+0x1c/0x50
 <EOI>
 [<ffffffff8100e785>] do_softirq+0x65/0xa0
 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110
 [<ffffffff810471e0>] ksoftirqd+0x70/0x110
 [<ffffffff81059559>] kthread+0x99/0xb0
 [<ffffffff8100cd7a>] child_rip+0xa/0x20
 [<ffffffff8100c73c>] ? restore_args+0x0/0x30
 [<ffffffff810594c0>] ? kthread+0x0/0xb0
 [<ffffffff8100cd70>] ? child_rip+0x0/0x20
Code: 44 89 e6 48 89 df e8 23 fb f2 e0 be 01 00 00 00 4c 89 f7 e8 f6 fd ff ff 5b 41 5c 41 5d 41 5e c9 c3 4c 89 ef e8 85 fe ff ff eb ed <0f> 0b eb fe 41 8b 85 dc 00 00 00 48 83 bb 10 01 00 00 00 89 83
RIP  [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod]
 RSP <ffff8800280a1f08>
---[ end trace 16af0a1d8542da55 ]---

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:16 +01:00
Mike Snitzer
a732c207d1 dm: remove queue next_ordered workaround for barriers
This patch removes DM's bio-based vs request-based conditional setting
of next_ordered.  For bio-based DM the next_ordered check is no longer a
concern (as that check is now in the __make_request path).  For
request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.

bio-based DM was changed to work-around the previously misplaced
next_ordered check with this commit:
99360b4c18

request-based DM does not yet support barriers but reacted to the above
bio-based DM change with this commit:
5d67aa2366

The above changes are no longer needed given Neil Brown's recent fix to
put the next_ordered check in the __make_request path:
db64f680ba

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:40 +01:00
Martin K. Petersen
7878cba9f0 block: Create bip slabs with embedded integrity vectors
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs.  Each bip slab
has an embedded bio_vec array at the end.  This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version.  Only the largest bip slab is backed by a mempool.  The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
2009-07-01 10:56:25 +02:00
Kiyoshi Ueda
523d9297d4 dm: disable interrupt when taking map_lock
This patch disables interrupt when taking map_lock to avoid
lockdep warnings in request-based dm.

request-based dm takes map_lock after taking queue_lock with
disabling interrupt:
  spin_lock_irqsave(queue_lock)
  q->request_fn() == dm_request_fn()
    => dm_get_table()
         => read_lock(map_lock)
while queue_lock could be (but isn't) taken in interrupt context.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:37 +01:00
Kiyoshi Ueda
5d67aa2366 dm: do not set QUEUE_ORDERED_DRAIN if request based
Request-based dm doesn't have barrier support yet.
So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
Since the device type is decided at the first table loading time,
the flag set is deferred until then.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
Kiyoshi Ueda
e6ee8c0b76 dm: enable request based option
This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
  some target drivers which are more fitting to bio-based dm.
  Also, there are other bio-based devices in the kernel
  (e.g. md, loop).
  Since bio-based device can't receive struct request,
  there are some limitations on device stacking between
  bio-based and request-based.

                     type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     --                OK

  The device type is recognized by the queue flag in the kernel,
  so dm follows that.

o The type of a dm device is decided at the first table binding time.
  Once the type of a dm device is decided, the type can't be changed.

o Mempool allocations are deferred to at the table loading time, since
  mempools for request-based dm are different from those for bio-based
  dm and needed mempool type is fixed by the type of table.

o Currently, request-based dm supports only tables that have a single
  target.  To support multiple targets, we need to support request
  splitting or prevent bio/request from spanning multiple targets.
  The former needs lots of changes in the block layer, and the latter
  needs that all target drivers support merge() function.
  Both will take a time.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
Kiyoshi Ueda
cec47e3d4a dm: prepare for request based option
This patch adds core functions for request-based dm.

When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
    make_request_fn: dm_make_request()
    pref_fn:         dm_prep_fn()
    request_fn:      dm_request_fn()
    softirq_done_fn: dm_softirq_done()
    lld_busy_fn:     dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).

Below is a brief summary of how request-based dm behaves, including:
  - making request from bio
  - cloning, mapping and dispatching request
  - completing request and bio
  - suspending md
  - resuming md

  bio to request
  ==============
  md->queue->make_request_fn() (dm_make_request()) calls __make_request()
  for a bio submitted to the md.
  Then, the bio is kept in the queue as a new request or merged into
  another request in the queue if possible.

  Cloning and Mapping
  ===================
  Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
  when requests are dispatched after they are sorted by the I/O scheduler.

  dm_request_fn() checks busy state of underlying devices using
  target's busy() function and stops dispatching requests to keep them
  on the dm device's queue if busy.
  It helps better I/O merging, since no merge is done for a request
  once it is dispatched to underlying devices.

  Actual cloning and mapping are done in dm_prep_fn() and map_request()
  called from dm_request_fn().
  dm_prep_fn() clones not only request but also bios of the request
  so that dm can hold bio completion in error cases and prevent
  the bio submitter from noticing the error.
  (See the "Completion" section below for details.)

  After the cloning, the clone is mapped by target's map_rq() function
    and inserted to underlying device's queue using
    blk_insert_cloned_request().

  Completion
  ==========
  Request completion can be hooked by rq->end_io(), but then, all bios
  in the request will have been completed even error cases, and the bio
  submitter will have noticed the error.
  To prevent the bio completion in error cases, request-based dm clones
  both bio and request and hooks both bio->bi_end_io() and rq->end_io():
      bio->bi_end_io(): end_clone_bio()
      rq->end_io():     end_clone_request()

  Summary of the request completion flow is below:
  blk_end_request() for a clone request
    => blk_update_request()
       => bio->bi_end_io() == end_clone_bio() for each clone bio
          => Free the clone bio
          => Success: Complete the original bio (blk_update_request())
             Error:   Don't complete the original bio
    => blk_finish_request()
       => rq->end_io() == end_clone_request()
          => blk_complete_request()
             => dm_softirq_done()
                => Free the clone request
                => Success: Complete the original request (blk_end_request())
                   Error:   Requeue the original request

  end_clone_bio() completes the original request on the size of
  the original bio in successful cases.
  Even if all bios in the original request are completed by that
  completion, the original request must not be completed yet to keep
  the ordering of request completion for the stacking.
  So end_clone_bio() uses blk_update_request() instead of
  blk_end_request().
  In error cases, end_clone_bio() doesn't complete the original bio.
  It just frees the cloned bio and gives over the error handling to
  end_clone_request().

  end_clone_request(), which is called with queue lock held, completes
  the clone request and the original request in a softirq context
  (dm_softirq_done()), which has no queue lock, to avoid a deadlock
  issue on submission of another request during the completion:
      - The submitted request may be mapped to the same device
      - Request submission requires queue lock, but the queue lock
        has been held by itself and it doesn't know that

  The clone request has no clone bio when dm_softirq_done() is called.
  So target drivers can't resubmit it again even error cases.
  Instead, they can ask dm core for requeueing and remapping
  the original request in that cases.

  suspend
  =======
  Request-based dm uses stopping md->queue as suspend of the md.
  For noflush suspend, just stops md->queue.

  For flush suspend, inserts a marker request to the tail of md->queue.
  And dispatches all requests in md->queue until the marker comes to
  the front of md->queue.  Then, stops dispatching request and waits
  for the all dispatched requests to complete.
  After that, completes the marker request, stops md->queue and
  wake up the waiter on the suspend queue, md->wait.

  resume
  ======
  Starts md->queue.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
Mike Snitzer
754c5fc7eb dm: calculate queue limits during resume not load
Currently, device-mapper maintains a separate instance of 'struct
queue_limits' for each table of each device.  When the configuration of
a device is to be changed, first its table is loaded and this structure
is populated, then the device is 'resumed' and the calculated
queue_limits are applied.

This places restrictions on how userspace may process related devices,
where it is often advantageous to 'load' tables for several devices
at once before 'resuming' them together.  As the new queue_limits
only take effect after the 'resume', if they are changing and one
device uses another, the latter must be 'resumed' before the former
may be 'loaded'.

This patch moves the calculation of these queue_limits out of
the 'load' operation into 'resume'.  Since we are no longer
pre-calculating this struct, we no longer need to maintain copies
within our dm structs.

dm_set_device_limits() now passes the 'start' of the device's
data area (aka pe_start) as the 'offset' to blk_stack_limits().

init_valid_queue_limits() is replaced by blk_set_default_limits().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: martin.petersen@oracle.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:34 +01:00
Milan Broz
60935eb21d dm ioctl: support cookies for udev
Add support for passing a 32 bit "cookie" into the kernel with the
DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls.  The (unsigned)
value of this cookie is returned to userspace alongside the uevents
issued by these ioctls in the variable DM_COOKIE.

This means the userspace process issuing these ioctls can be notified
by udev after udev has completed any actions triggered.

To minimise the interface extension, we pass the cookie into the
kernel in the event_nr field which is otherwise unused when calling
these ioctls.  Incrementing the version number allows userspace to
determine in advance whether or not the kernel supports the cookie.
If the kernel does support this but userspace does not, there should
be no impact as the new variable will just get ignored.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:30 +01:00
Mikulas Patocka
52b1fd5a27 dm: send empty barriers to targets in dm_flush
Pass empty barrier flushes to the targets in dm_flush().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:21 +01:00
Alasdair G Kergon
9015df24a8 dm: initialise tio in alloc_tio
Move repeated dm_target_io initialisation inside alloc_tio().

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:21 +01:00
Mikulas Patocka
f9ab94cee3 dm: introduce num_flush_requests
Introduce num_flush_requests for a target to set to say how many flush
instructions (empty barriers) it wants to receive.  These are sent by
__clone_and_map_empty_barrier with map_info->flush_request going from 0
to (num_flush_requests - 1).

Old targets without flush support won't receive any flush requests.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:20 +01:00
Mikulas Patocka
27eaa14975 dm: remove check that prevents mapping empty bios
Remove the check that the size of the cloned bio is not zero because a
subsequent patch needs to send zero-sized barriers down this path.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:20 +01:00
Mikulas Patocka
fdb9572b73 dm: remove EOPNOTSUPP for barriers
If the underlying device doesn't support barriers and dm receives a
barrier, it waits until all requests on that device drain so it no
longer needs to report -EOPNOTSUPP to the caller.

This patch deals with the confusing situation when moving a volume from
one physical device to another triggers an EOPNOTSUPP on a volume that
didn't report it before.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:19 +01:00
Mikulas Patocka
5aa2781d96 dm: store only first barrier error
With the following patches, more than one error can occur during
processing.  Change md->barrier_error so that only the first one is
recorded and returned to the caller.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:18 +01:00
Mikulas Patocka
2761e95fe4 dm: process requeue in dm_wq_work
If barrier request was returned with DM_ENDIO_REQUEUE,
requeue it in dm_wq_work instead of dec_pending.

This allows us to correctly handle a situation when some targets
are asking for a requeue and other targets signal an error.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:18 +01:00
Mikulas Patocka
531fe96364 dm: make dm_flush return void
Make dm_flush return void.

The first error during flush is stored in md->barrier_error instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:17 +01:00
Mikulas Patocka
32a926da5a dm: always hold bdev reference
Fix a potential deadlock when creating multiple snapshots by holding a
reference to struct block_device for the whole lifecycle of every dm
device instead of obtaining it independently at each point it is needed.

bdget_disk() was called while the device was being suspended, in
dm_suspend().  However there could be other devices already suspended,
for example when creating additional snapshots of a device. bdget_disk()
can wait for IO and allocate memory resulting in waiting for the
already-suspended device - deadlock.

This patch changes the code so that it gets the reference to struct
block_device when struct mapped_device is allocated and initialized in
alloc_dev() where it is always OK to allocate memory or wait for I/O.
It drops the reference when it is destroyed in free_dev().  Thus there
is no call to bdget_disk() while any device is suspended.

Previously unlock_fs() was called only if bdev was held.  Now it is
called unconditionally, but the superfluous calls are harmless because
it returns immediately if the filesystem was not previously frozen.

This patch also now allows the device size to be changed in a
noflush suspend because the bdev is held.  This has no adverse effect.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:17 +01:00
Mikulas Patocka
db8fef4fab dm: rename suspended_bdev to bdev
Rename suspended_bdev to bdev.

This patch doesn't change any functionality, just renames the variable.
In the next patch, the variable will be used even for non-suspended device.

(Pre-requisite for the per-target barrier support patches.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:15 +01:00
Mikulas Patocka
8cbeb67ad5 dm: avoid unsupported spanning of md stripe boundaries
A bio that has two or more vector entries, size less than or equal to
page size, that crosses a stripe boundary of an underlying md device is
accepted by device mapper (it conforms to all its limits) but not by the
underlying device.

The fix is: If device mapper selects the one-page maximum request size,
it also needs to set its own q->merge_bvec_fn to reject any bios with
multiple vector entries that span more pages.

The problem was discovered in the following scenario:
  * MD - RAID-0
  * LV on the top of it (raid1, snapshot or striped with chunk
size/stripe larger than RAID-0 stripe)
  * one of the logical volumes is exported to xen domU
  * inside xen domU it is partitioned, the key point is that the partition
must be unaligned on page boundary (fdisk normally aligns the partition to
63 sectors which will trigger it)
  * install the system on the partitioned disk in domU
This causes I/O failures in dom0.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:14 +01:00
Milan Broz
4d89b7b4e4 dm: sysfs skip output when device is being destroyed
Do not process sysfs attributes when device is being destroyed.

Otherwise code can cause
  BUG_ON(test_bit(DMF_FREEING, &md->flags));
in dm_put() call.

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:11 +01:00
Li Zefan
e212d6f250 block: remove some includings of blktrace_api.h
When porting blktrace to tracepoints, we changed to trace/block.h
for trace prober declarations.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 11:19:36 +02:00
Li Zefan
55782138e4 tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
  to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:34:23 -04:00
Ingo Molnar
44347d947f Merge branch 'linus' into tracing/core
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
              on a handful of tracing fixes present in .30-rc5-almost.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-07 11:17:34 +02:00
Alan D. Brunelle
22a7c31a96 blktrace: from-sector redundant in trace_block_remap
Remove redundant from-sector parameter: it's /always/ the bio's sector
passed in.

[ Impact: cleanup ]

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49FF517C.7000503@hp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 14:13:01 +02:00
Christoph Hellwig
8f3d8ba20e block: move bio list helpers into bio.h
It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:09 +02:00
Mikulas Patocka
af7e466a1a dm: implement basic barrier support
Barriers are submitted to a worker thread that issues them in-order.

The thread is modified so that when it sees a barrier request it waits
for all pending IO before the request then submits the barrier and
waits for it.  (We must wait, otherwise it could be intermixed with
following requests.)

Errors from the barrier request are recorded in a per-device barrier_error
variable. There may be only one barrier request in progress at once.

For now, the barrier request is converted to a non-barrier request when
sending it to the underlying device.

This patch guarantees correct barrier behavior if the underlying device
doesn't perform write-back caching. The same requirement existed before
barriers were supported in dm.

Bottom layer barrier support (sending barriers by target drivers) and
handling devices with write-back caches will be done in further patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:16 +01:00
Mikulas Patocka
92c639021c dm: remove dm_request loop
Remove queue_io return value and a loop in dm_request.

IO may be submitted to a worker thread with queue_io().  queue_io() sets
DMF_QUEUE_IO_TO_THREAD so that all further IO is queued for the thread. When
the thread finishes its work, it clears DMF_QUEUE_IO_TO_THREAD and from this
point on, requests are submitted from dm_request again. This will be used
for processing barriers.

Remove the loop in dm_request. queue_io() can submit I/Os to the worker thread
even if DMF_QUEUE_IO_TO_THREAD was not set.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:15 +01:00
Mikulas Patocka
3b00b2036f dm: rework queueing and suspension
Rework shutting down on suspend and document the associated rules.

Drop write lock in __split_and_process_bio to allow more processing
concurrency.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:15 +01:00
Alasdair G Kergon
54d9a1b451 dm: simplify dm_request loop
Refactor the code in dm_request().

Require the new DMF_BLOCK_FOR_SUSPEND flag on readahead bios we will
discard so we don't drop such bios while processing a barrier.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:14 +01:00
Alasdair G Kergon
1eb787ec18 dm: split DMF_BLOCK_IO flag into two
Split the DMF_BLOCK_IO flag into two.

DMF_BLOCK_IO_FOR_SUSPEND is set when I/O must be blocked while suspending a
device.  DMF_QUEUE_IO_TO_THREAD is set when I/O must be queued to a
worker thread for later processing.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:14 +01:00
Alasdair G Kergon
df12ee9963 dm: rearrange dm_wq_work
Refactor dm_wq_work() to make later patch more readable.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:13 +01:00
Mikulas Patocka
692d0eb9e0 dm: remove limited barrier support
Prepare for full barrier implementation: first remove the restricted support.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:13 +01:00
Martin K. Petersen
9c47008d13 dm: add integrity support
This patch provides support for data integrity passthrough in the device
mapper.

 - If one or more component devices support integrity an integrity
   profile is preallocated for the DM device.

 - If all component devices have compatible profiles the DM device is
   flagged as capable.

 - Handle integrity metadata when splitting and cloning bios.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-09 00:27:12 +01:00
Mikulas Patocka
99360b4c18 dm: set queue ordered mode
Set queue ordered mode.  It doesn't really matter what we set here
because we don't ever put any requests on the queue.  But we need to set
something other than QUEUE_ORDERED_NONE so that __generic_make_request
passes barrier requests to us.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:39 +01:00
Mikulas Patocka
b44ebeb017 dm: move wait queue declaration
Move wait queue declaration and unplug to dm_wait_for_completion.

The purpose is to minimize duplicate code in the further patches.

The patch reorders functions a little bit. It doesn't change any
functionality. For proper non-deadlock operation, add_wait_queue must
happen before set_current_state(interruptible) and before the test for
!atomic_read(&md->pending).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:39 +01:00
Mikulas Patocka
022c261100 dm: merge pushback and deferred bio lists
Merge pushback and deferred lists into one list - use deferred list
for both deferred and pushed-back bios.

This will be needed for proper support of barrier bios: it is impossible to
support ordering correctly with two lists because the requests on both lists
will be mixed up.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:39 +01:00
Mikulas Patocka
401600dfd3 dm: allow uninterruptible wait for pending io
Allow uninterruptible wait for pending IOs.

Add argument "interruptible" to dm_wait_for_completion that specifies
either interruptible or uninterruptible waiting.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:38 +01:00
Mikulas Patocka
ef2085870e dm: merge __flush_deferred_io into caller
Merge __flush_deferred_io() into the only caller, dm_wq_work().

There's no need to have a function that has only one caller.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:38 +01:00
Mikulas Patocka
f0b9a4502b dm: move bio_io_error into __split_and_process_bio
Move the bio_io_error() calls directly into __split_and_process_bio().

This avoids some code duplication in later patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:38 +01:00
Mikulas Patocka
8a53c28db4 dm: rename __split_bio
Rename __split_bio() to __split_and_process_bio() because it not only splits
the bio to serveral parts, but also submits them to target drivers.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:37 +01:00
Mikulas Patocka
53d5914f28 dm: remove unnecessary struct dm_wq_req
Remove struct dm_wq_req and move "work" directly into struct mapped_device.

In the revised implementation, the thread will do just one type of work
(processing the queue).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:37 +01:00
Mikulas Patocka
9a1fb46448 dm: remove unnecessary work queue context field
Remove the context field from struct dm_wq_req because we will no longer
need it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:36 +01:00
Mikulas Patocka
143773965b dm: remove unnecessary work queue type field
Remove "type" field from struct dm_wq_req because we no longer need it
to have more than one value.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:36 +01:00
Milan Broz
b35f8caa08 dm crypt: wait for endio to complete before destruction
The following oops has been reported when dm-crypt runs over a loop device.

...
[   70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000)
...
[   70.381058] Call Trace:
[   70.381058]  [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt]
[   70.381058]  [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt]
[   70.381058]  [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt]
[   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
[   70.381058]  [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod]
[   70.381058]  [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod]
[   70.381058]  [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod]
[   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
[   70.381058]  [<c02bad86>] ? loop_thread+0x380/0x3b7
[   70.381058]  [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165
[   70.381058]  [<c013754f>] ? autoremove_wake_function+0x0/0x33
[   70.381058]  [<c02baa06>] ? loop_thread+0x0/0x3b7

When a table is being replaced, it waits for I/O to complete
before destroying the mempool, but the endio function doesn't
call mempool_free() until after completing the bio.

Fix it by swapping the order of those two operations.

The same problem occurs in dm.c with md referenced after dec_pending.
Again, we swap the order.

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-03-16 17:44:36 +00:00
Milan Broz
784aae735d dm: add name and uuid to sysfs
Implement simple read-only sysfs entry for device-mapper block device.

This patch adds a simple sysfs directory named "dm" under block device
properties and implements
	- name attribute (string containing mapped device name)
	- uuid attribute (string containing UUID, or empty string if not set)

The kobject is embedded in mapped_device struct, so no additional
memory allocation is needed for initializing sysfs entry.

During the processing of sysfs attribute we need to lock mapped device
which is done by a new function dm_get_from_kobj, which returns the md
associated with kobject and increases the usage count.

Each 'show attribute' function is responsible for its own locking.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:12 +00:00
Mikulas Patocka
d58168763f dm table: rework reference counting
Rework table reference counting.

The existing code uses a reference counter. When the last reference is
dropped and the counter reaches zero, the table destructor is called.
Table reference counters are acquired/released from upcalls from other
kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
If the reference counter reaches zero in one of the upcalls, the table
destructor is called from almost random kernel code.

This leads to various problems:
* dm_any_congested being called under a spinlock, which calls the
  destructor, which calls some sleeping function.
* the destructor attempting to take a lock that is already taken by the
  same process.
* stale reference from some other kernel code keeps the table
  constructed, which keeps some devices open, even after successful
  return from "dmsetup remove". This can confuse lvm and prevent closing
  of underlying devices or reusing device minor numbers.

The patch changes reference counting so that the table destructor can be
called only at predetermined places.

The table has always exactly one reference from either mapped_device->map
or hash_cell->new_map. After this patch, this reference is not counted
in table->holders.  A pair of dm_create_table/dm_destroy_table functions
is used for table creation/destruction.

Temporary references from the other code increase table->holders. A pair
of dm_table_get/dm_table_put functions is used to manipulate it.

When the table is about to be destroyed, we wait for table->holders to
reach 0. Then, we call the table destructor.  We use active waiting with
msleep(1), because the situation happens rarely (to one user in 5 years)
and removing the device isn't performance-critical task: the user doesn't
care if it takes one tick more or not.

This way, the destructor is called only at specific points
(dm_table_destroy function) and the above problems associated with lazy
destruction can't happen.

Finally remove the temporary protection added to dm_any_congested().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:10 +00:00
Andi Kleen
ab4c142488 dm: support barriers on simple devices
Implement barrier support for single device DM devices

This patch implements barrier support in DM for the common case of dm linear
just remapping a single underlying device. In this case we can safely
pass the barrier through because there can be no reordering between
devices.

 NB. Any DM device might cease to support barriers if it gets
     reconfigured so code must continue to allow for a possible
     -EOPNOTSUPP on every barrier bio submitted.  - agk

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:09 +00:00
Kiyoshi Ueda
8fbf26ad5b dm request: add caches
This patch prepares some kmem_caches for request-based dm.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:06 +00:00
Mikulas Patocka
a1b51e9867 dm table: drop reference at unbind
Move one dm_table_put() so that the last reference in the thread
gets dropped in __unbind().

This is required for a following patch,
dm-table-rework-reference-counting.patch, which will change the logic in
such a way that table destructor is called only at specific points in
the code.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:04:53 +00:00
Jens Axboe
bb799ca020 bio: allow individual slabs in the bio_set
Instead of having a global bio slab cache, add a reference to one
in each bio_set that is created. This allows for personalized slabs
in each bio_set, so that they can have bios of different sizes.

This means we can personalize the bios we return. File systems may
want to embed the bio inside another structure, to avoid allocation
more items (and stuffing them in ->bi_private) after the get a bio.
Or we may want to embed a number of bio_vecs directly at the end
of a bio, to avoid doing two allocations to return a bio. This is now
possible.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-12-29 08:29:23 +01:00
Ingo Molnar
0bfc24559d blktrace: port to tracepoints, update
Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
sites. Spread them out to the usage sites, as suggested by
Mathieu Desnoyers.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
2008-11-26 13:04:35 +01:00
Arnaldo Carvalho de Melo
5f3ea37c77 blktrace: port to tracepoints
This was a forward port of work done by Mathieu Desnoyers, I changed it to
encode the 'what' parameter on the tracepoint name, so that one can register
interest in specific events and not on classes of events to then check the
'what' parameter.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-26 12:13:34 +01:00
Chandra Seetharaman
8a57dfc6f9 dm: avoid destroying table in dm_any_congested
dm_any_congested() just checks for the DMF_BLOCK_IO and has no
code to make sure that suspend waits for dm_any_congested() to
complete.  This patch adds such a check.

Without it, a race can occur with dm_table_put() attempting to
destroying the table in the wrong thread, the one running
dm_any_congested() which is meant to be quick and return
immediately.

Two examples of problems:
1. Sleeping functions called from congested code, the caller
   of which holds a spin lock.
2. An ABBA deadlock between pdflush and multipathd. The two locks
   in contention are inode lock and kernel lock.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-11-13 23:39:14 +00:00
Mikulas Patocka
d221d2e776 dm: move pending queue wake_up end_io_acct
This doesn't fix any bug, just moves wake_up immediately after decrementing
md->pending, for better code readability.

It must be clear to anyone manipulating md->pending to wake up
the queue if md->pending reaches zero, so move the wakeup as close to
the decrementing as possible.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-11-13 23:39:10 +00:00
Linus Torvalds
2248485640 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
  [PATCH] kill the rest of struct file propagation in block ioctls
  [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
  [PATCH] get rid of blkdev_locked_ioctl()
  [PATCH] get rid of blkdev_driver_ioctl()
  [PATCH] sanitize blkdev_get() and friends
  [PATCH] remember mode of reiserfs journal
  [PATCH] propagate mode through swsusp_close()
  [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
  [PATCH] pass fmode_t to blkdev_put()
  [PATCH] kill the unused bsize on the send side of /dev/loop
  [PATCH] trim file propagation in block/compat_ioctl.c
  [PATCH] end of methods switch: remove the old ones
  [PATCH] switch sr
  [PATCH] switch sd
  [PATCH] switch ide-scsi
  [PATCH] switch tape_block
  [PATCH] switch dcssblk
  [PATCH] switch dasd
  [PATCH] switch mtd_blkdevs
  [PATCH] switch mmc
  ...
2008-10-23 10:23:07 -07:00
Kiyoshi Ueda
51157b4ab4 dm: tidy local_init
This patch tidies local_init() in preparation for request-based dm.
No functional change.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:08 +01:00
Kiyoshi Ueda
f431d9666f dm: remove unused flush_all
This patch removes the DM_WQ_FLUSH_ALL state that is unnecessary.

The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
is never invoked because:
  - 'goto flush_and_out' is the same as 'goto out' because
    the 'goto flush_and_out' is called only when '!noflush'
  - If r is non-zero, then the code above will invoke 'goto out'
    and skip this code.

No functional change.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:07 +01:00
Martin K. Petersen
f3e1d26ede dm: mark split bio as cloned
When a bio gets split, mark its fragments with the BIO_CLONED flag.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:04 +01:00
Al Viro
fe5f9f2cd5 [PATCH] switch dm
ioctl() doesn't need BKL here

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:48:29 -04:00
Al Viro
d4430d62fa [PATCH] beginning of methods conversion
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
	1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both.  That's this changeset.
	2) for each driver convert to new methods.  *ALL* drivers
are converted in this series.
	3) kill the old (renamed) methods.

Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain.  The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.

New methods:
	open(bdev, mode)
	release(disk, mode)
	ioctl(bdev, mode, cmd, arg)		/* Called without BKL */
	compat_ioctl(bdev, mode, cmd, arg)
	locked_ioctl(bdev, mode, cmd, arg)	/* Called with BKL, legacy */

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:32 -04:00
Al Viro
647b3d0084 [PATCH] lose unused arguments in dm ioctl callbacks
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:18 -04:00
Tejun Heo
074a7aca7a block: move stats from disk to part0
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...

* part_stat_*() now updates part0 together if the specified partition
  is not part0.  ie. part_stat_*() are now essentially all_stat_*().

* {disk|all}_stat_*() are gone.

* part_round_stats() is updated similary.  It handles part0 stats
  automatically and disk_round_stats() is killed.

* part_{inc|dec}_in_fligh() is implemented which automatically updates
  part0 stats for parts other than part0.

* disk_map_sector_rcu() is updated to return part0 if no part matches.
  Combined with the above changes, this makes NULL special case
  handling in callers unnecessary.

* Separate stats show code paths for disk are collapsed into part
  stats show code paths.

* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:08 +02:00