Commit Graph

4523 Commits

Author SHA1 Message Date
Andreas Gruenbacher
d018017102 drbd: Remove the terrible DEV hack
DRBD was using dev_err() and similar all over the code; instead of having to
write dev_err(disk_to_dev(device->vdisk), ...) to convert a drbd_device into a
kernel device, a DEV macro was used which implicitly references the device
variable.  This is terrible; introduce separate drbd_err() and similar macros
with an explicit device parameter instead.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:45:01 +01:00
Andreas Gruenbacher
c06ece6ba6 drbd: Turn connection->volumes into connection->peer_devices
Let connection->peer_devices point to peer devices; connection->volumes was
pointing to devices.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:45:00 +01:00
Andreas Gruenbacher
eb6bea673f drbd: Move resource options from connection to resource
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:59 +01:00
Andreas Gruenbacher
9693da2379 drbd: conn_try_disconnect(): Use parameter instead of the global variable
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:58 +01:00
Andreas Gruenbacher
4bc760488c drbd: Replace conn_get_by_name() with drbd_find_resource()
So far, connections and resources always come in pairs, but in the future with
multiple connections per resource, the names will stick with the resources.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:57 +01:00
Andreas Gruenbacher
803ea1348e drbd: Add struct drbd_resource->devices
This allows to access the volumes of a resource by number.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:57 +01:00
Andreas Gruenbacher
93e4bf7a77 drbd: Minor cleanup in conn_new_minor()
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:56 +01:00
Andreas Gruenbacher
d8628a8657 drbd: Add struct drbd_device->resource
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:55 +01:00
Andreas Gruenbacher
a10f6b8ae6 drbd: drbd_adm_down(): Move valid resource name check to drbd_adm_prepare()
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:54 +01:00
Andreas Gruenbacher
77c556f663 drbd: Add struct drbd_resource
In a first step, each resource has exactly one connection, and both objects are
allocated at the same time.  The final result will be one resource and zero or
more connections.

Only allow to delete a resource if all its connections are C_STANDALONE.
Stop the worker threads of all connections early enough.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:53 +01:00
Andreas Gruenbacher
05a10ec790 drbd: Improve some function and variable naming
Rename functions
conn_destroy() -> drbd_destroy_connection(),
drbd_minor_destroy() -> drbd_destroy_device()
drbd_adm_add_minor() -> drbd_adm_add_minor()
drbd_adm_delete_minor() -> drbd_adm_del_minor()

Rename global variable minors to drbd_devices

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:52 +01:00
Andreas Gruenbacher
a6b32bc3ce drbd: Introduce "peer_device" object between "device" and "connection"
In a setup where a device (aka volume) can replicate to multiple peers and one
connection can be shared between multiple devices, we need separate objects to
represent devices on peer nodes and network connections.

As a first step to introduce multiple connections per device, give each
drbd_device object a single drbd_peer_device object which connects it to a
drbd_connection object.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:51 +01:00
Andreas Gruenbacher
bde89a9e15 drbd: Rename drbd_tconn -> drbd_connection
sed -i -e 's:all_tconn:connections:g' -e 's:tconn:connection:g'

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:47 +01:00
Andreas Gruenbacher
b30ab7913b drbd: Rename "mdev" to "device"
sed -i -e 's:mdev:device:g'

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:42:24 +01:00
Andreas Gruenbacher
5476169793 drbd: Rename struct drbd_conf -> struct drbd_device
sed -i -e 's:\<drbd_conf\>:drbd_device:g'

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:36:44 +01:00
Andreas Gruenbacher
a3603a6e3b drbd: Split off on-the-wire protocol definitions
Keep the protocol definitions separate from the kernel code; they are useful in
their own right.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:27:49 +01:00
Philipp Reisner
8e22943430 drbd: Add missing error goto
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-02-17 16:24:47 +01:00
Rashika Kheria
d9216c8d9a drivers: block: Remove unused function drbd_bm_write_lazy() in drbd_bitmap.c
Remove unused function drbd_bm_write_lazy() in drbd/drbd_bitmap.c.

This eliminates the following warning in drbd/drbd_bitmap.c:
drivers/block/drbd/drbd_bitmap.c:1208:5: warning: no previous prototype for ‘drbd_bm_write_lazy’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:39 +01:00
Rashika Kheria
fbe0d91ca3 drivers: block: Mark function seq_printf_with_thousands_grouping() as static in drbd_proc.c
Mark function seq_printf_with_thousands_grouping() as static in
drbd/drbd_proc.c because it is not used outside this file.

This eliminates the following warning in drbd/drbd_proc.c:
drivers/block/drbd/drbd_proc.c:49:6: warning: no previous prototype for ‘seq_printf_with_thousands_grouping’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:39 +01:00
Rashika Kheria
a186e47856 drivers: block: Mark the function as static in drbd_worker.c
Mark functions drbd_endio_read_sec_final(), drbd_send_barrier(),
need_to_send_barrier(), dequeue_work_batch(), dequeue_work_item() and
wait_for_work() as static in drbd/drbd_worker.c because they are not
used outside this file.

This eliminates the following warnings in drbd/drbd_worker.c:
drivers/block/drbd/drbd_worker.c:99:6: warning: no previous prototype for ‘drbd_endio_read_sec_final’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_worker.c:1276:5: warning: no previous prototype for ‘drbd_send_barrier’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_worker.c:1774:6: warning: no previous prototype for ‘need_to_send_barrier’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_worker.c:1798:6: warning: no previous prototype for ‘dequeue_work_batch’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_worker.c:1806:6: warning: no previous prototype for ‘dequeue_work_item’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_worker.c:1815:6: warning: no previous prototype for ‘wait_for_work’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:39 +01:00
Rashika Kheria
de0b2e69b6 drivers: block: Move prototype declaration to appropriate header file from drbd_main.c
Move prototype declaration of functions drbdd_init() and drbd_asender()
from drbd/drbd_main.c to header file drbd/drbd_int.h because these
functions are used by more than one file.

This eliminates the following warning in drbd/drbd_receiver.c:
drivers/block/drbd/drbd_receiver.c:4836:5: warning: no previous prototype for ‘drbdd_init’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_receiver.c:5245:5: warning: no previous prototype for ‘drbd_asender’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:39 +01:00
Rashika Kheria
f63e631a34 drivers: block: Mark functions as static in drbd_receiver.c
Mark functions conn_wait_active_ee_empty() and
drbd_crypto_alloc_digest_safe() as static in drbd/drbd_receiver.c
because they are not used outside this file.

This eliminates the following warning in drbd/drbd_receiver.c:
drivers/block/drbd/drbd_receiver.c:1401:6: warning: no previous prototype for ‘conn_wait_active_ee_empty’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_receiver.c:3259:21: warning: no previous prototype for ‘drbd_crypto_alloc_digest_safe’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:39 +01:00
Rashika Kheria
01cd263614 drivers: block: Mark functions as static in drbd_req.c
Mark functions drbd_request_prepare() and find_oldest_request() as
static in drbd/drbd_req.c because they are not used outside this file.

This eliminates the following warnings in drbd/drbd_req.c:
drivers/block/drbd/drbd_req.c:1037:1: warning: no previous prototype for ‘drbd_request_prepare’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_req.c:1323:22: warning: no previous prototype for ‘find_oldest_request’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:39 +01:00
Rashika Kheria
ed54482b99 drivers: block: Move prototype declaration of function tl_abort_disk_io() to appropriate header file from drbd_state.c
Move the prototype declaration of function tl_abort_disk_io() from
drbd/drbd_state.c to appropriate header file drbd/drbd_int.h because it
is used by more than 2 files.

This eliminates the following warnings in drbd/drbd_main.c:
drivers/block/drbd/drbd_main.c:310:6: warning: no previous prototype for ‘tl_abort_disk_io’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:38 +01:00
Rashika Kheria
a99efafc26 drivers: block: Mark function as static in drbd_actlog.c
Mark the function drbd_al_begin_io_prepare() as static in
drbd/drbd_actlog.c because it is not used outside this file.

This eliminates the following warnings in drbd/drbd_actlog.c:
drivers/block/drbd/drbd_actlog.c:277:6: warning: no previous prototype for ‘drbd_al_begin_io_prepare’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:38 +01:00
Rashika Kheria
4b7a530f6b drivers: block: Mark functions as static in drbd_nl.c
Mark functions conn_khelper(), nla_put_drbd_cfg_context(),
nla_put_status_info() and get_one_status() as static in drbd/drbd_nl.c
because they are not used outside this file.

This eliminates the following warnings in drbd/drbd_nl.c:
drivers/block/drbd/drbd_nl.c:365:5: warning: no previous prototype for ‘conn_khelper’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_nl.c:2727:5: warning: no previous prototype for ‘nla_put_drbd_cfg_context’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_nl.c:2753:5: warning: no previous prototype for ‘nla_put_status_info’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_nl.c:2895:5: warning: no previous prototype for ‘get_one_status’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:38 +01:00
Rashika Kheria
16f4e743c8 drivers: block: Mark functions as static in drbd_main.c
Mark functions _drbd_send_uuids(), fill_bitmap_rle_bits() and
init_submitter() as static in drbd/drbd_main.c because they are
not used outside this file.

This eliminates the following warnings in drbd/drbd_main.c:
drivers/block/drbd/drbd_main.c:826:5: warning: no previous prototype for ‘_drbd_send_uuids’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_main.c:1070:5: warning: no previous prototype for ‘fill_bitmap_rle_bits’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_main.c:2592:5: warning: no previous prototype for ‘init_submitter’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:38 +01:00
Linus Torvalds
5e57dc8110 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
 "Second round of updates and fixes for 3.14-rc2.  Most of this stuff
  has been queued up for a while.  The notable exception is the blk-mq
  changes, which are naturally a bit more in flux still.

  The pull request contains:

   - Two bug fixes for the new immutable vecs, causing crashes with raid
     or swap.  From Kent.

   - Various blk-mq tweaks and fixes from Christoph.  A fix for
     integrity bio's from Nic.

   - A few bcache fixes from Kent and Darrick Wong.

   - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas
     Swenson, and Roger Pau Monne.

   - Fix for a vec miscount with integrity vectors from Martin.

   - Minor annotations or fixes from Masanari Iida and Rashika Kheria.

   - Tweak to null_blk to do more normal FIFO processing of requests
     from Shlomo Pongratz.

   - Elevator switching bypass fix from Tejun.

   - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from
     me"

* 'for-linus' of git://git.kernel.dk/linux-block: (31 commits)
  block: add cond_resched() to potentially long running ioctl discard loop
  xen-blkback: init persistent_purge_work work_struct
  blk-mq: pair blk_mq_start_request / blk_mq_requeue_request
  blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq
  block: Fix cloning of discard/write same bios
  block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show
  blk-mq: rework flush sequencing logic
  null_blk: use blk_complete_request and blk_mq_complete_request
  virtio_blk: use blk_mq_complete_request
  blk-mq: rework I/O completions
  fs: Add prototype declaration to appropriate header file include/linux/bio.h
  fs: Mark function as static in fs/bio-integrity.c
  block/null_blk: Fix completion processing from LIFO to FIFO
  block: Explicitly handle discard/write same segments
  block: Fix nr_vecs for inline integrity vectors
  blk-mq: Add bio_integrity setup to blk_mq_make_request
  blk-mq: initialize sg_reserved_size
  blk-mq: handle dma_drain_size
  blk-mq: divert __blk_put_request for MQ ops
  blk-mq: support at_head inserations for blk_execute_rq
  ...
2014-02-14 10:45:18 -08:00
Roger Pau Monne
abb97b8c50 xen-blkback: init persistent_purge_work work_struct
Initialize persistent_purge_work work_struct on xen_blkif_alloc (and
remove the previous initialization done in purge_persistent_gnt). This
prevents flush_work from complaining even if purge_persistent_gnt has
not been used.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-11 20:34:03 -07:00
Jens Axboe
9d4cb8e3a5 Merge branch 'stable/for-jens-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into for-linus
Konrad writes:

Please git pull the following branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git stable/for-jens-3.14

which is based off v3.13-rc6. If you would like me to rebase it on
a different branch/tag I would be more than happy to do so.

The patches are all bug-fixes and hopefully can go in 3.14.

They deal with xen-blkback shutdown and cause memory leaks
as well as shutdown races. They should go to stable tree and if you
are OK with I will ask them to backport those fixes.

There is also a fix to xen-blkfront to deal with unexpected state
transition. And lastly a fix to the header where it was using the
__aligned__ unnecessarily.
2014-02-10 12:52:34 -07:00
Christoph Hellwig
ce2c350b2c null_blk: use blk_complete_request and blk_mq_complete_request
Use the block layer helpers for CPU-local completions instead of
reimplementing them locally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10 09:27:31 -07:00
Christoph Hellwig
5124c28579 virtio_blk: use blk_mq_complete_request
Make sure to complete requests on the submitting CPU.  Previously this
was done in blk_mq_end_io, but the responsibility shifted to the drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10 09:27:31 -07:00
Shlomo Pongratz
d7790b928d block/null_blk: Fix completion processing from LIFO to FIFO
The completion queue is implemented using lockless list.

The llist_add is adds the events to the list head which is a push operation.
The processing of the completion elements is done by disconnecting all the
pushed elements and iterating over the disconnected list. The problem is
that the processing is done in reverse order w.r.t order of the insertion
i.e. LIFO processing. By reversing the disconnected list which is done in
linear time the desired FIFO processing is achieved.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07 13:56:07 -07:00
David Vrabel
3661371701 xen-blkfront: handle backend CLOSED without CLOSING
Backend drivers shouldn't transistion to CLOSED unless the frontend is
CLOSED.  If a backend does transition to CLOSED too soon then the
frontend may not see the CLOSING state and will not properly shutdown.

So, treat an unexpected backend CLOSED state the same as CLOSING.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 13:35:20 -05:00
Roger Pau Monne
80bfa2f6e2 xen-blkif: drop struct blkif_request_segment_aligned
This was wrongly introduced in commit 402b27f9, the only difference
between blkif_request_segment_aligned and blkif_request_segment is
that the former has a named padding, while both share the same
memory layout.

Also correct a few minor glitches in the description, including for it
to no longer assume PAGE_SIZE == 4096.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[Description fix by Jan Beulich]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 13:03:53 -05:00
Roger Pau Monne
c05f3e3c85 xen-blkback: fix shutdown race
Introduce a new variable to keep track of the number of in-flight
requests. We need to make sure that when xen_blkif_put is called the
request has already been freed and we can safely free xen_blkif, which
was not the case before.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 12:59:30 -05:00
Roger Pau Monne
ef75341133 xen-blkback: fix memory leaks
I've at least identified two possible memory leaks in blkback, both
related to the shutdown path of a VBD:

- blkback doesn't wait for any pending purge work to finish before
  cleaning the list of free_pages. The purge work will call
  put_free_pages and thus we might end up with pages being added to
  the free_pages list after we have emptied it. Fix this by making
  sure there's no pending purge work before exiting
  xen_blkif_schedule, and moving the free_page cleanup code to
  xen_blkif_free.
- blkback doesn't wait for pending requests to end before cleaning
  persistent grants and the list of free_pages. Again this can add
  pages to the free_pages list or persistent grants to the
  persistent_gnts red-black tree. Fixed by moving the persistent
  grants and free_pages cleanup code to xen_blkif_free.

Also, add some checks in xen_blkif_free to make sure we are cleaning
everything.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 12:58:46 -05:00
Matt Rushton
2ed22e3c3b xen-blkback: fix memory leak when persistent grants are used
Currently shrink_free_pagepool() is called before the pages used for
persistent grants are released via free_persistent_gnts(). This
results in a memory leak when a VBD that uses persistent grants is
torn down.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xen.org
Cc: Anthony Liguori <aliguori@amazon.com>
Signed-off-by: Matt Rushton <mrushton@amazon.com>
Signed-off-by: Matt Wilson <msw@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 12:58:18 -05:00
Linus Torvalds
1cd731df09 Bug-fixes:
- Revert "xen/grant-table: Avoid m2p_override during mapping" as it broke Xen ARM build.
  - Fix CR4 not being set on AP processors in Xen PVH mode.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS8AyQAAoJEFjIrFwIi8fJbD4IAJssMuaLI5CRsSWBgDFHHDFt
 srVJpDOYQiDr/TxkwFCVcL4sFy9Htb3KMArU4eIBl6uMqQbGa+3rHyXcHYI219YY
 XH3D8RG+9JChwsxtaeUEzwx1C8ehcygD34vtdcoQXa7eBuEi4TL3HeLifR+HrXKO
 UdFrTA34FmvpVFbSuRXkZh5sd6ca9et9xHuQHM8SIY6pVokY6xaEYOp17tfPZpwM
 7A6LFjUjXeugHC2L3+/H8UOHA9nSZQvnMiZOWq2Cusc2Dt2V7emzgk2wcc2CHttf
 EA6GbtiJzHqMPmt5EjubI9hHdSMB31HpY4hnQE38+ucl+BwiSdRE9z2Rm4TYClg=
 =IX4M
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fixes from Konrad Rzeszutek Wilk:
 "Bug-fixes:
   - Revert "xen/grant-table: Avoid m2p_override during mapping" as it
     broke Xen ARM build.
   - Fix CR4 not being set on AP processors in Xen PVH mode"

* tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pvh: set CR4 flags for APs
  Revert "xen/grant-table: Avoid m2p_override during mapping"
2014-02-05 16:01:11 -08:00
Linus Torvalds
8352650a5c Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver update from Matthew Wilcox:
 "Looks like I missed the merge window ...  but these are almost all
  bugfixes anyway (the ones that aren't have been baking for months)"

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Namespace use after free on surprise removal
  NVMe: Correct uses of INIT_WORK
  NVMe: Include device and queue numbers in interrupt name
  NVMe: Add a pci_driver shutdown method
  NVMe: Disable admin queue on init failure
  NVMe: Dynamically allocate partition numbers
  NVMe: Async IO queue deletion
  NVMe: Surprise removal handling
  NVMe: Abort timed out commands
  NVMe: Schedule reset for failed controllers
  NVMe: Device resume error handling
  NVMe: Cache dev->pci_dev in a local pointer
  NVMe: Fix lockdep warnings
  NVMe: compat SG_IO ioctl
  NVMe: remove deprecated IRQF_DISABLED
  NVMe: Avoid shift operation when writing cq head doorbell
2014-02-05 15:53:26 -08:00
Konrad Rzeszutek Wilk
e85fc98055 Revert "xen/grant-table: Avoid m2p_override during mapping"
This reverts commit 08ece5bb23.

As it breaks ARM builds and needs more attention
on the ARM side.

Acked-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-03 06:44:49 -05:00
Keith Busch
9ac27090f6 NVMe: Namespace use after free on surprise removal
An nvme block device may have open references when the device is
removed. New commands may still be sent on the removed device, so we
need to ref count the opens, return errors for new commands, and not
free the namespace and nvme_dev until all references are closed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-02-02 13:31:15 -05:00
Linus Torvalds
14164b46fc Bug-fixes:
- Xen ARM couldn't use the new FIFO events
  - Xen ARM couldn't use the SWIOTLB if compiled as 32-bit with 64-bit PCIe devices.
  - Grant table were doing needless M2P operations.
  - Ratchet down the self-balloon code so it won't OOM.
  - Fix misplaced kfree in Xen PVH error code paths.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS68IQAAoJEFjIrFwIi8fJAWgH/j4HStEey3rgGcqwIWSHkkap
 +t55wsrT8Ylq6CzZjaUtCo3pB7HotW526x/0rA2pxVqHn/8oCN/1EtdrNtYm/umX
 qOoda+db5NIjAEGVLWSLqGyokJQDrX/brXIWfYR300e9fnJi7yT/rFC4QHoZVUYl
 5LME8XH/jE012vvYelNu6DbbodlRmVCT8hctJS+eB5ER2WmtD9Pkw4GybEXPVYJz
 hE0Ts1DN91nKP2FGJb+mfB9UFT5X8i00akAK+Qc1R3sRnRh6eRoNV8dgyCnudKpO
 UPEdiAZvgij+mzlgIYSz6nKH0U/VbvRsG3lc3i5Si3o+vR3CYPCkvzOGX2d0rjw=
 =7cxW
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.14-rc0-late-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen bugfixes from Konrad Rzeszutek Wilk:
 "Bug-fixes for the new features that were added during this cycle.

  There are also two fixes for long-standing issues for which we have a
  solution: grant-table operations extra work that was not needed
  causing performance issues and the self balloon code was too
  aggressive causing OOMs.

  Details:
   - Xen ARM couldn't use the new FIFO events
   - Xen ARM couldn't use the SWIOTLB if compiled as 32-bit with 64-bit PCIe devices.
   - Grant table were doing needless M2P operations.
   - Ratchet down the self-balloon code so it won't OOM.
   - Fix misplaced kfree in Xen PVH error code paths"

* tag 'stable/for-linus-3.14-rc0-late-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pvh: Fix misplaced kfree from xlated_setup_gnttab_pages
  drivers: xen: deaggressive selfballoon driver
  xen/grant-table: Avoid m2p_override during mapping
  xen/gnttab: Use phys_addr_t to describe the grant frame base address
  xen: swiotlb: handle sizeof(dma_addr_t) != sizeof(phys_addr_t)
  arm/xen: Initialize event channels earlier
2014-01-31 08:38:18 -08:00
Zoltan Kiss
08ece5bb23 xen/grant-table: Avoid m2p_override during mapping
The grant mapping API does m2p_override unnecessarily: only gntdev needs it,
for blkback and future netback patches it just cause a lock contention, as
those pages never go to userspace. Therefore this series does the following:
- the original functions were renamed to __gnttab_[un]map_refs, with a new
  parameter m2p_override
- based on m2p_override either they follow the original behaviour, or just set
  the private flag and call set_phys_to_machine
- gnttab_[un]map_refs are now a wrapper to call __gnttab_[un]map_refs with
  m2p_override false
- a new function gnttab_[un]map_refs_userspace provides the old behaviour

It also removes a stray space from page.h and change ret to 0 if
XENFEAT_auto_translated_physmap, as that is the only possible return value
there.

v2:
- move the storing of the old mfn in page->index to gnttab_map_refs
- move the function header update to a separate patch

v3:
- a new approach to retain old behaviour where it needed
- squash the patches into one

v4:
- move out the common bits from m2p* functions, and pass pfn/mfn as parameter
- clear page->private before doing anything with the page, so m2p_find_override
  won't race with this

v5:
- change return value handling in __gnttab_[un]map_refs
- remove a stray space in page.h
- add detail why ret = 0 now at some places

v6:
- don't pass pfn to m2p* functions, just get it locally

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Suggested-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-31 09:48:32 -05:00
Minchan Kim
e46e33152e zram: remove zram->lock in read path and change it with mutex
Finally, we separated zram->lock dependency from 32bit stat/ table
handling so there is no reason to use rw_semaphore between read and
write path so this patch removes the lock from read path totally and
changes rw_semaphore with mutex.  So, we could do

old:

  read-read: OK
  read-write: NO
  write-write: NO

Now:

  read-read: OK
  read-write: OK
  write-write: NO

The below data proves mixed workload performs well 11 times and there is
also enhance on write-write path because current rw-semaphore doesn't
support SPIN_ON_OWNER.  It's side effect but anyway good thing for us.

Write-related tests perform better (from 61% to 1058%) but read path has
good/bad(from -2.22% to 1.45%) but they are all marginal within stddev.

  CPU 12
  iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0

  ==Initial write                ==Initial write
  records: 10                    records: 10
  avg:  516189.16                avg:  839907.96
  std:   22486.53 (4.36%)        std:   47902.17 (5.70%)
  max:  546970.60                max:  909910.35
  min:  481131.54                min:  751148.38
  ==Rewrite                      ==Rewrite
  records: 10                    records: 10
  avg:  509527.98                avg: 1050156.37
  std:   45799.94 (8.99%)        std:   40695.44 (3.88%)
  max:  611574.27                max: 1111929.26
  min:  443679.95                min:  980409.62
  ==Read                         ==Read
  records: 10                    records: 10
  avg: 4408624.17                avg: 4472546.76
  std:  281152.61 (6.38%)        std:  163662.78 (3.66%)
  max: 4867888.66                max: 4727351.03
  min: 4058347.69                min: 4126520.88
  ==Re-read                      ==Re-read
  records: 10                    records: 10
  avg: 4462147.53                avg: 4363257.75
  std:  283546.11 (6.35%)        std:  247292.63 (5.67%)
  max: 4912894.44                max: 4677241.75
  min: 4131386.50                min: 4035235.84
  ==Reverse Read                 ==Reverse Read
  records: 10                    records: 10
  avg: 4565865.97                avg: 4485818.08
  std:  313395.63 (6.86%)        std:  248470.10 (5.54%)
  max: 5232749.16                max: 4789749.94
  min: 4185809.62                min: 3963081.34
  ==Stride read                  ==Stride read
  records: 10                    records: 10
  avg: 4515981.80                avg: 4418806.01
  std:  211192.32 (4.68%)        std:  212837.97 (4.82%)
  max: 4889287.28                max: 4686967.22
  min: 4210362.00                min: 4083041.84
  ==Random read                  ==Random read
  records: 10                    records: 10
  avg: 4410525.23                avg: 4387093.18
  std:  236693.22 (5.37%)        std:  235285.23 (5.36%)
  max: 4713698.47                max: 4669760.62
  min: 4057163.62                min: 3952002.16
  ==Mixed workload               ==Mixed workload
  records: 10                    records: 10
  avg:  243234.25                avg: 2818677.27
  std:   28505.07 (11.72%)       std:  195569.70 (6.94%)
  max:  288905.23                max: 3126478.11
  min:  212473.16                min: 2484150.69
  ==Random write                 ==Random write
  records: 10                    records: 10
  avg:  555887.07                avg: 1053057.79
  std:   70841.98 (12.74%)       std:   35195.36 (3.34%)
  max:  683188.28                max: 1096125.73
  min:  437299.57                min:  992481.93
  ==Pwrite                       ==Pwrite
  records: 10                    records: 10
  avg:  501745.93                avg:  810363.09
  std:   16373.54 (3.26%)        std:   19245.01 (2.37%)
  max:  518724.52                max:  833359.70
  min:  464208.73                min:  765501.87
  ==Pread                        ==Pread
  records: 10                    records: 10
  avg: 4539894.60                avg: 4457680.58
  std:  197094.66 (4.34%)        std:  188965.60 (4.24%)
  max: 4877170.38                max: 4689905.53
  min: 4226326.03                min: 4095739.72

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
Minchan Kim
f614a9f48d zram: remove workqueue for freeing removed pending slot
Commit a0c516cbfc ("zram: don't grab mutex in zram_slot_free_noity")
introduced free request pending code to avoid scheduling by mutex under
spinlock and it was a mess which made code lenghty and increased
overhead.

Now, we don't need zram->lock any more to free slot so this patch
reverts it and then, tb_lock should protect it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
92967471b6 zram: introduce zram->tb_lock
Currently, the zram table is protected by zram->lock but it's rather
coarse-grained lock and it makes hard for scalibility.

Let's use own rwlock instead of depending on zram->lock.  This patch
adds new locking so obviously, it would make slow but this patch is just
prepartion for removing coarse-grained rw_semaphore(ie, zram->lock)
which is hurdle about zram scalability.

Final patch in this patchset series will remove the lock from read-path
and change rw_semaphore with mutex in write path.  With bonus, we could
drop pending slot free mess in next patch.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
deb0bdeb2f zram: use atomic operation for stat
Some of fields in zram->stats are protected by zram->lock which is
rather coarse-grained so let's use atomic operation without explict
locking.

This patch is ready for removing dependency of zram->lock in read path
which is very coarse-grained rw_semaphore.  Of course, this patch adds
new atomic operation so it might make slow but my 12CPU test couldn't
spot any regression.  All gain/lose is marginal within stddev.

  iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0

  ==Initial write                ==Initial write
  records: 50                    records: 50
  avg:  412875.17                avg:  415638.23
  std:   38543.12 (9.34%)        std:   36601.11 (8.81%)
  max:  521262.03                max:  502976.72
  min:  343263.13                min:  351389.12
  ==Rewrite                      ==Rewrite
  records: 50                    records: 50
  avg:  416640.34                avg:  397914.33
  std:   60798.92 (14.59%)       std:   46150.42 (11.60%)
  max:  543057.07                max:  522669.17
  min:  304071.67                min:  316588.77
  ==Read                         ==Read
  records: 50                    records: 50
  avg: 4147338.63                avg: 4070736.51
  std:  179333.25 (4.32%)        std:  223499.89 (5.49%)
  max: 4459295.28                max: 4539514.44
  min: 3753057.53                min: 3444686.31
  ==Re-read                      ==Re-read
  records: 50                    records: 50
  avg: 4096706.71                avg: 4117218.57
  std:  229735.04 (5.61%)        std:  171676.25 (4.17%)
  max: 4430012.09                max: 4459263.94
  min: 2987217.80                min: 3666904.28
  ==Reverse Read                 ==Reverse Read
  records: 50                    records: 50
  avg: 4062763.83                avg: 4078508.32
  std:  186208.46 (4.58%)        std:  172684.34 (4.23%)
  max: 4401358.78                max: 4424757.22
  min: 3381625.00                min: 3679359.94
  ==Stride read                  ==Stride read
  records: 50                    records: 50
  avg: 4094933.49                avg: 4082170.22
  std:  185710.52 (4.54%)        std:  196346.68 (4.81%)
  max: 4478241.25                max: 4460060.97
  min: 3732593.23                min: 3584125.78
  ==Random read                  ==Random read
  records: 50                    records: 50
  avg: 4031070.04                avg: 4074847.49
  std:  192065.51 (4.76%)        std:  206911.33 (5.08%)
  max: 4356931.16                max: 4399442.56
  min: 3481619.62                min: 3548372.44
  ==Mixed workload               ==Mixed workload
  records: 50                    records: 50
  avg:  149925.73                avg:  149675.54
  std:    7701.26 (5.14%)        std:    6902.09 (4.61%)
  max:  191301.56                max:  175162.05
  min:  133566.28                min:  137762.87
  ==Random write                 ==Random write
  records: 50                    records: 50
  avg:  404050.11                avg:  393021.47
  std:   58887.57 (14.57%)       std:   42813.70 (10.89%)
  max:  601798.09                max:  524533.43
  min:  325176.99                min:  313255.34
  ==Pwrite                       ==Pwrite
  records: 50                    records: 50
  avg:  411217.70                avg:  411237.96
  std:   43114.99 (10.48%)       std:   33136.29 (8.06%)
  max:  530766.79                max:  471899.76
  min:  320786.84                min:  317906.94
  ==Pread                        ==Pread
  records: 50                    records: 50
  avg: 4154908.65                avg: 4087121.92
  std:  151272.08 (3.64%)        std:  219505.04 (5.37%)
  max: 4459478.12                max: 4435857.38
  min: 3730512.41                min: 3101101.67

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
874e3cddc3 zram: remove unnecessary free
Commit a0c516cbfc ("zram: don't grab mutex in zram_slot_free_noity")
introduced pending zram slot free in zram's write path in case of
missing slot free by memory allocation failure in zram_slot_free_notify
but it is not necessary because we have already freed the slot right
before overwriting.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
9b353db16d zram: delay pending free request in read path
Sergey reported we don't need to handle pending free request every I/O
so that this patch removes it in read path while we remain it in write
path.

Let's consider below example.

Swap subsystem ask to zram "A" block free by swap_slot_free_notify but
zram had been pended it without real freeing.  Swap subsystem allocates
"A" block for new data but request pended for a long time just handled
and zram blindly free new data on the "A" block.  :(

That's why we couldn't remove handle pending free request right before
zram-write.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
da4a04126b zram: fix race between reset and flushing pending work
Dan and Sergey reported that there is a racy between reset and flushing
of pending work so that it could make oops by freeing zram->meta in
reset while zram_slot_free can access zram->meta if new request is
adding during the race window.

This patch moves flush after taking init_lock so it prevents new request
so that it closes the race.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
7bfb3de8a1 zram: add copyright
Add my copyright to the zram source code which I maintain.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
49061236a9 zram: remove old private project comment
Remove the old private compcache project address so upcoming patches
should be sent to LKML because we Linux kernel community will take care.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
cd67e10ac6 zram: promote zram from staging
Zram has lived in staging for a LONG LONG time and have been
fixed/improved by many contributors so code is clean and stable now.  Of
course, there are lots of product using zram in real practice.

The major TV companys have used zram as swap since two years ago and
recently our production team released android smart phone with zram
which is used as swap, too and recently Android Kitkat start to use zram
for small memory smart phone.  And there was a report Google released
their ChromeOS with zram, too and cyanogenmod have been used zram long
time ago.  And I heard some disto have used zram block device for tmpfs.
In addition, I saw many report from many other peoples.  For example,
Lubuntu start to use it.

The benefit of zram is very clear.  With my experience, one of the
benefit was to remove jitter of video application with backgroud memory
pressure.  It would be effect of efficient memory usage by compression
but more issue is whether swap is there or not in the system.  Recent
mobile platforms have used JAVA so there are many anonymous pages.  But
embedded system normally are reluctant to use eMMC or SDCard as swap
because there is wear-leveling and latency issues so if we do not use
swap, it means we can't reclaim anoymous pages and at last, we could
encounter OOM kill.  :(

Although we have real storage as swap, it was a problem, too.  Because
it sometime ends up making system very unresponsible caused by slow swap
storage performance.

Quote from Luigi on Google
 "Since Chrome OS was mentioned: the main reason why we don't use swap
  to a disk (rotating or SSD) is because it doesn't degrade gracefully
  and leads to a bad interactive experience.  Generally we prefer to
  manage RAM at a higher level, by transparently killing and restarting
  processes.  But we noticed that zram is fast enough to be competitive
  with the latter, and it lets us make more efficient use of the
  available RAM.  " and he announced.
http://www.spinics.net/lists/linux-mm/msg57717.html

Other uses case is to use zram for block device.  Zram is block device
so anyone can format the block device and mount on it so some guys on
the internet start zram as /var/tmp.
http://forums.gentoo.org/viewtopic-t-838198-start-0.html

Let's promote zram and enhance/maintain it instead of removing.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Nitin Gupta <ngupta@vflare.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Linus Torvalds
53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Matthew Wilcox
bdfd70fde3 NVMe: Correct uses of INIT_WORK
We need to initialise the work_struct when we initialise the rest of the
struct nvme_dev, otherwise we'll hit a lockdep warning when we remove
the device.  Use PREPARE_WORK to change the function pointer instead
of INIT_WORK.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-29 11:42:32 -05:00
Linus Torvalds
d891ea23d5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This is a big batch.  From Ilya we have:

   - rbd support for more than ~250 mapped devices (now uses same scheme
     that SCSI does for device major/minor numbering)
   - crush updates for new mapping behaviors (will be needed for coming
     erasure coding support, among other things)
   - preliminary support for tiered storage pools

  There is also a big series fixing a pile cephfs bugs with clustered
  MDSs from Yan Zheng, ACL support for cephfs from Guangliang Zhao, ceph
  fscache improvements from Li Wang, improved behavior when we get
  ENOSPC from Josh Durgin, some readv/writev improvements from
  Majianpeng, and the usual mix of small cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (76 commits)
  ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
  ceph: fix dout() compile warnings in ceph_filemap_fault()
  libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
  libceph: follow redirect replies from osds
  libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
  libceph: follow {read,write}_tier fields on osd request submission
  libceph: add ceph_pg_pool_by_id()
  libceph: CEPH_OSD_FLAG_* enum update
  libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
  libceph: introduce and start using oid abstraction
  libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
  libceph: move ceph_file_layout helpers to ceph_fs.h
  libceph: start using oloc abstraction
  libceph: dout() is missing a newline
  libceph: add ceph_kv{malloc,free}() and switch to them
  libceph: support CEPH_FEATURE_EXPORT_PEER
  ceph: add imported caps when handling cap export message
  ceph: add open export target session helper
  ceph: remove exported caps when handling cap import message
  ceph: handle session flush message
  ...
2014-01-28 11:02:23 -08:00
Matthew Wilcox
3193f07bb7 NVMe: Include device and queue numbers in interrupt name
On larger systems with many drives, it may help debugging to know which
queue is tied to which interrupt, just by looking at /proc/interrupts.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:14:08 -05:00
Keith Busch
09ece1424f NVMe: Add a pci_driver shutdown method
We need to shut down the device cleanly when the system is being shut down.
This was in an earlier patch but was inadvertently lost during a rewrite.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:12:15 -05:00
Keith Busch
a1a5ef9998 NVMe: Disable admin queue on init failure
Disable the admin queue if device fails during initialization so the
queue's irq is freed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[rewritten to use nvme_free_queues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:11:38 -05:00
Matthew Wilcox
469071a37a NVMe: Dynamically allocate partition numbers
Some users need more than 64 partitions per device.  Rather than simply
increasing the number of partitions, switch to the dynamic partition
allocation scheme.

This means that minor numbers are not stable across boots, but since major
numbers aren't either, I cannot see this being a significant problem.

Tested-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:11:29 -05:00
Keith Busch
4d11542070 NVMe: Async IO queue deletion
This attempts to delete all IO queues at the same time asynchronously on
shutdown. This is necessary for a present device that is not responding;
a shutdown operation previously would take 2 minutes per queue-pair
to timeout before moving on to the next queue, making a device removal
appear to take a very long time or "hung" as reported by users.

In the previous worst case, a removal may be stuck forever until a kill
signal is given if there are more than 32 queue pairs since it would run
out of admin command IDs after over an hour of timed out sync commands
(admin queue depth is 64).

This patch will wait for the admin command timeout for all commands to
complete, so the worst case now for an unresponsive controller is 60
seconds, though that still seems like a long time.

Since this adds another way to take queues offline, some duplicate code
resulted so I moved these into more convienient functions.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[make functions static, correct line length and whitespace issues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:07:35 -05:00
Keith Busch
0e53d18051 NVMe: Surprise removal handling
This adds checks to see if the nvme pci device was removed. The check
reads the status register for the value of -1, which it should never be
unless the device is no longer present.

If a user performs a surprise removal on an nvme device, the driver will
be notified either by the pci driver remove callback if the platform's
slot is capable of this event, or via reading the device BAR status
register, which will indicate controller failure and trigger a reset.

Either way, the device is not present so all outstanding commands would
timeout. This will not send queue deletion commands to a drive that
isn't present and fail after ioremap, significantly speeding up surprise
removal; previously this took over 2 minutes per IO queue pair created,
but this will complete removing the device within a few seconds.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 19:28:43 -05:00
Keith Busch
c30341dc3c NVMe: Abort timed out commands
Send nvme abort command to io requests that have timed out on an
initialized device. If the command is not returned after another timeout,
schedule the controller for reset.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[fix endianness issues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 19:27:53 -05:00
Keith Busch
d4b4ff8e28 NVMe: Schedule reset for failed controllers
Schedules a controller reset when it indicates it has a failed status. If
the device does not become ready after a reset, the pci device will be
scheduled for removal.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 19:20:02 -05:00
Ilya Dryomov
3c972c95c6 libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:49 +02:00
Ilya Dryomov
4295f2217a libceph: introduce and start using oid abstraction
In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:28 +02:00
Ilya Dryomov
2d0ebc5d59 libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:24 +02:00
Ilya Dryomov
22116525ba libceph: start using oloc abstraction
Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction.  Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool.  This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.

This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:03 +02:00
Fabian Frederick
a3b25d9b77 drivers/block/Kconfig: update RAM block device module name
RAM block device support module name changed to brd.ko some years ago
with an "rd" alias to match previous module implementation.  This patch
updates its Kconfig definition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:53 -08:00
Linus Torvalds
84621c9b18 Features:
- FIFO event channels. Key advantages: support for over 100,000 events (2^17),
    16 different event priorities, improved fairness in event latency through
    the use of FIFOs.
  - Xen PVH support. "It’s a fully PV kernel mode, running with paravirtualized
    disk and network, paravirtualized interrupts and timers, no emulated devices
    of any kind (and thus no qemu), no BIOS or legacy boot — but instead of
    requiring PV MMU, it uses the HVM hardware extensions to virtualize the
    pagetables, as well as system calls and other privileged operations."
    (from "The Paravirtualization Spectrum, Part 2: From poles to a spectrum")
 Bug-fixes:
  - Fixes in balloon driver (refactor and make it work under ARM)
  - Allow xenfb to be used in HVM guests.
  - Allow xen_platform_pci=0 to work properly.
  - Refactors in event channels.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS4BmLAAoJEFjIrFwIi8fJ4SAH/iNGESowgMhfW64vRA8pBWq+
 NRJpUjYjjwmbxpwoNl6NPwn15cIXFyc3sMtvvrDD3taRDyko2RFuT+NTjpO05xPh
 d/cRpRXpXERHoiFgPf/WTp7ONBDhvPtHG0+BzJKwgqEIOUYXdbhD+gEjaVlFJScS
 CAY68OLmk7XYMSZBNzPfKNbSCyhVgZF7wpaimK9lxZBKsFRCDIq6jIyrAsC8epIL
 6V/V4l2S6lk/uUeGB6ULphYeINjI2kkpbSfCd1vyenLfWpVscc2o8uWEYFcZMAxy
 V4HpsoseuqrfdDqgPfud3VgogdISvbkCvDfW85rzfDP4MWxei2mVHFtJ/gSBV+g=
 =ToNG
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.14-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen updates from Konrad Rzeszutek Wilk:
 "Two major features that Xen community is excited about:

  The first is event channel scalability by David Vrabel - we switch
  over from an two-level per-cpu bitmap of events (IRQs) - to an FIFO
  queue with priorities.  This lets us be able to handle more events,
  have lower latency, and better scalability.  Good stuff.

  The other is PVH by Mukesh Rathor.  In short, PV is a mode where the
  kernel lets the hypervisor program page-tables, segments, etc.  With
  EPT/NPT capabilities in current processors, the overhead of doing this
  in an HVM (Hardware Virtual Machine) container is much lower than the
  hypervisor doing it for us.

  In short we let a PV guest run without doing page-table, segment,
  syscall, etc updates through the hypervisor - instead it is all done
  within the guest container.  It is a "hybrid" PV - hence the 'PVH'
  name - a PV guest within an HVM container.

  The major benefits are less code to deal with - for example we only
  use one function from the the pv_mmu_ops (which has 39 function
  calls); faster performance for syscall (no context switches into the
  hypervisor); less traps on various operations; etc.

  It is still being baked - the ABI is not yet set in stone.  But it is
  pretty awesome and we are excited about it.

  Lastly, there are some changes to ARM code - you should get a simple
  conflict which has been resolved in #linux-next.

  In short, this pull has awesome features.

  Features:
   - FIFO event channels.  Key advantages: support for over 100,000
     events (2^17), 16 different event priorities, improved fairness in
     event latency through the use of FIFOs.
   - Xen PVH support.  "It’s a fully PV kernel mode, running with
     paravirtualized disk and network, paravirtualized interrupts and
     timers, no emulated devices of any kind (and thus no qemu), no BIOS
     or legacy boot — but instead of requiring PV MMU, it uses the HVM
     hardware extensions to virtualize the pagetables, as well as system
     calls and other privileged operations." (from "The
     Paravirtualization Spectrum, Part 2: From poles to a spectrum")

  Bug-fixes:
   - Fixes in balloon driver (refactor and make it work under ARM)
   - Allow xenfb to be used in HVM guests.
   - Allow xen_platform_pci=0 to work properly.
   - Refactors in event channels"

* tag 'stable/for-linus-3.14-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (52 commits)
  xen/pvh: Set X86_CR0_WP and others in CR0 (v2)
  MAINTAINERS: add git repository for Xen
  xen/pvh: Use 'depend' instead of 'select'.
  xen: delete new instances of __cpuinit usage
  xen/fb: allow xenfb initialization for hvm guests
  xen/evtchn_fifo: fix error return code in evtchn_fifo_setup()
  xen-platform: fix error return code in platform_pci_init()
  xen/pvh: remove duplicated include from enlighten.c
  xen/pvh: Fix compile issues with xen_pvh_domain()
  xen: Use dev_is_pci() to check whether it is pci device
  xen/grant-table: Force to use v1 of grants.
  xen/pvh: Support ParaVirtualized Hardware extensions (v3).
  xen/pvh: Piggyback on PVHVM XenBus.
  xen/pvh: Piggyback on PVHVM for grant driver (v4)
  xen/grant: Implement an grant frame array struct (v3).
  xen/grant-table: Refactor gnttab_init
  xen/grants: Remove gnttab_max_grant_frames dependency on gnttab_init.
  xen/pvh: Piggyback on PVHVM for event channels (v2)
  xen/pvh: Update E820 to work with PVH (v2)
  xen/pvh: Secondary VCPU bringup (non-bootup CPUs)
  ...
2014-01-22 22:00:18 -08:00
Linus Torvalds
bb1281f2aa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual rocket science stuff from trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  neighbour.h: fix comment
  sched: Fix warning on make htmldocs caused by wait.h
  slab: struct kmem_cache is protected by slab_mutex
  doc: Fix typo in USB Gadget Documentation
  of/Kconfig: Spelling s/one/once/
  mkregtable: Fix sscanf handling
  lp5523, lp8501: comment improvements
  thermal: rcar: comment spelling
  treewide: fix comments and printk msgs
  IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart()
  Documentation: update /proc/uptime field description
  Documentation: Fix size parameter for snprintf
  arm: fix comment header and macro name
  asm-generic: uaccess: Spelling s/a ny/any/
  mtd: onenand: fix comment header
  doc: driver-model/platform.txt: fix a typo
  drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text
  doc: Fix typo (acces_process_vm -> access_process_vm)
  treewide: Fix typos in printk
  drivers/gpu/drm/qxl/Kconfig: reformat the help text
  ...
2014-01-22 21:21:55 -08:00
Geert Uytterhoeven
14424be4db mg_disk: Spelling s/finised/finished/
Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:34:58 -08:00
Raghavendra K T
9967d8ac5c null_blk: Null pointer deference problem in alloc_page_buffers
If we load the null_blk module with bs=8k we get following oops:
[ 3819.812190] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 3819.812387] IP: [<ffffffff81170aa5>] create_empty_buffers+0x28/0xaf
[ 3819.812527] PGD 219244067 PUD 215a06067 PMD 0
[ 3819.812640] Oops: 0000 [#1] SMP
[ 3819.812772] Modules linked in: null_blk(+)

 Fix that by resetting block size to PAGE_SIZE if it is greater than PAGE_SIZE

Reported-by: Sumanth <sumantk2@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:22:40 -08:00
Sam Bradshaw
26d580575b mtip32xx: Correctly handle security locked condition
If power is removed during a secure erase, the drive will end up in a
security locked condition.  This patch causes the driver to identify,
log, and flag the security lock state.  IOs are prevented from
submission to the drive until the locked state is addressed with a
secure erase.

Bumped version number to reflect this capability.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:20:44 -08:00
Sam Bradshaw
188b9f49d4 mtip32xx: Make SGL container per-command to eliminate high order dma allocation
The mtip32xx driver makes a high order dma memory allocation to store a
command index table, some dedicated buffers, and a command header & SGL
blob.  This allocation can fail with a surprise insert under low &
fragmented memory conditions.

This patch breaks these regions up into separate low order allocations
and increases the maximum number of segments a single command SGL can
have.  We wanted to allow at least 256 segments for 1 MB direct IO.
Since the command header occupies the first 0x80 bytes of the SGL blob,
that meant we needed two 4k pages to contain the header and SGL.  The
two pages allow up to 504 SGL segments.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:20:13 -08:00
Jens Axboe
e1803a706f Merge branch 'for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-3.14/drivers 2014-01-21 20:18:54 -08:00
Olaf Hering
12a64d2f5e drivers/block/loop.c: fix comment typo in loop_config_discard
Discard requests are ignored if the encryption is enabled for the given
loop device.  Update comment to match the code, and similar comments
elsewhere in the file.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:16:56 -08:00
Andrew Morton
4336548af4 drivers/block/cciss.c:cciss_init_one(): use proper errnos
pci_driver.probe should return a meaningful errno, not -1.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:16:56 -08:00
Dan Carpenter
3f7d758b1e drivers/block/paride/pg.c: underflow bug in pg_write()
The test here can underflow so we pass bogus lengths to the hardware.
It's a static checker fix and I don't know the impact.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:16:56 -08:00
Jingoo Han
f2fc8af41c drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release or on
probe failure.  Thus, it is not needed to manually clear the device driver
data to NULL.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:16:56 -08:00
Jingoo Han
28e6c50069 drivers/block/sx8.c: use module_pci_driver()
Use module_pci_driver() macro which makes the code smaller and simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:16:56 -08:00
Linus Torvalds
8cf7a16ee9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
 - Zorro bus cleanups and UAPI revival
 - Bootinfo cleanups and UAPI revival
 - Kexec support
 - Memory size reductions and bug fixes for multi-platform kernels
 - Polled interrupt support for Atari EtherNAT, EtherNEC and NetUSBee
 - Machine-specific random_get_entropy()
 - Defconfig updates and cleanups

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: (46 commits)
  m68k/mac: Make SCC reset work more reliably
  m68k/irq - Use polled IRQ flag for MFP timer cascaded interrupts
  m68k: Update defconfigs for v3.13-rc1
  m68k/defconfig: Enable EARLY_PRINTK
  m68k/mm: kmap spelling/grammar fixes
  m68k: Convert arch/m68k/kernel/traps.c to pr_*()
  m68k: Convert arch/m68k/mm/fault.c to pr_*()
  m68k/mm: Check for mm != NULL in do_page_fault() debug code
  m68k/defconfig: Disable /sbin/hotplug fork-bomb by default
  m68k/atari: Hide RTC_PORT() macro from rtc-cmos
  m68k/amiga,atari: Fix specifying multiple debug= parameters
  m68k/defconfig: Use ext4 for ext2/ext3 file systems
  m68k: Add support to export bootinfo in procfs
  m68k: Add kexec support
  m68k/mac: Mark Mac IIsi ADB driver BROKEN
  m68k/amiga: Provide mach_random_get_entropy()
  m68k: Add infrastructure for machine-specific random_get_entropy()
  m68k/atari: Call paging_init() before nf_init()
  m68k: Remove superfluous inclusions of <asm/bootinfo.h>
  m68k/UAPI: Use proper types (endianness/size) in <asm/bootinfo*.h>
  ...
2014-01-20 09:24:31 -08:00
Jiri Kosina
7b7b68bba5 floppy: bail out in open() if drive is not responding to block0 read
In case reading of block 0 during open() fails, it is not the right thing
to let open() succeed.

Fix this by introducing FD_OPEN_SHOULD_FAIL_BIT flag, and setting it in
case the bio callback encounters an error while trying to read block 0.

As a bonus, this works around certain broken userspace (blkid), which is
not able to properly handle read()s returning IO errors. Hence be nice to
those, and bail out during open() already; if block 0 is not readable,
read()s are not going to provide any meaningful data anyway.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-01-17 11:12:06 +01:00
Ming Lei
518d00b749 block: null_blk: fix queue leak inside removing device
When queue_mode is NULL_Q_MQ and null_blk is being removed,
blk_cleanup_queue() isn't called to cleanup queue, so the queue
allocated won't be freed.

This patch calls blk_cleanup_queue() for MQ to drain all pending
requests first and release the reference counter of queue kobject, then
blk_mq_free_queue() will be called in queue kobject's release handler
when queue kobject's reference counter drops to zero.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-12 16:22:42 +07:00
Jens Axboe
54a387cb9e Merge branch 'for-3.14/core' into for-3.14/drivers
We need the updated code to make bcache easier to merge.
2014-01-08 09:32:45 -07:00
Konrad Rzeszutek Wilk
51c71a3bba xen/pvhvm: If xen_platform_pci=0 is set don't blow up (v4).
The user has the option of disabling the platform driver:
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
and allow the PV drivers to take over. If the user wishes
to disable that they can set:

  xen_platform_pci=0
  (in the guest config file)

or
  xen_emul_unplug=never
  (on the Linux command line)

except it does not work properly. The PV drivers still try to
load and since the Xen platform driver is not run - and it
has not initialized the grant tables, most of the PV drivers
stumble upon:

input: Xen Virtual Keyboard as /devices/virtual/input/input5
input: Xen Virtual Pointer as /devices/virtual/input/input6M
------------[ cut here ]------------
kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
invalid opcode: 0000 [#1] SMP
Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
RIP: 0010:[<ffffffff813ddc40>]  [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
Call Trace:
 [<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
 [<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
 [<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
 [<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
 [<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
 [<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
 [<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
 [<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
 [<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
 [<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
 [<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
 [<ffffffff8145e7d9>] driver_attach+0x19/0x20
 [<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
 [<ffffffff8145f1ff>] driver_register+0x5f/0xf0
 [<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
 [<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
 [<ffffffffa0015000>] ? 0xffffffffa0014fff
 [<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
 [<ffffffff81002049>] do_one_initcall+0x49/0x170

.. snip..

which is hardly nice. This patch fixes this by having each
PV driver check for:
 - if running in PV, then it is fine to execute (as that is their
   native environment).
 - if running in HVM, check if user wanted 'xen_emul_unplug=never',
   in which case bail out and don't load any PV drivers.
 - if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
   does not exist, then bail out and not load PV drivers.
 - (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
   then bail out for all PV devices _except_ the block one.
   Ditto for the network one ('nics').
 - (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
   then load block PV driver, and also setup the legacy IDE paths.
   In (v3) make it actually load PV drivers.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Reported-and-Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Add extra logic to handle the myrid ways 'xen_emul_unplug'
can be used per Ian and Stefano suggestion]
[v3: Make the unnecessary case work properly]
[v4: s/disks/ide-disks/ spotted by Fabio]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
CC: stable@vger.kernel.org
2014-01-03 14:54:18 -05:00
Julia Lawall
8586ea96b4 pktcdvd: fix error return code
Set the return variable to an error code as done elsewhere in the function.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
(
if@p1 (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret@p1 = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}

// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-01-03 10:05:34 +01:00
Ilya Dryomov
e37180c0f2 rbd: tear down watch request if rbd_dev_device_setup() fails
Tear down watch request if rbd_dev_device_setup() fails.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:32:07 +02:00
Ilya Dryomov
fca2706539 rbd: introduce rbd_dev_header_unwatch_sync() and switch to it
Rename rbd_dev_header_watch_sync() to __rbd_dev_header_watch_sync() and
introduce two helpers: rbd_dev_header_{,un}watch_sync() to make it more
clear what is going on.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:32:06 +02:00
Ilya Dryomov
7e513d4366 rbd: enable extended devt in single-major mode
If single-major device number allocation scheme is turned on, instead
of reserving 256 minors per device, which imposes a limit of 4096
images mapped at once, reserve 16 minors per device and enable extended
devt feature.  This results in a theoretical limit of 65536 images
mapped at once, and still allows to have more than 15 partititions:
partitions starting with 16th are mapped under major 259 (Block
Extended Major):

$ rbd showmapped
id pool image snap device
0  rbd  b5    -    /dev/rbd0    # no partitions
1  rbd  b2    -    /dev/rbd1    # 40 partitions
2  rbd  b3    -    /dev/rbd2    #  2 partitions

$ cat /proc/partitions
 251        0       1024 rbd0
 251       16       1024 rbd1
 251       17          0 rbd1p1
 251       18          0 rbd1p2
 ...
 251       30          0 rbd1p14
 251       31          0 rbd1p15
 259        0          0 rbd1p16
 259        1          0 rbd1p17
 ...
 259       23          0 rbd1p39
 259       24          0 rbd1p40
 251       32       1024 rbd2
 251       33          0 rbd2p1
 251       34          0 rbd2p2

(major 251 was assigned dynamically at module load time)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:32:04 +02:00
Ilya Dryomov
9b60e70b3b rbd: add support for single-major device number allocation scheme
Currently each rbd device is allocated its own major number, which
leads to a hard limit of 230-250 images mapped at once.  This commit
adds support for a new single-major device number allocation scheme,
which is hidden behind a new single_major boolean module parameter and
is disabled by default for backwards compatibility reasons.  (Old
userspace cannot correctly unmap images mapped under single-major
scheme and would essentially just unmap a random image, if that.)

$ rbd showmapped
id pool image snap device
0  rbd  b100  -    /dev/rbd0
1  rbd  b101  -    /dev/rbd1
2  rbd  b102  -    /dev/rbd2
3  rbd  b103  -    /dev/rbd3

Old scheme (modprobe rbd):

$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253, 0 Dec 10 12:24 /dev/rbd0
brw-rw---- 1 root disk 252, 0 Dec 10 12:28 /dev/rbd1
brw-rw---- 1 root disk 252, 1 Dec 10 12:28 /dev/rbd1p1
brw-rw---- 1 root disk 252, 2 Dec 10 12:28 /dev/rbd1p2
brw-rw---- 1 root disk 252, 3 Dec 10 12:28 /dev/rbd1p3
brw-rw---- 1 root disk 251, 0 Dec 10 12:28 /dev/rbd2
brw-rw---- 1 root disk 251, 1 Dec 10 12:28 /dev/rbd2p1
brw-rw---- 1 root disk 250, 0 Dec 10 12:24 /dev/rbd3

New scheme (modprobe rbd single_major=Y):

$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253,   0 Dec 10 12:30 /dev/rbd0
brw-rw---- 1 root disk 253, 256 Dec 10 12:30 /dev/rbd1
brw-rw---- 1 root disk 253, 257 Dec 10 12:30 /dev/rbd1p1
brw-rw---- 1 root disk 253, 258 Dec 10 12:30 /dev/rbd1p2
brw-rw---- 1 root disk 253, 259 Dec 10 12:30 /dev/rbd1p3
brw-rw---- 1 root disk 253, 512 Dec 10 12:30 /dev/rbd2
brw-rw---- 1 root disk 253, 513 Dec 10 12:30 /dev/rbd2p1
brw-rw---- 1 root disk 253, 768 Dec 10 12:30 /dev/rbd3

(major 253 was assigned dynamically at module load time)

The new limit is 4096 images mapped at once, and it comes from the fact
that, as before, 256 minor numbers are reserved for each mapping.
(A follow-up commit changes the number of minors reserved and the way
we deal with partitions over that number.)

If single_major is set to true, two new sysfs interfaces show up:
/sys/bus/rbd/{add,remove}_single_major.  These are to be used instead
of /sys/bus/rbd/{add,remove}, which are disabled for backwards
compatibility reasons outlined above.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:31:59 +02:00
Ilya Dryomov
92c76dc036 rbd: wire up is_visible() sysfs callback for rbd bus
In preparation for single-major device number allocation scheme, wire
up attribute_group::is_visible() callback for rbd bus.  This allows us
to make the new single-major attributes conditional.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:31:58 +02:00
Ilya Dryomov
dd82fff1e8 rbd: add 'minor' sysfs rbd device attribute
Introduce /sys/bus/rbd/devices/<id>/minor sysfs attribute for exporting
rbd whole disk minor numbers.  This is a step towards single-major
device number allocation scheme, but also a good thing on its own.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:31:57 +02:00
Ilya Dryomov
f8a22fc238 rbd: switch to ida for rbd id assignments
Currently rbd ids are allocated using an atomic variable that keeps
track of the highest id currently in use and each new id is simply one
more than the value of that variable.  That's nice and cheap, but it
does mean that rbd ids are allowed to grow boundlessly, and, more
importantly, it's completely unpredictable.  So, in preparation for
single-major device number allocation scheme, which is going to
establish and rely on a constant mapping between rbd ids and device
numbers, switch to ida for rbd id assignments.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:31:56 +02:00
Ilya Dryomov
e1b4d96dea rbd: refactor rbd_init() a bit
Refactor rbd_init() a bit to make it more clear what's going on.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:31:55 +02:00
Ilya Dryomov
90da258b88 rbd: tweak "loaded" message and module description
Tweak "loaded" message, so that it looks like

[   30.184235] rbd: loaded

instead of

[   38.056564] rbd: loaded rbd (rados block device)

Also move (and slightly tweak) MODULE_DESCRIPTION so that all authors
are next to each other in modinfo output.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:31:54 +02:00
Ilya Dryomov
70eebd2006 rbd: rbd_device::dev_id is an int, format it as such
rbd_device::dev_id is an int, format it as such.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 20:31:53 +02:00
Jens Axboe
b28bc9b38c Linux 3.13-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
 LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
 eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
 YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
 =/4/R
 -----END PGP SIGNATURE-----

Merge tag 'v3.13-rc6' into for-3.14/core

Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.

Fixup merge issues related to the immutable biovec changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/blk-flush.c
	fs/btrfs/check-integrity.c
	fs/btrfs/extent_io.c
	fs/btrfs/scrub.c
	fs/logfs/dev_bdev.c
2013-12-31 09:51:02 -07:00
Linus Torvalds
c5fdd531b5 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 - fix for a memory leak on certain unplug events
 - a collection of bcache fixes from Kent and Nicolas
 - a few null_blk fixes and updates form Matias
 - a marking of static of functions in the stec pci-e driver

* 'for-linus' of git://git.kernel.dk/linux-block:
  null_blk: support submit_queues on use_per_node_hctx
  null_blk: set use_per_node_hctx param to false
  null_blk: corrections to documentation
  null_blk: warning on ignored submit_queues param
  null_blk: refactor init and init errors code paths
  null_blk: documentation
  null_blk: mem garbage on NUMA systems during init
  drivers: block: Mark the functions as static in skd_main.c
  bcache: New writeback PD controller
  bcache: bugfix for race between moving_gc and bucket_invalidate
  bcache: fix for gc and writeback race
  bcache: bugfix - moving_gc now moves only correct buckets
  bcache: fix for gc crashing when no sectors are used
  bcache: Fix heap_peek() macro
  bcache: Fix for can_attach_cache()
  bcache: Fix dirty_data accounting
  bcache: Use uninterruptible sleep in writeback
  bcache: kthread don't set writeback task to INTERUPTIBLE
  block: fix memory leaks on unplugging block device
  bcache: fix sparse non static symbol warning
2013-12-24 10:06:03 -08:00
Matias Bjørling
fc1bc35443 null_blk: support submit_queues on use_per_node_hctx
In the case of both the submit_queues param and use_per_node_hctx param
are used. We limit the number af submit_queues to the number of online
nodes.

If the submit_queues is a multiple of nr_online_nodes, its trivial. Simply map
them to the nodes. For example: 8 submit queues are mapped as node0[0,1],
node1[2,3], ...
If uneven, we are left with an uneven number of submit_queues that must be
mapped. These are mapped toward the first node and onward. E.g. 5
submit queues mapped onto 4 nodes are mapped as node0[0,1], node1[2], ...

Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-12-21 09:30:34 -07:00
Matias Bjørling
200052440d null_blk: set use_per_node_hctx param to false
The defaults for the module is to instantiate itself with blk-mq and a
submit queue for each CPU node in the system.

To save resources, initialize instead with a single submit queue.

Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-12-21 09:30:33 -07:00
Matias Bjorling
d15ee6b1a4 null_blk: warning on ignored submit_queues param
Let the user know when the number of submission queues are being
ignored.

Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-12-19 08:09:43 -07:00
Matias Bjorling
2d263a7856 null_blk: refactor init and init errors code paths
Simplify the initialization logic of the three block-layers.

- The queue initialization is split into two parts. This allows reuse of
  code when initializing the sq-, bio- and mq-based layers.
- Set submit_queues default value to 0 and always set it at init time.
- Simplify the init error code paths.

Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-12-19 08:09:42 -07:00
Matias Bjorling
0c56010c83 null_blk: mem garbage on NUMA systems during init
For NUMA systems, initializing the blk-mq layer and using per node hctx.
We initialize submit queues to 1, while blk-mq nr_hw_queues is
initialized to the number of NUMA nodes.

This makes the null_init_hctx function overwrite memory outside of what
it allocated.  In my case it lead to writing garbage into struct
request_queue's mq_map.

Signed-off-by: Matias Bjorling <m@bjorling.me>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-19 08:09:38 -07:00
Rashika Kheria
a26ba7fadd drivers: block: Mark the functions as static in skd_main.c
Mark functions skd_skmsg_state_to_str() and skd_skreq_state_to_str() as
static in skd_main.c because they are not used outside this file.

This eliminates the following warnings in skd_main.c:
drivers/block/skd_main.c:5272:13: warning: no previous prototype for ‘skd_skmsg_state_to_str’ [-Wmissing-prototypes]
drivers/block/skd_main.c:5284:13: warning: no previous prototype for ‘skd_skreq_state_to_str’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-12-19 08:06:49 -07:00
Jiri Kosina
e23c34bb41 Merge branch 'master' into for-next
Sync with Linus' tree to be able to apply fixes on top of newer things
in tree (efi-stub).

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-12-19 15:08:32 +01:00
Keith Busch
9a6b94584d NVMe: Device resume error handling
Adds controller error handling on resume power management. If the device
fails to initialize, the device is queued for a reset. If the reset fails,
a thread is spawned to remove the pci device.

If the device resumes as "busy", the device is responding to admin
commands but will not create IO queues. In this case, we need to remove
the gendisks and free the IO queues since they can't be used and may be
holding bios in their lists.

From testing, the dma pools require a pci device so this had to change
the pci driver 'remove' to release the dma resources in line with that
call instead of after all references to the device are released.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:54:39 -05:00
Matthew Wilcox
68608c268b NVMe: Cache dev->pci_dev in a local pointer
Helps with line-length issues

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:54:35 -05:00
Matthew Wilcox
0a8d44cb33 NVMe: Fix lockdep warnings
During the initialisation path, the queue lock is taken without interrupt
protection.  It's perfectly safe to do so, because the interrupt handler
can't run at this point, but it confuses lockdep.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:54:34 -05:00
Keith Busch
320a382746 NVMe: compat SG_IO ioctl
For 32-bit versions of sg3-utils running on a 64-bit system. This is
mostly a copy from the relevent portions of fs/compat_ioctl.c, with
slight modifications for going through block_device_operations.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
[fixed up CONFIG_COMPAT=n build problems]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:49:40 -05:00
Matias Bjorling
57053d8c5c null_blk: mem garbage on NUMA systems during init
For NUMA systems, initializing the blk-mq layer and using per node hctx.
We initialize submit queues to 1, while blk-mq nr_hw_queues is
initialized to the number of NUMA nodes.

This makes the null_init_hctx function overwrite memory outside of what
it allocated.  In my case it lead to writing garbage into struct
request_queue's mq_map.

Signed-off-by: Matias Bjorling <m@bjorling.me>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-15 12:17:16 -08:00
Linus Torvalds
5ee540613d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes for the current series. It contains:

   - A fix for a use-after-free of a request in blk-mq.  From Ming Lei

   - A fix for a blk-mq bug that could attempt to dereference a NULL rq
     if allocation failed

   - Two xen-blkfront small fixes

   - Cleanup of submit_bio_wait() type uses in the kernel, unifying
     that.  From Kent

   - A fix for 32-bit blkg_rwstat reading.  I apologize for this one
     looking mangled in the shortlog, it's entirely my fault for missing
     an empty line between the description and body of the text"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: fix use-after-free of request
  blk-mq: fix dereference of rq->mq_ctx if allocation fails
  block: xen-blkfront: Fix possible NULL ptr dereference
  xen-blkfront: Silence pfn maybe-uninitialized warning
  block: submit_bio_wait() conversions
  Update of blkg_stat and blkg_rwstat may happen in bh context
2013-12-05 15:33:27 -08:00
Alan
6bbdc3984e vio: remove dangly makefile bits
The drivers are long gone but some config escaped the prune

Signed-off-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=57221
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-12-03 17:35:23 +01:00
Jens Axboe
e345d767f6 Merge branch 'stable/for-jens-3.13-take-two' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into for-linus 2013-11-27 19:12:58 -07:00
Felipe Pena
2f089cb89d block: xen-blkfront: Fix possible NULL ptr dereference
In the blkif_release function the bdget_disk() call might returns
a NULL ptr which might be dereferenced on bdev->bd_openers checking

Signed-off-by: Felipe Pena <felipensp@gmail.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Added WARN per Roger's suggestion]
2013-11-26 11:24:01 -05:00
Tim Gardner
427bfe07e6 xen-blkfront: Silence pfn maybe-uninitialized warning
pfn cannot actually be used unless (!info->feature_persistent), nor is
pfn accessed in get_grant() unless (!info->feature_persistent), but silence
this warning anyway. gcc-4.8

drivers/block/xen-blkfront.c: In function 'do_blkif_request':
drivers/block/xen-blkfront.c:508:20: warning: 'pfn' may be used uninitialized in this function [-Wmaybe-uninitialized]
     gnt_list_entry = get_grant(&gref_head, pfn, info);
                    ^
drivers/block/xen-blkfront.c:492:19: note: 'pfn' was declared here
     unsigned long pfn;

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2013-11-26 11:23:53 -05:00
Geert Uytterhoeven
b3ce717205 block/z2ram: Remove duplicate external declarations
Remove the external declarations for m68k_realnum_memory and m68k_memory,
which are already provided by <asm/setup.h>, to avoid conflicts later.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2013-11-26 11:09:10 +01:00
Geert Uytterhoeven
6112ea0862 zorro: ZTWO_VADDR() should return "void __iomem *"
ZTWO_VADDR() converts from physical to virtual I/O addresses, so it should
return "void __iomem *" instead of "unsigned long".

This allows to drop several casts, but requires adding a few casts to
accomodate legacy driver frameworks that store "unsigned long" I/O
addresses.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2013-11-26 11:09:07 +01:00
Kent Overstreet
20d0189b10 block: Introduce new bio_split()
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.

Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.

(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>
2013-11-23 22:33:57 -08:00
Kent Overstreet
ee67891bf1 block: Rename bio_split() -> bio_pair_split()
This is prep work for introducing a more general bio_split().

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Sage Weil <sage@inktank.com>
2013-11-23 22:33:56 -08:00
Kent Overstreet
5341a6278b rbd: Refactor bio cloning
Now that we've got drivers converted to the new immutable bvec
primitives, bio splitting becomes much easier - this is how the new
bio_split() will work. (Someone more familiar with the ceph code could
probably use bio_clone_fast() instead of bio_clone() here).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
2013-11-23 22:33:54 -08:00
Kent Overstreet
feb261e2ee aoe: Convert to immutable biovecs
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.

The aoe code no longer has to manually iterate over partial bvecs, so
some struct members go away - other struct members are effectively
renamed:

buf->resid	-> buf->iter.bi_size
buf->sector	-> buf->iter.bi_sector

f->bcnt		-> f->iter.bi_size
f->lba		-> f->iter.bi_sector

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
2013-11-23 22:33:52 -08:00
Kent Overstreet
003b5c5719 block: Convert drivers to immutable biovecs
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
2013-11-23 22:33:51 -08:00
Kent Overstreet
458b76ed2f block: Kill bio_segments()/bi_vcnt usage
When we start sharing biovecs, keeping bi_vcnt accurate for splits is
going to be error prone - and unnecessary, if we refactor some code.

So bio_segments() has to go - but most of the existing users just needed
to know if the bio had multiple segments, which is easier - add a
bio_multiple_segments() for them.

(Two of the current uses of bio_segments() are going to go away in a
couple patches, but the current implementation of bio_segments() is
unsafe as soon as we start doing driver conversions for immutable
biovecs - so implement a dumb version for bisectability, it'll go away
in a couple patches)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23 22:33:51 -08:00
Kent Overstreet
4550dd6c6b block: Immutable bio vecs
This adds a mechanism by which we can advance a bio by an arbitrary
number of bytes without modifying the biovec: bio->bi_iter.bi_bvec_done
indicates the number of bytes completed in the current bvec.

Various driver code still needs to be updated to not refer to the bvec
directly before we can use this for interesting things, like efficient
bio splitting.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
2013-11-23 22:33:49 -08:00
Kent Overstreet
7988613b0e block: Convert bio_for_each_segment() to bvec_iter
More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
2013-11-23 22:33:49 -08:00
Kent Overstreet
a4ad39b1d1 block: Convert bio_iovec() to bvec_iter
For immutable biovecs, we'll be introducing a new bio_iovec() that uses
our new bvec iterator to construct a biovec, taking into account
bvec_iter->bi_bvec_done - this patch updates existing users for the new
usage.

Some of the existing users really do need a pointer into the bvec array
- those uses are all going to be removed, but we'll need the
functionality from immutable to remove them - so for now rename the
existing bio_iovec() -> __bio_iovec(), and it'll be removed in a couple
patches.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23 22:33:49 -08:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Yuanhan Liu
044c8d4b15 kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
Remove CONFIG_USE_GENERIC_SMP_HELPERS left by commit 0a06ff068f
("kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS").

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Shaohua Li
f02b9ac35a virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
It isn't safe to call it without holding the vblk->vq_lock.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>

Fixed another condition of virtqueue_kick() not holding the lock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-19 19:00:45 -07:00
Michael Opdenacker
481e5bad84 NVMe: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-11-18 17:13:54 -05:00
Haiyan Hu
b80d5ccca3 NVMe: Avoid shift operation when writing cq head doorbell
Changes the type of dev->db_stride to unsigned and changes the value
stored there to be 1 << the current value. Then there is less
calculation to be done at completion time.

Signed-off-by: Haiyan Hu <huhaiyan@huawei.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-11-18 17:10:51 -05:00
Linus Torvalds
f412f2c60b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull second round of block driver updates from Jens Axboe:
 "As mentioned in the original pull request, the bcache bits were pulled
  because of their dependency on the immutable bio vecs.  Kent re-did
  this part and resubmitted it, so here's the 2nd round of (mostly)
  driver updates for 3.13.  It contains:

 - The bcache work from Kent.

 - Conversion of virtio-blk to blk-mq.  This removes the bio and request
   path, and substitutes with the blk-mq path instead.  The end result
   almost 200 deleted lines.  Patch is acked by Asias and Christoph, who
   both did a bunch of testing.

 - A removal of bootmem.h include from Grygorii Strashko, part of a
   larger series of his killing the dependency on that header file.

 - Removal of __cpuinit from blk-mq from Paul Gortmaker"

* 'for-linus' of git://git.kernel.dk/linux-block: (56 commits)
  virtio_blk: blk-mq support
  blk-mq: remove newly added instances of __cpuinit
  bcache: defensively handle format strings
  bcache: Bypass torture test
  bcache: Delete some slower inline asm
  bcache: Use ida for bcache block dev minor
  bcache: Fix sysfs splat on shutdown with flash only devs
  bcache: Better full stripe scanning
  bcache: Have btree_split() insert into parent directly
  bcache: Move spinlock into struct time_stats
  bcache: Kill sequential_merge option
  bcache: Kill bch_next_recurse_key()
  bcache: Avoid deadlocking in garbage collection
  bcache: Incremental gc
  bcache: Add make_btree_freeing_key()
  bcache: Add btree_node_write_sync()
  bcache: PRECEDING_KEY()
  bcache: bch_(btree|extent)_ptr_invalid()
  bcache: Don't bother with bucket refcount for btree node allocations
  bcache: Debug code improvements
  ...
2013-11-15 16:33:41 -08:00
Linus Torvalds
b746f9c794 Nothing really exciting: some groundwork for changing virtio endian, and
some robustness fixes for broken virtio devices, plus minor tweaks.
 
 [vs last pull request: added the virtio-scsi broken vq escape patch, which
 I somehow lost.]
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSgDsJAAoJENkgDmzRrbjxEE4P/jXqZHS/HdlxW9k0BjKKlEIF
 PdtCoP3UhWTdskXvy2pD8m6nYn214MEJYUIa4HFlIEZsdxhuexzQHY19Ynkjagyv
 57sRsUUm5fYQLIL7IUh2DUD1VU38hUFinno/y333szzvCj9qITDA/QABsiWxK8NO
 dq+Lmeixgrhc5yN9iryW+gZV+hekJIZ4LsU5ejSaJucKblzXUH8qIbmSthG7RTYJ
 tr4J7xTTXbhxY4CoC5Dpx2hvsFkvzaAIvI4Nr1mDjfq5cR8BaYvnC89U1IbhdAey
 p1AbZE58JLrY+Z8K8LBRGV2KjO8qSZ6R47hbZ9nAnodJYB7sZLyj6jUe1q+/htuC
 Dh9Xm9O4eW2xNaFk20dYeIF4UU5/HzdsbvG/IlH8x4sm8/K706ocYyAOHlzYUg2T
 k7gltrgDzDokMgb2R44gwnr4oaJ2q8Gne6JXswlPEv2eRs6vNnA5Xhc0rEHGkU6C
 gYn1vNFN6yx0vf2syG/Ce5pZtMxGpefKQkHzzWdq8FKr1B9s54dDuf2hls7J8A9t
 OQT1gE33yURSelf4Kh4k9zWXaWk/Ohv9l2R1cqpALnJ4/+q0fP5t7HdK500S7aax
 DxLeFeqvsBw7nlWgsGxQmt+fjITQFHhcDiwst0ehnt6RbDEW7XPIguz0K/gyhxYG
 +UNbl/5Gr64jnUX3YCzm
 =vY2L
 -----END PGP SIGNATURE-----

Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull virtio updates from Rusty Russell:
 "Nothing really exciting: some groundwork for changing virtio endian,
  and some robustness fixes for broken virtio devices, plus minor
  tweaks"

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  virtio_scsi: verify if queue is broken after virtqueue_get_buf()
  x86, asmlinkage, lguest: Pass in globals into assembler statement
  virtio: mmio: fix signature checking for BE guests
  virtio_ring: adapt to notify() returning bool
  virtio_net: verify if queue is broken after virtqueue_get_buf()
  virtio_console: verify if queue is broken after virtqueue_get_buf()
  virtio_blk: verify if queue is broken after virtqueue_get_buf()
  virtio_ring: add new function virtqueue_is_broken()
  virtio_test: verify if virtqueue_kick() succeeded
  virtio_net: verify if virtqueue_kick() succeeded
  virtio_ring: let virtqueue_{kick()/notify()} return a bool
  virtio_ring: change host notification API
  virtio_config: remove virtio_config_val
  virtio: use size-based config accessors.
  virtio_config: introduce size-based accessors.
  virtio_ring: plug kmemleak false positive.
  virtio: pm: use CONFIG_PM_SLEEP instead of CONFIG_PM
2013-11-15 13:28:47 +09:00
Wolfram Sang
16735d022f tree-wide: use reinit_completion instead of INIT_COMPLETION
Use this new function to make code more comprehensible, since we are
reinitialzing the completion, not initializing.

[akpm@linux-foundation.org: linux-next resyncs]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org> (personally at LCE13)
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:21 +09:00
Jens Axboe
1cf7e9c68f virtio_blk: blk-mq support
Switch virtio-blk from the dual support for old-style requests and bios
to use the block-multiqueue.

Acked-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2013-11-14 08:40:44 -07:00
Jens Axboe
1355b37f11 Merge branch 'for-3.13/post-mq-drivers' into for-linus 2013-11-14 08:29:01 -07:00
Linus Torvalds
5eea9be8b2 Merge branch 'for-3.13/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for 3.13.  As with the core pull
  request just sent out, this was rebased on top of the core branch
  again after the immutable series was pulled.  This also means that
  bcache gets to sit the initial pull over.  I will send a second driver
  pull request in the merge window to get those fixes in, once they have
  been rebased and tested on top of the non-immutable stack.

  This pull request contains:

   - Add support for the sTec Kronos pci-e flash card from sTec.  Also
     has various cleanups for this driver, from myself, Bart, Mike
     Snizter, and Wei Yongjun.

   - Add surprise removal support for the micron mtip32xx driver from
     Micron.

   - Floppy documentation fix from Ben Harris.

   - debugfs bug fix for pktcdvd from Dan Carpenter.

   - Fix for the mtip32xx driver stack usage in the debugfs path,
     dynamically allocating those buffers instead.  From David Milburn.

   - Disable cpqarray in Kconfig.  The plan is to remove it on request
     of HP, but lets disable it for a few revisions just to see if
     anyone yells.

   - drbd fixes from Lars Ellenberg and Philipp Reisner.

   - Elevator switch fix for the s390 block driver from Heiko Carstens.

   - loop crash fix on IO to unassigned device from Mikulas Patocka.

   - A series of bug fixes for the IBM rsxx pci-e flash driver from
     Philip J Kelleher.

   - cciss probe fix from Stephen Cameron.

   - Xen block front/back fixes from Roger Pau Monne and Vegard Nossum"

* 'for-3.13/drivers' of git://git.kernel.dk/linux-block: (41 commits)
  floppy: Correct documentation of driver options when used as a module.
  pktcdvd: debugfs functions return NULL on error
  xen-blkfront: restore the non-persistent data path
  skd: fix formatting in skd_s1120.h
  skd: reorder construct/destruct code
  skd: cleanup skd_do_inq_page_da()
  skd: remove SKD_OMIT_FROM_SRC_DIST ifdefs
  skd: remove redundant skdev->pdev assignment from skd_pci_probe()
  skd: use <asm/unaligned.h>
  skd: remove SCSI subsystem specific includes
  skd: register block device only if some devices are present
  skd: fix error messages in skd_init()
  skd: fix error paths in skd_init()
  skd: fix unregister_blkdev() placement
  skd: more removal of bio-based code
  skd: cleanup the skd_*() function block wrapping
  skd: rip out bio path
  skd: fix error return code in skd_pci_probe()
  s390/dasd: hold request queue sysfs lock when calling elevator_init()
  cciss: return 0 from driver probe function on success, not 1
  ...
2013-11-14 12:13:05 +09:00
Linus Torvalds
0910c0bdf7 Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block
Pull block IO core updates from Jens Axboe:
 "This is the pull request for the core changes in the block layer for
  3.13.  It contains:

   - The new blk-mq request interface.

     This is a new and more scalable queueing model that marries the
     best part of the request based interface we currently have (which
     is fully featured, but scales poorly) and the bio based "interface"
     which the new drivers for high IOPS devices end up using because
     it's much faster than the request based one.

     The bio interface has no block layer support, since it taps into
     the stack much earlier.  This means that drivers end up having to
     implement a lot of functionality on their own, like tagging,
     timeout handling, requeue, etc.  The blk-mq interface provides all
     these.  Some drivers even provide a switch to select bio or rq and
     has code to handle both, since things like merging only works in
     the rq model and hence is faster for some workloads.  This is a
     huge mess.  Conversion of these drivers nets us a substantial code
     reduction.  Initial results on converting SCSI to this model even
     shows an 8x improvement on single queue devices.  So while the
     model was intended to work on the newer multiqueue devices, it has
     substantial improvements for "classic" hardware as well.  This code
     has gone through extensive testing and development, it's now ready
     to go.  A pull request is coming to convert virtio-blk to this
     model will be will be coming as well, with more drivers scheduled
     for 3.14 conversion.

   - Two blktrace fixes from Jan and Chen Gang.

   - A plug merge fix from Alireza Haghdoost.

   - Conversion of __get_cpu_var() from Christoph Lameter.

   - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

   - A fix for a race between request completion and the timeout
     handling from Jeff Moyer.  This is what caused the merge conflict
     with blk-mq/core, in case you are looking at that.

   - A dm stacking fix from Mike Snitzer.

   - A code consolidation fix and duplicated code removal from Kent
     Overstreet.

   - A handful of block bug fixes from Mikulas Patocka, fixing a loop
     crash and memory corruption on blk cg.

   - Elevator switch bug fix from Tomoki Sekiyama.

  A heads-up that I had to rebase this branch.  Initially the immutable
  bio_vecs had been queued up for inclusion, but a week later, it became
  clear that it wasn't fully cooked yet.  So the decision was made to
  pull this out and postpone it until 3.14.  It was a straight forward
  rebase, just pruning out the immutable series and the later fixes of
  problems with it.  The rest of the patches applied directly and no
  further changes were made"

* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: Do not call sector_div() with a 64-bit divisor
  kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
  block: Consolidate duplicated bio_trim() implementations
  block: Use rw_copy_check_uvector()
  block: Enable sysfs nomerge control for I/O requests in the plug list
  block: properly stack underlying max_segment_size to DM device
  elevator: acquire q->sysfs_lock in elevator_change()
  elevator: Fix a race in elevator switching and md device initialization
  block: Replace __get_cpu_var uses
  bdi: test bdi_init failure
  block: fix a probe argument to blk_register_region
  loop: fix crash if blk_alloc_queue fails
  blk-core: Fix memory corruption if blkcg_init_queue fails
  block: fix race between request completion and timeout handling
  blktrace: Send BLK_TN_PROCESS events to all running traces
  blk-mq: don't disallow request merges for req->special being set
  blk-mq: mq plug list breakage
  blk-mq: fix for flush deadlock
  ...
2013-11-14 12:08:14 +09:00
Linus Torvalds
8ceafbfa91 Merge branch 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm
Pull DMA mask updates from Russell King:
 "This series cleans up the handling of DMA masks in a lot of drivers,
  fixing some bugs as we go.

  Some of the more serious errors include:
   - drivers which only set their coherent DMA mask if the attempt to
     set the streaming mask fails.
   - drivers which test for a NULL dma mask pointer, and then set the
     dma mask pointer to a location in their module .data section -
     which will cause problems if the module is reloaded.

  To counter these, I have introduced two helper functions:
   - dma_set_mask_and_coherent() takes care of setting both the
     streaming and coherent masks at the same time, with the correct
     error handling as specified by the API.
   - dma_coerce_mask_and_coherent() which resolves the problem of
     drivers forcefully setting DMA masks.  This is more a marker for
     future work to further clean these locations up - the code which
     creates the devices really should be initialising these, but to fix
     that in one go along with this change could potentially be very
     disruptive.

  The last thing this series does is prise away some of Linux's addition
  to "DMA addresses are physical addresses and RAM always starts at
  zero".  We have ARM LPAE systems where all system memory is above 4GB
  physical, hence having DMA masks interpreted by (eg) the block layers
  as describing physical addresses in the range 0..DMAMASK fails on
  these platforms.  Santosh Shilimkar addresses this in this series; the
  patches were copied to the appropriate people multiple times but were
  ignored.

  Fixing this also gets rid of some ARM weirdness in the setup of the
  max*pfn variables, and brings ARM into line with every other Linux
  architecture as far as those go"

* 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm: (52 commits)
  ARM: 7805/1: mm: change max*pfn to include the physical offset of memory
  ARM: 7797/1: mmc: Use dma_max_pfn(dev) helper for bounce_limit calculations
  ARM: 7796/1: scsi: Use dma_max_pfn(dev) helper for bounce_limit calculations
  ARM: 7795/1: mm: dma-mapping: Add dma_max_pfn(dev) helper function
  ARM: 7794/1: block: Rename parameter dma_mask to max_addr for blk_queue_bounce_limit()
  ARM: DMA-API: better handing of DMA masks for coherent allocations
  ARM: 7857/1: dma: imx-sdma: setup dma mask
  DMA-API: firmware/google/gsmi.c: avoid direct access to DMA masks
  DMA-API: dcdbas: update DMA mask handing
  DMA-API: dma: edma.c: no need to explicitly initialize DMA masks
  DMA-API: usb: musb: use platform_device_register_full() to avoid directly messing with dma masks
  DMA-API: crypto: remove last references to 'static struct device *dev'
  DMA-API: crypto: fix ixp4xx crypto platform device support
  DMA-API: others: use dma_set_coherent_mask()
  DMA-API: staging: use dma_set_coherent_mask()
  DMA-API: usb: use new dma_coerce_mask_and_coherent()
  DMA-API: usb: use dma_set_coherent_mask()
  DMA-API: parport: parport_pc.c: use dma_coerce_mask_and_coherent()
  DMA-API: net: octeon: use dma_coerce_mask_and_coherent()
  DMA-API: net: nxp/lpc_eth: use dma_coerce_mask_and_coherent()
  ...
2013-11-14 07:55:21 +09:00
Dan Carpenter
49c2856af7 pktcdvd: debugfs functions return NULL on error
My static checker complains correctly that this is potential NULL
dereference because debugfs functions return NULL on error.  They return
an ERR_PTR if they are configured out.

We don't need to check for ERR_PTR because if debugfs is stubbed out the
dummy functions won't complain about that.  We don't need to check the
values before calling debugfs_remove() because that accepts ERR_PTRs and
NULL pointers.

We don't need to set pkt->dfs_f_info to NULL in pkt_debugfs_dev_new()
because it was initialized with kzalloc() so I have removed that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-11-08 09:10:31 -07:00
Roger Pau Monne
bfe11d6de1 xen-blkfront: restore the non-persistent data path
When persistent grants were added they were always used, even if the
backend doesn't have this feature (there's no harm in always using the
same set of pages). This restores the old data path when the backend
doesn't have persistent grants, removing the burden of doing a memcpy
when it is not actually needed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Fix up whitespace issues]
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
f1a3c61913 skd: fix formatting in skd_s1120.h
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
542d7b0015 skd: reorder construct/destruct code
Reorder placement of skd_construct(), skd_cons_sg_list(), skd_destruct()
and skd_free_sg_list() functions. Then remove no longer needed function
prototypes.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
fec23f6311 skd: cleanup skd_do_inq_page_da()
skdev->pdev and skdev->pdev->bus are always different than NULL in
skd_do_inq_page_da() so simplify the code accordingly.

Also cache skdev->pdev value in pdev variable while at it.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
7f74e5e944 skd: remove SKD_OMIT_FROM_SRC_DIST ifdefs
SKD_OMIT_FROM_SRC_DIST is never defined.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
ebedd16dab skd: remove redundant skdev->pdev assignment from skd_pci_probe()
skdev->pdev is set to pdev twice in skd_pci_probe(), first time
through skd_construct() call and the second time directly in
the function. Remove the second assignment as it is not needed.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
4ca90b5332 skd: use <asm/unaligned.h>
Use <asm/unaligned.h> instead of <asm-generic/unaligned.h>.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
43504631d0 skd: remove SCSI subsystem specific includes
This is not a SCSI host driver so remove SCSI subsystem specific
includes.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
b8df6647c2 skd: register block device only if some devices are present
Register block device in skd_pci_probe() instead of in skd_init() so it
is registered only if some devices are present (currently it is always
registered when the driver is loaded). Please note that this change
depends on the fact that register_blkdev(0, ...) never returns 0.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
fbed149ab3 skd: fix error messages in skd_init()
* change priority level from KERN_INFO to KERN_ERR
* add "skd: " prefix
* do minor CodingStyle fixes

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
2b91c55222 skd: fix error paths in skd_init()
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
a073ae953e skd: fix unregister_blkdev() placement
register_blkdev() is called before pci_register_driver() in skd_init()
so unregister_blkdev() should be called after pci_unregister_driver()
in skd_exit(). Fix it.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Mike Snitzer
38d4a1bb99 skd: more removal of bio-based code
Remove skd_flush_cmd structure and skd_flush_slab.
Remove skd_end_request wrapper around skd_end_request_blk.
Remove skd_requeue_request, use blk_requeue_request directly.
Cleanup some comments (remove "bio" info) and whitespace.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Jens Axboe
6a5ec65b9a skd: cleanup the skd_*() function block wrapping
Just call the block functions directly, don't wrap them
in skd helpers. With only one queueing model enabled, there's
no point in doing that.

Also kill the ->start_time and ->bio from the skd_request_context,
we don't use those anymore.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Jens Axboe
fcd37eb3c1 skd: rip out bio path
The skd driver has a selectable rq or bio based queueing model.
For 3.14, we want to turn this into a single blk-mq interface
instead. With the immutable biovecs being merged in 3.13, the
bio model would need patches to even work. So rip it out, with
a conversion pending for blk-mq in the next release.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Wei Yongjun
1762b57fcb skd: fix error return code in skd_pci_probe()
Fix to return -ENOMEM in the skd construct error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Stephen M. Cameron
b88fac630b cciss: return 0 from driver probe function on success, not 1
A return value of 1 is interpreted as an error

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
rchinthekindi
2e44b42718 skd: Replaced custom debug PRINTKs with pr_debug
Replaced DPRINTK() and VPRINTK() with pr_debug().

Signed-off-by: Ramprasad C <ramprasad.chinthekindi@hgst.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Akhil Bhansali
f721bb0dbd skd: Fix checkpatch ERRORS and removed unused functions
This patch fixes checkpatch.pl errors for assignment in if condition.
It also removes unused readq / readl function calls.

As Andrew had disabled the compilation of drivers for 32 bit,
I have modified format specifiers in few VPRINTKs to avoid warnings
during 64 bit compilation.

Signed-off-by: Akhil Bhansali <abhansali@stec-inc.com>
Reviewed-by: Ramprasad Chinthekindi <rchinthekindi@stec-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Philip J Kelleher
8c49a77ca4 rsxx: Fix possible kernel panic with invalid config.
This patch fixes a possible Kernel Panic on driver load if
the configuration on the card is messed up or not yet set.
The driver could possible give a 32 bit unsigned all Fs to
the kernel as the device's block size.

Now we only write the block size to the kernel if the
configuration from the card is valid.

Also, driver version is being updated.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Philip J Kelleher
e35f38bf73 rsxx: Disallow discards from being unmapped.
This patch fixes a bug in which discards were always
calling pci_unmap_page. Discards should never call the
pci_unmap_page function call because they are never mapped.

This caused a race condition on PowerPC systems when issuing
discards, writes, and reads all at the same time. The
pci_map_page function would eventually map logical address
0 for a read or write. Discards are always assigned a DMA
address of 0 because they are never mapped. So if
pci_map_page mapped address 0 for a DMA and a discard was
"unmapped" then the address would be freed and would cause
an EEH event to occur when Hardware accesses the address.

This was injected/uncovered in commit:
b347f9cf0bc8d42ee95ba1d3837fd93045ab336b

The pci_dma_mapping_error function declares -1 a DMA_ERROR
not 0 like initially thought So before we would never unmap
discards because they were considered NULL.

This patch should fall on top of commit id:
fc1967bb08a6184ed44ef990e1dd4389901b809c

Also, the driver version is being up dated.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Lars Ellenberg
35f47ef1a1 drbd: avoid to shrink max_bio_size due to peer re-configuration
For a long time, the receiving side has spread "too large" incoming
requests over multiple bios.  No need to shrink our max_bio_size
(max_hw_sectors) if the peer is reconfigured to use a different storage.

The problem manifests itself if we are not the top of the device stack
(DRBD is used a LVM PV).

A hardware reconfiguration on the peer may cause the supported
max_bio_size to shrink, and the connection handshake would now
unnecessarily shrink the max_bio_size on the active node.

There is no way to notify upper layers that they have to "re-stack"
their limits. So they won't notice at all, and may keep submitting bios
that are suddenly considered "too large for device".

We already check for compatibility and ignore changes on the peer,
the code only was masked out unless we have a fully established connection.
We just need to allow it a bit earlier during the handshake.

Also consider max_hw_sectors in our merge bvec function, just in case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Lars Ellenberg
d2da5b0cb5 drbd: fix decoding of bitmap vli rle for device sizes > 64 TB
Symptoms: disconnect after bitmap exchange due to
bitmap overflow (e:49731075554) while decoding bm RLE packet

In the decoding step of the variable length integer run length encoding
there was potentially an uncatched bitshift by wordsize (variable >> 64).

The result of which is "undefined" :(
(only "sometimes" the result is the desired 0)

Fix: don't do any bit shift magic for shift == 64, just assign.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philipp Reisner
57737adc96 drbd: Fix adding of new minors with freshly created meta data
Online adding of new minors with freshly created meta data
to an resource with an established connection failed, with a
wrong state transition on one side on one side of the new minor.

Freshly created meta-data has a la_size (last agreed size) of 0.
When we online add such devices, the code wrongly got into
the code path for resyncing new storage that was added while
the disk was detached.

Fixed that by making the GREW from ZERO a special case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philipp Reisner
b874d231e1 drbd: Fix an connection drop issue after enabling allow-two-primaries
Since drbd-8.4.0 it is possible to change the allow-two-primaries
network option while the connection is established.

The sequence code used to partially order packets from the
data socket with packets from the meta-data socket, still assued
that the allow-two-primaries option is constant while the
connection is established.

I.e.
On a node that has the RESOLVE_CONFLICTS bits set, after enabling
allow-two-primaries, when receiving the next data packet it timed out
while waiting for the necessary packets on the data socket to arrive
(wait_for_and_update_peer_seq() function).

Fixed that by always tracking the sequence number, but only waiting
for it if allow-two-primaries is set.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Lars Ellenberg
69babf05cb drbd: fix NULL pointer deref in module init error path
If we want to iterate over the (as of yet still empty) list in the
cleanup path, we need to initialize the list before the first goto fail.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Jens Axboe
7badfb1c34 block: disable cpqarray in Kconfig
Mike writes:

"cpqarray hasn't been used in over 12 years. It's doubtful that anyone
 still uses the board. It's time the driver was removed from the mainline
 kernel.  The only updates these days are minor and mostly done by people
 outside of HP."

If nobody yells, we'll remove it from the kernel tree completely
for 3.15.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Akhil Bhansali
e67f86b31a Add support for sTec's pci-e flash card Kronos
Signed-off-by: Akhil Bhansali <abhansali@stec-inc.com>
Signed-off-by: Ramprasad Chinthekindi <rchinthekindi@stec-inc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Folded patch, contributions to clean up this driver from:

Jens Axboe
Dan Carpenter
Andrew Morton

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philip J Kelleher
0317cd6de8 rsxx: Kernel Panic caused by mapping Discards
This fixes a kernel panic injected by commit id
8d26750143341831bc312f61c5ed141eeb75b8d0 where discards
are getting mapped through the pci_map_page function call.

The driver will now start verifying that a dma is not a
discard before issuing a the pci_map_page function call.

Also, we are updating the driver version.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
David Milburn
c8afd0dcbd mtip32xx: dynamically allocate buffer in debugfs functions
Dynamically allocate buf to prevent warnings:

drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_device_status’:
drivers/block/mtip32xx/mtip32xx.c:2823: warning: the frame size of 1056 bytes is larger than 1024 bytes
drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_registers’:
drivers/block/mtip32xx/mtip32xx.c:2894: warning: the frame size of 1056 bytes is larger than 1024 bytes
drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_flags’:
drivers/block/mtip32xx/mtip32xx.c:2917: warning: the frame size of 1056 bytes is larger than 1024 bytes

Signed-off-by: David Milburn <dmilburn@redhat.com>
Acked-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Asai Thambi S P
8f8b899563 mtip32xx: Add SRSI support
This patch add support for SRSI(Surprise Removal Surprise Insertion).

Approach:
---------
Surprise Removal:
-----------------
On surprise removal of the device, gendisk, request queue, device index, sysfs
entries, etc are retained as long as device is in use - mounted filesystem,
device opened by an application, etc. The service thread breaks out of the main
while loop, waits for pci remove to exit, and then waits for device to become
free. When there no holders of the device, service thread cleans up the block
and device related stuff and returns.

Surprise Insertion:
-------------------
No change, this scenario follows the normal pci probe() function flow.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philip J Kelleher
1b21f5b2ad rsxx: Moving pci_map_page to prevent overflow.
The pci_map_page function has been moved into our
issued workqueue to prevent an us running out of
mappable addresses on non-HWWD PCIe x8 slots. The
maximum amount that can possible be mapped at one
time now is: 255 dmas X 4 dma channels X 4096 Bytes.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philip J Kelleher
e5feab229f rsxx: Handling failed pci_map_page on PowerPC and double free.
The rsxx driver was not checking the correct value during a
pci_map_page failure. Fixing this also uncovered a
double free if the bio was returned before it was
broken up into indiviadual 4k dmas, that is also
fixed here.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Mikulas Patocka
ef7e7c82e0 loop: fix crash when using unassigned loop device
When the loop module is loaded, it creates 8 loop devices /dev/loop[0-7].
The devices have no request routine and thus, when they are used without
being assigned, a crash happens.

For example, these commands cause crash (assuming there are no used loop
devices):

Kernel Fault: Code=26 regs=000000007f420980 (Addr=0000000000000010)
CPU: 1 PID: 50 Comm: kworker/1:1 Not tainted 3.11.0 #1
Workqueue: ksnaphd do_metadata [dm_snapshot]
task: 000000007fcf4078 ti: 000000007f420000 task.ti: 000000007f420000
[  116.319988]
     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001001111111100001111 Not tainted
r00-03  000000ff0804ff0f 00000000408bf5d0 00000000402d8204 000000007b7ff6c0
r04-07  00000000408a95d0 000000007f420950 000000007b7ff6c0 000000007d06c930
r08-11  000000007f4205c0 0000000000000001 000000007f4205c0 000000007f4204b8
r12-15  0000000000000010 0000000000000000 0000000000000000 0000000000000000
r16-19  000000001108dd48 000000004061cd7c 000000007d859800 000000000800000f
r20-23  0000000000000000 0000000000000008 0000000000000000 0000000000000000
r24-27  00000000ffffffff 000000007b7ff6c0 000000007d859800 00000000408a95d0
r28-31  0000000000000000 000000007f420950 000000007f420980 000000007f4208e8
sr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000303000
sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  117.549988]
IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d82fc 00000000402d8300
 IIR: 53820020    ISR: 0000000000000000  IOR: 0000000000000010
 CPU:        1   CR30: 000000007f420000 CR31: ffffffffffffffff
 ORIG_R28: 0000000000000001
 IAOQ[0]: generic_make_request+0x11c/0x1a0
 IAOQ[1]: generic_make_request+0x120/0x1a0
 RP(r2): generic_make_request+0x24/0x1a0
Backtrace:
 [<00000000402d83f0>] submit_bio+0x70/0x140
 [<0000000011087c4c>] dispatch_io+0x234/0x478 [dm_mod]
 [<0000000011087f44>] sync_io+0xb4/0x190 [dm_mod]
 [<00000000110883bc>] dm_io+0x2c4/0x310 [dm_mod]
 [<00000000110bfcd0>] do_metadata+0x28/0xb0 [dm_snapshot]
 [<00000000401591d8>] process_one_work+0x160/0x460
 [<0000000040159bc0>] worker_thread+0x300/0x478
 [<0000000040161a70>] kthread+0x118/0x128
 [<0000000040104020>] end_fault_vector+0x20/0x28
 [<0000000040177220>] task_tick_fair+0x420/0x4d0
 [<00000000401aa048>] invoke_rcu_core+0x50/0x60
 [<00000000401ad5b8>] rcu_check_callbacks+0x210/0x8d8
 [<000000004014aaa0>] update_process_times+0xa8/0xc0
 [<00000000401ab86c>] rcu_process_callbacks+0x4b4/0x598
 [<0000000040142408>] __do_softirq+0x250/0x2c0
 [<00000000401789d0>] find_busiest_group+0x3c0/0xc70
[  119.379988]
Kernel panic - not syncing: Kernel Fault
Rebooting in 1 seconds..

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Vegard Nossum
ea5ec76d76 xen/blkback: fix reference counting
If the permission check fails, we drop a reference to the blkif without
having taken it in the first place. The bug was introduced in commit
604c499cbb (xen/blkback: Check device
permissions before allowing OP_DISCARD).

Cc: stable@vger.kernel.org
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Roger Pau Monne
c47206e25f xen-blkfront: improve aproximation of required grants per request
Improve the calculation of required grants to process a request by
using nr_phys_segments instead of always assuming a request is going
to use all posible segments.

nr_phys_segments contains the number of scatter-gather DMA addr+len
pairs, which is basically what we put at every granted page.
for_each_sg iterates over the DMA addr+len pairs and uses a grant
page for each of them.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Roger Pau Monne
fbe363c476 xen-blkfront: revoke foreign access for grants not mapped by the backend
There's no need to keep the foreign access in a grant if it is not
persistently mapped by the backend. This allows us to free grants that
are not mapped by the backend, thus preventing blkfront from hoarding
all grants.

The main effect of this is that blkfront will only persistently map
the same grants as the backend, and it will always try to use grants
that are already mapped by the backend. Also the number of persistent
grants in blkfront is the same as in blkback (and is controlled by the
value in blkback).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Matt Wilson <msw@amazon.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Michael Opdenacker
370d668668 mg_disk: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Jens Axboe
e37459b8e2 Merge branch 'blk-mq/core' into for-3.13/core
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/blk-timeout.c
2013-11-08 09:08:12 -07:00
Kent Overstreet
6678d83f18 block: Consolidate duplicated bio_trim() implementations
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:02:31 -07:00
Mikulas Patocka
a207f59376 block: fix a probe argument to blk_register_region
The probe function is supposed to return NULL on failure (as we can see in
kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;

However, in loop and brd, it returns negative error from ERR_PTR.

This causes a crash if we simulate disk allocation failure and run
less -f /dev/loop0 because the negative number is interpreted as a pointer:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4
IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450
PGD 23c677067 PUD 23d6d1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
CPU: 1 PID: 6831 Comm: less Tainted: P        W  O 3.10.15-devel #18
Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000
RIP: 0010:[<ffffffff8118b188>]  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP: 0018:ffff88023e47dbd8  EFLAGS: 00010286
RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658
R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640
FS:  00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
 ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
 ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
Call Trace:
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c365>] blkdev_open+0x65/0x80
 [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
 [<ffffffff8114d214>] finish_open+0x34/0x50
 [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
 [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
 [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0
 [<ffffffff8115f4f0>] do_filp_open+0x40/0x90
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
 [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0
 [<ffffffff8114e559>] SyS_open+0x19/0x20
 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5
a4 07 00 44 89
RIP  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
 RSP <ffff88023e47dbd8>
CR2: 00000000000002b4
---[ end trace bb7f32dbf02398dc ]---

The brd change should be backported to stable kernels starting with 2.6.25.
The loop change should be backported to stable kernels starting with 2.6.22.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org	# 2.6.22+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 08:59:39 -07:00
Mikulas Patocka
3ec981e30f loop: fix crash if blk_alloc_queue fails
loop: fix crash if blk_alloc_queue fails

If blk_alloc_queue fails, loop_add cleans up, but it doesn't clean up the
identifier allocated with idr_alloc. That causes crash on module unload in
idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); where we attempt to
remove non-existed device with that id.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000380
IP: [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
PGD 43d399067 PUD 43d0ad067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop(-) dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_powersave spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc lm85 hwmon_vid snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq ohci_hcd freq_table tg3 ehci_pci mperf ehci_hcd kvm_amd kvm sata_svw serverworks libphy libata ide_core k10temp usbcore hwmon microcode ptp pcspkr pps_core e100 skge mii usb_common i2c_piix4 floppy evdev rtc_cmos i2c_core processor but!
 ton unix
CPU: 7 PID: 2735 Comm: rmmod Tainted: G        W    3.10.15-devel #15
Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
task: ffff88043d38e780 ti: ffff88043d21e000 task.ti: ffff88043d21e000
RIP: 0010:[<ffffffff812057c9>]  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
RSP: 0018:ffff88043d21fe10  EFLAGS: 00010282
RAX: ffffffffa05102e0 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88043ea82800 RDI: 0000000000000000
RBP: ffff88043d21fe48 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000ff
R13: 0000000000000080 R14: 0000000000000000 R15: ffff88043ea82800
FS:  00007ff646534700(0000) GS:ffff880447000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000380 CR3: 000000043e9bf000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffffffff8100aba4 0000000000000092 ffff88043d21fe48 ffff88043ea82800
 00000000000000ff ffff88043d21fe98 0000000000000000 ffff88043d21fe60
 ffffffffa05102b4 0000000000000000 ffff88043d21fe70 ffffffffa05102ec
Call Trace:
 [<ffffffff8100aba4>] ? native_sched_clock+0x24/0x80
 [<ffffffffa05102b4>] loop_remove+0x14/0x40 [loop]
 [<ffffffffa05102ec>] loop_exit_cb+0xc/0x10 [loop]
 [<ffffffff81217b74>] idr_for_each+0x104/0x190
 [<ffffffffa05102e0>] ? loop_remove+0x40/0x40 [loop]
 [<ffffffff8109adc5>] ? trace_hardirqs_on_caller+0x105/0x1d0
 [<ffffffffa05135dc>] loop_exit+0x34/0xa58 [loop]
 [<ffffffff810a98ea>] SyS_delete_module+0x13a/0x260
 [<ffffffff81221d5e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: f0 4c 8b 6d f8 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 4c 8d af 80 00 00 00 41 54 53 48 89 fb 48 83 ec 18 <48> 83 bf 80 03 00
00 00 74 4d e8 98 fe ff ff 31 f6 48 c7 c7 20
RIP  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
 RSP <ffff88043d21fe10>
CR2: 0000000000000380
---[ end trace 64ec069ec70f1309 ]---

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org	# 3.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 08:59:26 -07:00
Heinz Graalfs
7f03b17d5c virtio_blk: verify if queue is broken after virtqueue_get_buf()
In case virtqueue_get_buf() returned with a NULL pointer verify if the
virtqueue is broken in order to leave while loop.

Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-10-29 11:28:17 +10:30
Olof Johansson
43d93947a5 Via Paul Walmsley <paul@pwsan.com>:
Move some of the OMAP2+ CM and System Control Module direct
 register accesses into CM- and System Control
 Module-specific "drivers" underneath arch/arm/mach-omap2/.  This
 is a prerequisite for moving this code out of arch/arm/mach-omap2/ into
 drivers/.
 
 Basic test logs are available here:
 
 http://www.pwsan.com/omap/testlogs/cm_scm_cleanup_a_v3.13/20131019101809/
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSZAZXAAoJEBvUPslcq6VzCWQQAKH4Rj0izwbbLkgBAeeaQz5K
 oJgPJ6UPLOJ2uLIUauCKUSR6+nktrCTfV8P+J4DhCc6OiGrKBXJhSETPgaTbWsNw
 Bd577pmuvXSfNFXUaLwCgkSmafJ1pi6d7kEx/7ZW3TziVE/aUxyeHkrMtWJHrjTP
 28tJVieOxLlO5iK06DfmGcCpLUBKJKtgGRo0h/oqMhLAaN5S8//lyVYgdsto7oCN
 /bes6OpuVVdKiSr78V4rCVtR5Lij5+lVrT8HDiw2BA0V3bYcI7+CVlWBPZ3mYkuy
 oAJDcn9whNyfWS+SsaTIjy6nHsgQkhEJnhrQW3k2skVZobRtWDv7U5LiTjsUhb3o
 pjyWD8zZ7jqrkgyLsai6dm1zsljMQXsIQwH5h++HdCRhtNOXd6bVQZy0KqkpLu0y
 Bhpt8/edh4Bdc305oB05/Y9Uxr7Gr8M377chVZx+JD3rxIDjRRyOJcRIhd27WZEf
 HSMLpO/ayUXWdDuTlKW0IEnImx3PrxT913cnjIY589FhfdahfGQoft4sWDeiQLAX
 +zVYZljeY+GxbUWO6aY4m2PfVN9p/Hwal58NZZgj59wq9iHUuJErK11X7rj+2vwN
 +20IS8sikz6Iym84iC0T+omUeFVY0Zo004DVvpPB+D1C2LpwdI1c6kTz4DYT1EBP
 pvs8Wihkk7xQxQn0rBGP
 =L37r
 -----END PGP SIGNATURE-----

Merge tag 'omap-for-v3.13/cm-scm-cleanup-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into next/cleanup

From Paul Walmsley <paul@pwsan.com> via Tony Lindgren:

Move some of the OMAP2+ CM and System Control Module direct
register accesses into CM- and System Control
Module-specific "drivers" underneath arch/arm/mach-omap2/.  This
is a prerequisite for moving this code out of arch/arm/mach-omap2/ into
drivers/.

Basic test logs are available here:

http://www.pwsan.com/omap/testlogs/cm_scm_cleanup_a_v3.13/20131019101809/

* tag 'omap-for-v3.13/cm-scm-cleanup-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP3: control: add API for setting IVA bootmode
  ARM: OMAP3: CM/control: move CM scratchpad save to CM driver
  ARM: OMAP3: McBSP: do not access CM register directly
  ARM: OMAP3: clock: add API to enable/disable autoidle for a single clock
  ARM: OMAP2: CM/PM: remove direct register accesses outside CM code
  + Linux 3.12-rc4

Signed-off-by: Olof Johansson <olof@lixom.net>
2013-10-28 14:39:03 -07:00
Jens Axboe
f2298c0403 null_blk: multi queue aware block test driver
A driver that simply completes IO it receives, it does no
transfers. Written to fascilitate testing of the blk-mq code.
It supports various module options to use either bio queueing,
rq queueing, or mq mode.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:56:00 +01:00
Jens Axboe
5953316dbf block: make rq->cmd_flags be 64-bit
We have officially run out of flags in a 32-bit space. Extend it
to 64-bit even on 32-bit archs.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Rusty Russell
855e0c5288 virtio: use size-based config accessors.
This lets the transport do endian conversion if necessary, and insulates
the drivers from the difference.

Most drivers can use the simple helpers virtio_cread() and virtio_cwrite().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-10-17 10:55:37 +10:30
Olof Johansson
ec01791d0f This deletes the Shark SA110-based sub-architecture from
the kernel.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSQuSSAAoJEEEQszewGV1zi00QALIJQt28Ir03WoxpbfLev2WP
 2s/f3zwW76XkNhWhh7FNG+GCl7VUcLcLSXod/nYE7wvbVtx9DVGzpWRbzqmNW/Ti
 5eqkk9EQso1G8jEseBiKaWsO7JxKkwa+nEYkGaqbm9K5Qk9iVlQ1hTF/m9/L64Dq
 I4BazqHJuU8g7sJhNaUaVvJLX5YXEeYTaTl43XoEEHy24HDqnvT2sIkiEj1WXLGE
 EmsOHVYLeJ4z7kZK8Dl0ZW/YFzrP3UAM25KObRYJXKAwGZhbTNreujF0RDdg5CBY
 6etMU7h7W90KkCGXzbgt4AkHwBDHM6EGb81ZiKOWZBGlrQRQFyAas4e8UTgEQh1F
 +I9PKNY/m94F9+RMpV9a35hNHNUj/+XskmXmoIFnCgSsrgYDIoSzZqV+v3iVDHRb
 oeX4R6S5ZbmxT3QC0n1xdctGn8BbgzGpAUnvfcNHG/IypNjCuGk9Femu7CQS2Pka
 R3BuTcizcppbITaFeaLLorRkC8U/GPuh5JmMUpyJh/+aq+wiWDw9oBEVfNclHB67
 Qo44EAN1WPv9M+EyQAhzUndJIFj0Ib4wUM7ILjHeXJLVvwQfpE1dQURDs//6Wwgp
 /hALpZqa5Y9oxPJO3Fgmmmci6q0P3+EFX5PT8GO2ecGWhhDlCSfLWc/lEtYrjzxG
 t77iKH/DdzX5nfrr4AQQ
 =r6/u
 -----END PGP SIGNATURE-----

Merge tag 'del-shark-for-v3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson into next/cleanup

From Linus Walleij:

This deletes the Shark SA110-based sub-architecture from
the kernel.

* tag 'del-shark-for-v3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson:
  video: drop code for ARCH_SHARK in cyber2000fb
  input: i8042: drop dependency on ARCH_SHARK
  ide: drop dependency on ARCH_SHARK
  block: drop dependency on ARCH_SHARK
  MAINTAINERS: delete Shark machine entry
  ARM: delete mach-shark

Signed-off-by: Olof Johansson <olof@lixom.net>
2013-09-25 21:55:46 -07:00
Linus Walleij
3c5710f6a2 block: drop dependency on ARCH_SHARK
With this machine deleted, there is no need to maintain the
MFM block driver for its hard disk either.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2013-09-25 15:25:17 +02:00
Dan Carpenter
58f09e00ae cciss: fix info leak in cciss_ioctl32_passthru()
The arg64 struct has a hole after ->buf_size which isn't cleared.  Or if
any of the calls to copy_from_user() fail then that would cause an
information leak as well.

This was assigned CVE-2013-2147.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:26 -07:00
Dan Carpenter
627aad1c01 cpqarray: fix info leak in ida_locked_ioctl()
The pciinfo struct has a two byte hole after ->dev_fn so stack
information could be leaked to the user.

This was assigned CVE-2013-2147.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:26 -07:00
Aaron Lu
8910700039 virtio: pm: use CONFIG_PM_SLEEP instead of CONFIG_PM
The freeze and restore functions defined in virtio drivers are used
for suspend and hibernate, so CONFIG_PM_SLEEP is more appropriate than
CONFIG_PM. This patch replace all CONFIG_PM with CONFIG_PM_SLEEP for
virtio drivers that implement freeze and restore callbacks.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-09-23 15:45:58 +09:30
Russell King
052d0efada DMA-API: block: nvme-core: replace dma_set_mask()+dma_set_coherent_mask() with new helper
Replace the following sequence:

	dma_set_mask(dev, mask);
	dma_set_coherent_mask(dev, mask);

with a call to the new helper dma_set_mask_and_coherent().

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2013-09-21 21:02:24 +01:00
Linus Torvalds
e9ff04dd94 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "These fix several bugs with RBD from 3.11 that didn't get tested in
  time for the merge window: some error handling, a use-after-free, and
  a sequencing issue when unmapping and image races with a notify
  operation.

  There is also a patch fixing a problem with the new ceph + fscache
  code that just went in"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  fscache: check consistency does not decrement refcount
  rbd: fix error handling from rbd_snap_name()
  rbd: ignore unmapped snapshots that no longer exist
  rbd: fix use-after free of rbd_dev->disk
  rbd: make rbd_obj_notify_ack() synchronous
  rbd: complete notifies before cleaning up osd_client and rbd_dev
  libceph: add function to ensure notifies are complete
2013-09-19 12:50:37 -05:00
Martin Schwidefsky
0244ad004a Remove GENERIC_HARDIRQ config option
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13 15:09:52 +02:00
Joe Perches
666dc7c90a pktcdvd: fix defective misuses of pkt_<level>
Fix thinkos where pkt_<level> needs a valid pktcdvd_device * and the
pointer is known to be NULL.

Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> (go smatch!)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:34 -07:00
Joe Perches
f3ded788bb pktcdvd: add struct pktcdvd_device * to pkt_dump_sense()
Allow the device name to be emitted with pkt_err when logging the sense
data.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:33 -07:00
Joe Perches
0c075d64df pktcdvd: convert pr_info to pkt_info
Add a new pkt_info macro to prefix the name to the logging output.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:33 -07:00
Joe Perches
ca73dabc3d pktcdvd: convert pr_notice to pkt_notice
Add a new pkt_notice macro to prefix the name to the logging output.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
fa63c0ab81 pktcdvd: add struct pktcdvd_device.name to pr_err logging where possible
Add a new pkt_err macro to prefix the name to the logging output.  Convert
pr_err where there is a non-null struct pktcdvd_device.

Includes improvements from Andy Shevchenko.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
844aa79743 pktcdvd: add struct pktcdvd_device * to pkt_dbg
Add pd->name to output for these debugging messages.

Remove normally compiled out pkt_dbg(2, ...) function entry tracing
equivalents as it's better done via the function tracer.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
cd3f2cd05c pktcdvd: consolidate DPRINTK and VPRINTK macros
Use the more common pkt_dbg(level, fmt, ...) form.

These messages are emitted at KERN_NOTICE.

Always emit function name with pkt_dbg(2, ...) uses and remove the
sometimes abbreviated embedded function name.

This form always verifies the format and arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
99481334bc pktcdvd: convert printk to pr_<level>
Use a more current logging style and add messages levels to the logging
messages.

Simplify pkt_dump_sense by using %*ph and adding a simple function to emit
the sense string.

Includes improvements from Andy Shevchenko and Dan Carpenter.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:30 -07:00
Joe Perches
5323fb770b pktcdvd: convert ZONE macro to static function get_zone()
Macros should be converted to functions where feasible to
verify arguments and the like.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:30 -07:00
Ed Cashin
fea1b13973 aoe: do not BUG if memory pressure prevented debugfs file creation
If the system has trouble allocating memory for the creation of the aoe
debugfs directory or of a file inside it, the debugfs member of an aoedev
can be NULL.

Do not treat a NULL debugfs pointer as a BUG on aoedev shutdown, avoiding
the user impact of an unecessary panic.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:28 -07:00
Andy Shevchenko
e0ec360597 aoe: suppress compiler warnings
This patch fixes following compiler warnings:

  drivers/block/aoe/aoecmd.c: In function `aoecmd_ata_rw':
  drivers/block/aoe/aoecmd.c:383:17: warning: variable `t' set but not used [-Wunused-but-set-variable]
    struct aoetgt *t;
                   ^
  drivers/block/aoe/aoecmd.c: In function `resend':
  drivers/block/aoe/aoecmd.c:488:21: warning: variable `ah' set but not used [-Wunused-but-set-variable]
    struct aoe_atahdr *ah;
                       ^

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:27 -07:00
Andy Shevchenko
a88c1f0cac aoe: remove custom implementation of kbasename()
In the kernel we have a nice helper that may be used here. This patch
substitutes the custom implementation by the native function call.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:26 -07:00
Ed Cashin
896dcd9a64 aoe: update internal version number to 85
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:26 -07:00
Ed Cashin
ec345120c5 aoe: update copyright date
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:25 -07:00
Ed Cashin
2256c1c51e aoe: fill in per-AoE-target information for debugfs file
This information is presented in a compact format that has evolved for
easy routine scanning by expert humans, mostly developers and support
technicians helping to troubleshoot or test AoE-based systems.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:25 -07:00
Ed Cashin
1cf94797c2 aoe: provide file operations for debugfs files
The place holder in the file contents is filled out in the following
patch.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:24 -07:00
Ed Cashin
e8866cf2b9 aoe: add AoE-target files to debugfs
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:23 -07:00
Ed Cashin
190519cd30 aoe: create and destroy debugfs directory for aoe
This series adds the debugging information that the coraid.com-distributed
aoe driver exports via sysfs, but instead of sysfs, it uses debugfs.

With these patches applied, even without AoE targets on the network, KEDR
reports new possible memory leaks, but these are from callers outside the
aoe driver that have used aoe_devnode to get the name of the character
devices through the aoe_class->devnode callback, and I believe they're
responsible for freeing that memory.

This patch:

Create and destroy the debugfs directory.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:22 -07:00
Jingoo Han
c07303c0af drivers/block/swim.c: remove unnecessary platform_set_drvdata()
The driver core clears the driver data to NULL after device_release or
on probe failure.  Thus, it is not needed to manually clear the device
driver data to NULL.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:59 -07:00
Mike Miller
e7b18ede44 cciss: set max scatter gather entries to 32 on P600
At one time we used to set the maximum number of scatter gather elements
on all Smart Array controllers to 32.  At some point in time the
firmware began to write the "appropriate" value for each controller into
the config table.  The cciss driver would then read that and set
h->maxsgentries.

        h->maxsgentries = readl(&(h->cfgtable->MaxSGElements);

On the P600 that value is 544.  Under some workloads a significant
performance reduction may result.  This patch forces the P600 to use
only 32 scatter gather elements.  Other controllers are not affected.

Signed-off-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Dwight (Bud) Brown <bubrown@redhat.com>
Signed-off-by: Tomas Henzl <thenzl@redhat.com>
Acked-by: Stephen M. Cameron <steve.cameron@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:59 -07:00
Jingoo Han
c86db975c8 drivers/block/mg_disk.c: make mg_times_out() static
mg_times_out() is used only in this file.  Fix the following sparse
warning:

  drivers/block/mg_disk.c:639:6: warning: symbol 'mg_times_out' was not declared. Should it be static?

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:59 -07:00
Jingoo Han
bb8e0e84b3 block: replace strict_strtoul() with kstrtoul()
The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete.  Thus, kstrtoul() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:56 -07:00
Josh Durgin
da6a6b6397 rbd: fix error handling from rbd_snap_name()
rbd_snap_name() calls rbd_dev_v{1,2}_snap_name() depending on the
format of the image. The format 1 version returns NULL on error, which
is handled by the caller. The format 2 version returns an ERR_PTR,
which the caller of rbd_snap_name() does not expect.

Fortunately this is unlikely to occur in practice because
rbd_snap_id_by_name() is called before rbd_snap_name(). This would hit
similar errors to rbd_snap_name() (like the snapshot not existing) and
return early, so rbd_snap_name() would not hit an error unless the
snapshot was removed between the two calls or memory was exhausted.

Use an ERR_PTR in rbd_dev_v1_snap_name() so that the specific error
can be propagated, and it is consistent with rbd_dev_v2_snap_name().
Handle the ERR_PTR in the only rbd_snap_name() caller.

Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:44 -07:00
Josh Durgin
efadc98aab rbd: ignore unmapped snapshots that no longer exist
This prevents erroring out while adding a device when a snapshot
unrelated to the current mapping is deleted between reading the
snapshot context and reading the snapshot names. If the mapped
snapshot name is not found an error still occurs as usual.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:32 -07:00
Josh Durgin
9875201e10 rbd: fix use-after free of rbd_dev->disk
Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.

To fix this, check whether RBD_DEV_FLAG_REMOVING is set before
updating the disk size in rbd_dev_refresh(). In order to prevent a
race where rbd_dev_refresh() is already revalidating the disk when
rbd_remove() is called, move the call to rbd_bus_del_dev() after the
watch is unregistered and all notifies are complete. It's safe to
defer deleting this structure because no new requests can be submitted
once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be
opened.

Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:25 -07:00
Josh Durgin
20e0af67ce rbd: make rbd_obj_notify_ack() synchronous
The only user of rbd_obj_notify_ack() is rbd_watch_cb(). It used
asynchronously with no tracking of when the notify ack completes, so
it may still be in progress when the osd_client is shut down.  This
results in a BUG() since the osd client assumes no requests are in
flight when it stops. Since all notifies are flushed before the
osd_client is stopped, waiting for the notify ack to complete before
returning from the watch callback ensures there are no notify acks in
flight during shutdown.

Rename rbd_obj_notify_ack() to rbd_obj_notify_ack_sync() to reflect
its new synchronous nature.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:02 -07:00
Josh Durgin
9abc59908e rbd: complete notifies before cleaning up osd_client and rbd_dev
To ensure rbd_dev is not used after it's released, flush all pending
notify callbacks before calling rbd_dev_image_release(). No new
notifies can be added to the queue at this point because the watch has
already be unregistered with the osd_client.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:15:57 -07:00
Linus Torvalds
6cccc7d301 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This includes both the first pile of Ceph patches (which I sent to
  torvalds@vger, sigh) and a few new patches that add support for
  fscache for Ceph.  That includes a few fscache core fixes that David
  Howells asked go through the Ceph tree.  (Thanks go to Milosz Tanski
  for putting this feature together)

  This first batch of patches (included here) had (has) several
  important RBD bug fixes, hole punch support, several different
  cleanups in the page cache interactions, improvements in the truncate
  code (new truncate mutex to avoid shenanigans with i_mutex), and a
  series of fixes in the synchronous striping read/write code.

  On top of that is a random collection of small fixes all across the
  tree (error code checks and error path cleanup, obsolete wq flags,
  etc)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
  ceph: use d_invalidate() to invalidate aliases
  ceph: remove ceph_lookup_inode()
  ceph: trivial buildbot warnings fix
  ceph: Do not do invalidate if the filesystem is mounted nofsc
  ceph: page still marked private_2
  ceph: ceph_readpage_to_fscache didn't check if marked
  ceph: clean PgPrivate2 on returning from readpages
  ceph: use fscache as a local presisent cache
  fscache: Netfs function for cleanup post readpages
  FS-Cache: Fix heading in documentation
  CacheFiles: Implement interface to check cache consistency
  FS-Cache: Add interface to check consistency of a cached object
  rbd: fix null dereference in dout
  rbd: fix buffer size for writes to images with snapshots
  libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
  rbd: fix I/O error propagation for reads
  ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  ceph: allow sync_read/write return partial successed size of read/write.
  ceph: fix bugs about handling short-read for sync read mode.
  ceph: remove useless variable revoked_rdcache
  ...
2013-09-09 09:13:22 -07:00
Linus Torvalds
b409624ad5 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVM Express driver update from Matthew Wilcox.

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Merge issue on character device bring-up
  NVMe: Handle ioremap failure
  NVMe: Add pci suspend/resume driver callbacks
  NVMe: Use normal shutdown
  NVMe: Separate controller init from disk discovery
  NVMe: Separate queue alloc/free from create/delete
  NVMe: Group pci related actions in functions
  NVMe: Disk stats for read/write commands only
  NVMe: Bring up cdev on set feature failure
  NVMe: Fix checkpatch issues
  NVMe: Namespace IDs are unsigned
  NVMe: Update nvme_id_power_state with latest spec
  NVMe: Split header file into user-visible and kernel-visible pieces
  NVMe: Call nvme_process_cq from submission path
  NVMe: Remove "process_cq did something" message
  NVMe: Return correct value from interrupt handler
  NVMe: Disk IO statistics
  NVMe: Restructure MSI / MSI-X setup
  NVMe: Use kzalloc instead of kmalloc+memset
2013-09-07 20:19:02 -07:00
Keith Busch
d82e8bfdef NVMe: Merge issue on character device bring-up
A recent patch made it possible to bring up the character handle when the
device is responsive but not accepting a set-features command. Another
recent patch moved the initialization that requires we move where the
checks for this condition occur. This patch merges these two ideas so
it works much as before.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-06 16:26:58 -04:00
Linus Torvalds
2e515bf096 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "The usual trivial updates all over the tree -- mostly typo fixes and
  documentation updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
  doc: Documentation/cputopology.txt fix typo
  treewide: Convert retrun typos to return
  Fix comment typo for init_cma_reserved_pageblock
  Documentation/trace: Correcting and extending tracepoint documentation
  mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
  power: Documentation: Update s2ram link
  doc: fix a typo in Documentation/00-INDEX
  Documentation/printk-formats.txt: No casts needed for u64/s64
  doc: Fix typo "is is" in Documentations
  treewide: Fix printks with 0x%#
  zram: doc fixes
  Documentation/kmemcheck: update kmemcheck documentation
  doc: documentation/hwspinlock.txt fix typo
  PM / Hibernate: add section for resume options
  doc: filesystems : Fix typo in Documentations/filesystems
  scsi/megaraid fixed several typos in comments
  ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
  treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
  page_isolation: Fix a comment typo in test_pages_isolated()
  doc: fix a typo about irq affinity
  ...
2013-09-06 09:36:28 -07:00
Josh Durgin
c35455791c rbd: fix null dereference in dout
The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
the dout() always derefences it. Move this to another dout() protected
by a check that order is non-NULL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:08:51 -07:00
Josh Durgin
03507db631 rbd: fix buffer size for writes to images with snapshots
rbd_osd_req_create() needs to know the snapshot context size to create
a buffer large enough to send it with the message front. It gets this
from the img_request, which was not set for the obj_request yet. This
resulted in trying to write past the end of the front payload, hitting
this BUG:

libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len);

Fix this by associating the obj_request with its img_request
immediately after it's created, before the osd request is created.

Fixes: http://tracker.ceph.com/issues/5760
Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:08:46 -07:00
Josh Durgin
17c1cc1d92 rbd: fix I/O error propagation for reads
When a request returns an error, the driver needs to report the entire
extent of the request as completed.  Writes already did this, since
they always set xferred = length, but reads were skipping that step if
an error other than -ENOENT occurred.  Instead, rbd would end up
passing 0 xferred to blk_end_request(), which would always report
needing more data.  This resulted in an assert failing when more data
was required by the block layer, but all the object requests were
done:

[ 1868.719077] rbd: obj_request read result -108 xferred 0
[ 1868.719077]
[ 1868.719518] end_request: I/O error, dev rbd1, sector 0
[ 1868.719739]
[ 1868.719739] Assertion failure in rbd_img_obj_callback() at line 1736:
[ 1868.719739]
[ 1868.719739]   rbd_assert(more ^ (which == img_request->obj_request_count));

Without this assert, reads that hit errors would hang forever, since
the block layer considered them incomplete.

Fixes: http://tracker.ceph.com/issues/5647
CC: stable@vger.kernel.org  # v3.10
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:06:10 -07:00
Keith Busch
9d713c2bfb NVMe: Handle ioremap failure
Decrement the number of queues required for doorbell remapping until
the memory is successfully mapped for that size.

Additional checks are done so that we don't call free_irq if it has
already been freed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:44:25 -04:00
Keith Busch
cd63894630 NVMe: Add pci suspend/resume driver callbacks
Used for going in and out of low power states. Resuming reuses the IO
queues from the previous initialization, freeing any allocated queues
that are no longer usable.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:44:16 -04:00
Keith Busch
1894d8f16a NVMe: Use normal shutdown
The NVMe spec recommends using the shutdown normal sequence when safely
taking the controller offline instead of hitting CC.EN on the next
start-up to reset the controller. The spec recommends a minimum of 1
second for the shutdown complete. This patch waits 2 seconds to be on
the safe side.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:40:32 -04:00
Keith Busch
f0b50732a9 NVMe: Separate controller init from disk discovery
This combines the controller initialization into one function, removing
IO queue setup from namespace discovery, and creates symetric functions
for device removal. The controller start and shutdown functions can now
be called from resume/suspend context as well as probe/remove.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:40:28 -04:00
Keith Busch
2240427425 NVMe: Separate queue alloc/free from create/delete
This separates nvme queue allocation from creation, and queue deletion
from freeing. This is so that we may in the future temporarily disable
queues and reuse the same memory when bringing them back online, like
coming back from suspend state.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:39:28 -04:00
Keith Busch
0877cb0d28 NVMe: Group pci related actions in functions
This will make it easier to reuse these outside probe/remove.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:39:25 -04:00
Keith Busch
9e59d091b0 NVMe: Disk stats for read/write commands only
Flush and discard requests would previously mess up the accounting.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:33:35 -04:00
Keith Busch
7e03b12406 NVMe: Bring up cdev on set feature failure
This patch creates the character device as long as a device's admin queues
are usable so a user has an opprotunity to perform administration tasks.
A device may be in a state that does not allow IO and setting the queue
count feature in such a state returns an error. Previously the driver
would bail and the controller would be unusable.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2013-09-03 16:32:26 -04:00
Keith Busch
1b56749e54 NVMe: Fix checkpatch issues
Signed-off-by: Keith Busch <keith.busch@intel.com>
2013-09-03 16:32:26 -04:00
Matthew Wilcox
c3bfe7176c NVMe: Namespace IDs are unsigned
The 'Number of Namespaces' read from the device was being treated as
signed, which would cause us to not scan any namespaces for a device
with more than 2 billion namespaces.  That led to noticing that the
namespace ID was also being treated as signed, which could lead to the
result from NVME_IOCTL_ID being treated as an error code.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:32:26 -04:00
Linus Torvalds
542a086ac7 Driver core patches for 3.12-rc1
Here's the big driver core pull request for 3.12-rc1.
 
 Lots of tiny changes here fixing up the way sysfs attributes are
 created, to try to make drivers simpler, and fix a whole class race
 conditions with creations of device attributes after the device was
 announced to userspace.
 
 All the various pieces are acked by the different subsystem maintainers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.21 (GNU/Linux)
 
 iEYEABECAAYFAlIlIPcACgkQMUfUDdst+ynUMwCaAnITsxyDXYQ4DqEsz8EcOtMk
 718AoLrgnUZs3B+70AT34DVktg4HSThk
 =USl9
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core patches from Greg KH:
 "Here's the big driver core pull request for 3.12-rc1.

  Lots of tiny changes here fixing up the way sysfs attributes are
  created, to try to make drivers simpler, and fix a whole class race
  conditions with creations of device attributes after the device was
  announced to userspace.

  All the various pieces are acked by the different subsystem
  maintainers"

* tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
  firmware loader: fix pending_fw_head list corruption
  drivers/base/memory.c: introduce help macro to_memory_block
  dynamic debug: line queries failing due to uninitialized local variable
  sysfs: sysfs_create_groups returns a value.
  debugfs: provide debugfs_create_x64() when disabled
  rbd: convert bus code to use bus_groups
  firmware: dcdbas: use binary attribute groups
  sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
  driver core: add #include <linux/sysfs.h> to core files.
  HID: convert bus code to use dev_groups
  Input: serio: convert bus code to use drv_groups
  Input: gameport: convert bus code to use drv_groups
  driver core: firmware: use __ATTR_RW()
  driver core: core: use DEVICE_ATTR_RO
  driver core: bus: use DRIVER_ATTR_WO()
  driver core: create write-only attribute macros for devices and drivers
  sysfs: create __ATTR_WO()
  driver-core: platform: convert bus code to use dev_groups
  workqueue: convert bus code to use dev_groups
  MEI: convert bus code to use dev_groups
  ...
2013-09-03 11:37:15 -07:00
Greg Kroah-Hartman
b15a21ddda rbd: convert bus code to use bus_groups
The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead.  This converts the RBD bus code to use the
correct field.

Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Acked-by: Alex Elder <elder@linaro.org>
Cc: <ceph-devel@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-27 22:07:49 -07:00
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Sage Weil
ee3e542fec Merge remote-tracking branch 'linus/master' into testing 2013-08-15 11:11:45 -07:00
Ed Cashin
fb32975d1b aoe: adjust ref of head for compound page tails
Fix a BUG which can trigger when direct-IO is used with AOE.

As discussed previously, the fact that some users of the block layer
provide bios that point to pages with a zero _count means that it is not
OK for the network layer to do a put_page on the skb frags during an
skb_linearize, so the aoe driver gets a reference to pages in bios and
puts the reference before ending the bio.  And because it cannot use
get_page on a page with a zero _count, it manipulates the value
directly.

It is not OK to increment the _count of a compound page tail, though,
since the VM layer will VM_BUG_ON a non-zero _count.  Block users that
do direct I/O can result in the aoe driver seeing compound page tails in
bios.  In that case, the same logic works as long as the head of the
compound page is used instead of the tails.  This patch handles compound
pages and does not BUG.

It relies on the block layer user leaving the relationship between the
page tail and its head alone for the duration between the submission of
the bio and its completion, whether successful or not.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:48 -07:00
Jingoo Han
a158073c43 block: rbd: use NULL instead of 0
The local variables such as 'bio_list', and 'pages' are pointers;
thus, use NULL instead of 0 to fix the following sparse warnings.

drivers/block/rbd.c:2166:32: warning: Using plain integer as NULL pointer
drivers/block/rbd.c:2168:31: warning: Using plain integer as NULL pointer

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:40 -07:00
Linus Torvalds
d4c90b1b9f Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver bits from Jens Axboe:
 "As I mentioned in the core block pull request, due to real life
  circumstances the driver pull request would be late.  Now it looks
  like -rc2 late...  On the plus side, apart form the rsxx update, these
  are all things that I could argue could go in later in the cycle as
  they are fixes and not features.  So even though things are late, it's
  not ALL bad.

  The pull request contains:

   - Updates to bcache, all bug fixes, from Kent.

   - A pile of drbd bug fixes (no big features this time!).

   - xen blk front/back fixes.

   - rsxx driver updates, some of them deferred form 3.10.  So should be
     well cooked by now"

* 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
  bcache: Allocation kthread fixes
  bcache: Fix GC_SECTORS_USED() calculation
  bcache: Journal replay fix
  bcache: Shutdown fix
  bcache: Fix a sysfs splat on shutdown
  bcache: Advertise that flushes are supported
  bcache: check for allocation failures
  bcache: Fix a dumb race
  bcache: Use standard utility code
  bcache: Update email address
  bcache: Delete fuzz tester
  bcache: Document shrinker reserve better
  bcache: FUA fixes
  drbd: Allow online change of al-stripes and al-stripe-size
  drbd: Constants should be UPPERCASE
  drbd: Ignore the exit code of a fence-peer handler if it returns too late
  drbd: Fix rcu_read_lock balance on error path
  drbd: fix error return code in drbd_init()
  drbd: Do not sleep inside rcu
  bcache: Refresh usage docs
  ...
2013-07-22 19:02:52 -07:00
Linus Torvalds
72c1c2f1be Merge branch 'next' of git://git.monstr.eu/linux-2.6-microblaze
Pull microblaze update from Michal Simek:
 "This Microblaze merge window is quite minimal.

  I have also added to my branch one xilinx systemace sparse fix because
  haven't got any reply from block maintainer."

* 'next' of git://git.monstr.eu/linux-2.6-microblaze:
  xilinx systemace: Fix sparse warnings
  microblaze: Move __NR_syscalls from uapi
  microblaze: Enable KGDB in defconfig
  microblaze: Don't mark arch_kgdb_ops as const.
2013-07-10 10:16:07 -07:00
Michal Simek
4937a269cb xilinx systemace: Fix sparse warnings
Fix sysace sparse warnings.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2013-07-10 07:47:12 +02:00
Linus Torvalds
9a5889ae1c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is some follow-on RBD cleanup after the last window's code drop,
  a series from Yan fixing multi-mds behavior in cephfs, and then a
  sprinkling of bug fixes all around.  Some warnings, sleeping while
  atomic, a null dereference, and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (36 commits)
  libceph: fix invalid unsigned->signed conversion for timespec encoding
  libceph: call r_unsafe_callback when unsafe reply is received
  ceph: fix race between cap issue and revoke
  ceph: fix cap revoke race
  ceph: fix pending vmtruncate race
  ceph: avoid accessing invalid memory
  libceph: Fix NULL pointer dereference in auth client code
  ceph: Reconstruct the func ceph_reserve_caps.
  ceph: Free mdsc if alloc mdsc->mdsmap failed.
  ceph: remove sb_start/end_write in ceph_aio_write.
  ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.
  ceph: fix sleeping function called from invalid context.
  ceph: move inode to proper flushing list when auth MDS changes
  rbd: fix a couple warnings
  ceph: clear migrate seq when MDS restarts
  ceph: check migrate seq before changing auth cap
  ceph: fix race between page writeback and truncate
  ceph: reset iov_len when discarding cap release messages
  ceph: fix cap release race
  libceph: fix truncate size calculation
  ...
2013-07-09 12:39:10 -07:00
Linus Torvalds
7f0ef0267e Merge branch 'akpm' (updates from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 - various misc bits
 - I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
   distracted.  There has been quite a bit of activity.
 - About half the MM queue
 - Some backlight bits
 - Various lib/ updates
 - checkpatch updates
 - zillions more little rtc patches
 - ptrace
 - signals
 - exec
 - procfs
 - rapidio
 - nbd
 - aoe
 - pps
 - memstick
 - tools/testing/selftests updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits)
  tools/testing/selftests: don't assume the x bit is set on scripts
  selftests: add .gitignore for kcmp
  selftests: fix clean target in kcmp Makefile
  selftests: add .gitignore for vm
  selftests: add hugetlbfstest
  self-test: fix make clean
  selftests: exit 1 on failure
  kernel/resource.c: remove the unneeded assignment in function __find_resource
  aio: fix wrong comment in aio_complete()
  drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
  drivers/memstick/host/r592.c: convert to module_pci_driver
  drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
  pps-gpio: add device-tree binding and support
  drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
  drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
  drivers/parport/share.c: use kzalloc
  Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
  aoe: update internal version number to v83
  aoe: update copyright date
  aoe: perform I/O completions in parallel
  ...
2013-07-03 17:12:13 -07:00
Ed Cashin
94ac11833f aoe: update internal version number to v83
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Ed Cashin
ca47bbd93c aoe: update copyright date
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Ed Cashin
8030d34397 aoe: perform I/O completions in parallel
Some users have a large AoE target while others like to use many AoE
targets at the same time.  In the latter case, there is an opportunity to
greatly improve aggregate throughput by allowing different threads to
complete the I/O associated with each target.  For 36 targets, 4 KiB read
throughput roughly doubles, for example, with these changes in place.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Paul Clements
c378f70adb nbd: correct disconnect behavior
Currently, when a disconnect is requested by the user (via NBD_DISCONNECT
ioctl) the return from NBD_DO_IT is undefined (it is usually one of
several error codes).  This means that nbd-client does not know if a
manual disconnect was performed or whether a network error occurred.
Because of this, nbd-client's persist mode (which tries to reconnect after
error, but not after manual disconnect) does not always work correctly.

This change fixes this by causing NBD_DO_IT to always return 0 if a user
requests a disconnect.  This means that nbd-client can correctly either
persist the connection (if an error occurred) or disconnect (if the user
requested it).

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Acked-by: Rob Landley <rob@landley.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Michal Belczyk
9532f149ee nbd: remove bogus BUG_ON in NBD_CLEAR_QUE
The NBD_CLEAR_QUE ioctl has been deprecated for quite some time (its job
is now done by two other ioctls).  We should stop trying to make bogus
assertions in it.  Also, user-level code should remove calls to
NBD_CLEAR_QUE, ASAP.

Signed-off-by: Michal Belczyk <belczyk@bsd.krakow.pl>
Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Kees Cook
f170168b9a drivers: avoid parsing names as kthread_run() format strings
Calling kthread_run with a single name parameter causes it to be handled
as a format string. Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:41 -07:00
Kees Cook
ffc8b30866 block: do not pass disk names as format strings
Disk names may contain arbitrary strings, so they must not be
interpreted as format strings.  It seems that only md allows arbitrary
strings to be used for disk names, but this could allow for a local
memory corruption from uid 0 into ring 0.

CVE-2013-2851

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Sage Weil
e976cad0f0 rbd: fix a couple warnings
gcc isn't quite smart enough and generates these warnings:

drivers/block/rbd.c: In function 'rbd_img_request_fill':
drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here
drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized]

even though they are initialized for their respective code paths.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:50 -07:00
Alex Elder
d552c6191b rbd: take a little credit
Add a name to the list of authors.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:44 -07:00
Alex Elder
cfbf6377b6 rbd: use rwsem to protect header updates
Updating an image header needs to be protected to ensure it's
done consistently.  However distinct headers can be updated
concurrently without a problem.  Instead of using the global
control lock to serialize headder updates, just rely on the header
semaphore.  (It's already used, this just moves it out to cover
a broader section of the code.)

That leaves the control mutex protecting only the creation of rbd
clients, so rename it.

This resolves:
    http://tracker.ceph.com/issues/5222

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:43 -07:00
Alex Elder
1ba0f1e797 rbd: don't hold ctl_mutex to get/put device
When an rbd device is first getting mapped, its device registration
is protected the control mutex.  There is no need to do that though,
because the device has already been assigned an id that's guaranteed
to be unique.

An unmap of an rbd device won't proceed if the device has a non-zero
open count or is already being unmapped.  So there's no need to hold
the control mutex in that case either.

Finally, an rbd device can't be opened if it is being removed, and
it won't go away if there is a non-zero open count.  So here too
there's no need to hold the control mutex while getting or putting a
reference to an rbd device's Linux device structure.

Drop the mutex calls in these cases.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:42 -07:00
Alex Elder
82a442d239 rbd: protect against concurrent unmaps
Make sure two concurrent unmap operations on the same rbd device
won't collide, by only proceeding with the removal and cleanup of a
device if is not already underway.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:41 -07:00
Alex Elder
751cc0e3cf rbd: set removing flag while holding list lock
When unmapping a device, its id is supplied, and that is used to
look up which rbd device should be unmapped.  Looking up the
device involves searching the rbd device list while holding
a spinlock that protects access to that list.

Currently all of this is done under protection of the control lock,
but that protection is going away soon.  To ensure the rbd_dev is
still valid (still on the list) while setting its REMOVING flag, do
so while still holding the list lock.  To do so, get rid of
__rbd_get_dev(), and open code what it did in the one place it
was used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:41 -07:00
Alex Elder
08f75463c1 rbd: protect against duplicate client creation
If more than one rbd image has the same ceph cluster configuration
(same options, same set of monitors, same keys) they normally share
a single rbd client.

When an image is getting mapped, rbd looks to see if an existing
client can be used, and creates a new one if not.

The lookup and creation are not done under a common lock though, so
mapping two images concurrently could lead to duplicate clients
getting set up needlessly.  This isn't a major problem, but it's
wasteful and different from what's intended.

This patch fixes that by using the control mutex to protect
both the lookup and (if needed) creation of the client.  It
was previously used just when creating.

This resolves:
    http://tracker.ceph.com/issues/3094

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:39 -07:00
Alex Elder
3b5cf2a2f1 rbd: clean up a few things in the refresh path
This includes a few relatively small fixes I found while examining
the code that refreshes image information.

This resolves:
    http://tracker.ceph.com/issues/5040

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:38 -07:00
Alex Elder
e215605417 rbd: flush dcache after zeroing page data
Neither zero_bio_chain() nor zero_pages() contains a call to flush
caches after zeroing a portion of a page.  This can cause problems
on architectures that have caches that allow virtual address
aliasing.

This resolves:
    http://tracker.ceph.com/issues/4777

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:37 -07:00
Linus Torvalds
baa6f82093 Simple warning fix for module sections. If too late to pull, no big deal.
Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0NO+AAoJENkgDmzRrbjxS0YP/RoI5WguUDVx8tIdS29pwe9L
 Z5U0uIDUBZkNftlvvkdqdGjU1MeejxoMIs6cMz6ScY6YEI7li1AzuNApRadflnMx
 Do9u999NylI6q2oCLrt9y/eliJ/p5hkeCbhDrGdCgB7rPCjEAp/+Eu9Kq7FO8AYr
 Qu0rpLjvJ/5c5lsLAmUn95lZZ/uc/EPHasfZDiR8eGHRoAvPeyJ5qR06MpHFcl33
 2EKwR8RCYP9yUdqWnSKlC4LTuIpNK/BjEpHcQK2UPSn49vR8B3gVz5VXNP/7nIYQ
 hCr3VGoEOBoAlsD38IuKycXQ7vmBTU2hq7bQPq7Z6wyH96Sb6xFmVqT1Dm7uAI4+
 3OU69ugQb7WM3hQjeRRFvn5WjPqu8THwJKU7vhEf+xA09+v09QFjbHoEFabrFDlo
 MqVVhelkT6Mo8bBrxl8sZc/CG+aSl/kLQfyXWIcR2MQEAlfYKOYuDz+5TXtMZHaR
 SwlxgE2wbPP54v27yQWfBWY4d7dtJ+OX25GW1yzJdsNUFg7fQ9zzMnrv1prsUlog
 bxs4P0JcBOMLZx2IWm0R2kmbRFYjPgj6Kr3VnrIhZGesG1cwQB2QNvEPEkCtl6kq
 8HdX8Tua3dpfKAhDxnlztpVcJUmTuLn95jMWjW1KhCcjOqqJbaElq9ubplP2eMtr
 7zlfzFq9RsBKG0+J98Gt
 =vZYj
 -----END PGP SIGNATURE-----
mergetag object b3087e48ce
 type commit
 tag virtio-next-for-linus
 tagger Rusty Russell <rusty@rustcorp.com.au> 1372639977 +0930
 
 Was away, but it's all trivial and been sitting in linux-next.  So if you don't
 pull, no electrons will be harmed.
 
 Thanks,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0NMjAAoJENkgDmzRrbjxf3EQALoIAuQAjiEcHTGkLd/hGMJk
 mDiN14CXNH0siKASlCfcAxudJxKfIRM1tYzMFV1e7DtC7ZAH43m/SYBmXC4hRpIz
 nHsv57xtUg8bGZgb2OZKX/L0rqiE5wxtK7ZWGIPh3aE4uZKif+YvSBNJniO/HRTl
 UVDbrp+sbZzbT8NIHQ8d2vdtfG7rCNRkVaaETxk1wguu2TGJkyt0xuU3zT54We0L
 M8rJuHyTplgEMuDUydmNi9uNPZJZZY7QqoZ7s+ubXELo2FcFeyqWP2oHZnywESok
 fWOfhgdgIgmTED/eXS9wGwB7+K6azKg2chA23QZ91ncX/Ts141sEAd9ezwuQZc+5
 VS/Lo2mRvWbEcG78uYGUz+ukltZv0Og6uOvfSfCrALcVA4pYNFJTjbRu+Io+qAml
 082byeBUT/eurZr7xE30b4aTvZV2bYfqFkVynjF9ugNvTTjMiruw8uAzep6DD1gl
 38XRwtrk6MHLAFuxsyNV1FgunIWB6zvXNjMTUUVDLAQCbZwxZArr9gDd4zUG9VTS
 EbA9TIulSMMLUve5zZQyByqRLlzmzDVghMOZVbYUOQN4uNxbAPoXZ87tpiRhk8Oc
 XW+LRABjwZOjiQ9SDWRxxb8p2WTVyjuwIOTE0pLCuEBw6RFmjGga+PzH9rDygJnF
 zEIbVDRT9Jj7rOtxSabE
 =p6z7
 -----END PGP SIGNATURE-----

Merge tags 'modules-next-for-linus' and 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull trivial module and virtio fixes from Rusty Russell.

Apparently these were meant for 3.10, but came in after the release.

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost.c: Add .text.unlikely to TEXT_SECTIONS

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  virtio: remove virtqueue_add_buf().
  lguest: rename i386_head.S
  virtio_blk: Add missing 'static' qualifiers
  virtio: console: Add emergency writeonly register to config space
  virtio_pci: better macro exported in uapi
2013-07-03 13:09:06 -07:00
Linus Torvalds
0e97456ab5 Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven.

* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k/q40: Enable PC parallel port in defconfig
  m68k/q40: Undefine insl/outsl before redefining them
  m68k/uaccess: Fix asm constraints for userspace access
  swim: Release memory region after incorrect return/goto
  m68k/irq: Vector ints need a valid interrupt handler
  m68k/math-emu: unsigned issue, 'unsigned long' will never be less than zero
  m68k: remove CONFIG_EARLY_PRINTK dependency on CONFIG_EMBEDDED, default to n
  m68k/sun3: remove inline marking of EXPORT_SYMBOL functions
  [SCSI] a3000: use module_platform_driver_probe()
  [SCSI] a4000t: use module_platform_driver_probe()
  m68k: Remove inline strcpy() and strcat() implementations
2013-07-03 11:11:23 -07:00
Linus Torvalds
63580e51bb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS patches (part 1) from Al Viro:
 "The major change in this pile is ->readdir() replacement with
  ->iterate(), dealing with ->f_pos races in ->readdir() instances for
  good.

  There's a lot more, but I'd prefer to split the pull request into
  several stages and this is the first obvious cutoff point."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (67 commits)
  [readdir] constify ->actor
  [readdir] ->readdir() is gone
  [readdir] convert ecryptfs
  [readdir] convert coda
  [readdir] convert ocfs2
  [readdir] convert fatfs
  [readdir] convert xfs
  [readdir] convert btrfs
  [readdir] convert hostfs
  [readdir] convert afs
  [readdir] convert ncpfs
  [readdir] convert hfsplus
  [readdir] convert hfs
  [readdir] convert befs
  [readdir] convert cifs
  [readdir] convert freevxfs
  [readdir] convert fuse
  [readdir] convert hpfs
  reiserfs: switch reiserfs_readdir_dentry to inode
  reiserfs: is_privroot_deh() needs only directory inode, actually
  ...
2013-07-02 09:28:37 -07:00
Jens Axboe
5f0e5afa0d Linux 3.10-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQEbBAABAgAGBQJRxf9cAAoJEHm+PkMAQRiGMWkH911xM4gRmFgE7SqVW4F4AWBm
 ngcqMqNy9IdqKfibORUUDvVfEa5gjD5ai2quIKpfQiaukbpQJ696H90ijuAkajLn
 DQBrN243s0pzhhc/quWINnWxsFQ613JjdUMUMaD7e9A1aKjYzWrPGt/tSjrFXGCP
 tArTupVzc/iOmnEQDKiROI/Nokq44QJ36aTGPM7n08xMtpKmkCXM+9/UosBteB0O
 HVI33dmjwz7i55fI53XAWyuZCE+gSEnA4z8spJ9LfXso2W14V+roc+GuL6OyeeTI
 pCn/+4niVPb4B0ROZlpyVmdZjbPPcMMEK5o+BSJI68SH6LHZTQh2iVuqYfpSyA==
 =uUH5
 -----END PGP SIGNATURE-----

Merge tag 'v3.10-rc7' into for-3.11/drivers

Linux 3.10-rc7

Pull this in early to avoid doing it with the bcache merge,
since there are a number of changes to bcache between my old
base (3.10-rc1) and the new pull request.
2013-07-02 08:31:48 +02:00
Alex Elder
912c317d46 rbd: drop original request earlier for existence check
The reference to the original request dropped at the end of
rbd_img_obj_exists_callback() corresponds to the reference taken
in rbd_img_obj_exists_submit() to account for the stat request
referring to it.  Move the put of that reference up right after
clearing that pointer to make its purpose more obvious.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-01 09:52:02 -07:00
Geert Uytterhoeven
491205a8b4 rbd: Use min_t() to fix comparison of distinct pointer types warning
drivers/block/rbd.c: In function ‘zero_pages’:
drivers/block/rbd.c:1102: warning: comparison of distinct pointer types lacks a cast

Remove the hackish casts and use min_t() to fix this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:01 -07:00
Linus Torvalds
bd2931b5cf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This is a recently spotted regression in the snapshot behavior...

  It turns out several tests weren't being run in the nightlies so this
  took a while to spot"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: send snapshot context with writes
2013-06-29 10:31:15 -07:00
Al Viro
83a8761142 move linux/loop.h to drivers/block
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:45 +04:00
Philipp Reisner
d752b26960 drbd: Allow online change of al-stripes and al-stripe-size
Allow to change the AL layout with an resize operation. For that
the reisze command gets two new fields: al_stripes and al_stripe_size.

In order to make the operation crash save:
1) Lock out all IO and MD-IO
2) Write the super block with MDF_PRIMARY_IND clear
3) write the bitmap to the new location (all zeros, since
   we allow only while connected)
4) Initialize the new AL-area
5) Write the super block with the restored MDF_PRIMARY_IND.
6) Unfreeze all IO

Since the AL-layout has no influence on the protocol, this operation
needs to be beforemed on both sides of a resource (if intended).

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Philipp Reisner
e96c96333f drbd: Constants should be UPPERCASE
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Philipp Reisner
28e448bb30 drbd: Ignore the exit code of a fence-peer handler if it returns too late
In case the connection was established and lost again before
the a fence-peer handler returns, ignore the exit code of this
instance. (And use the exit code of the later started instance)

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Andreas Gruenbacher
f9eb7bf424 drbd: Fix rcu_read_lock balance on error path
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Wei Yongjun
6110d70bdf drbd: fix error return code in drbd_init()
Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Andreas Gruenbacher
26ea8f9239 drbd: Do not sleep inside rcu
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Jens Axboe
f35546e072 Merge branch 'stable/for-jens-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers
Konrad writes:

It has the 'feature-max-indirect-segments' implemented in both backend
and frontend. The current problem with the backend and frontend is that the
segment size is limited to 11 pages. It means we can at most squeeze in 44kB per
request. The ring can hold 32 (next power of two below 36) requests, meaning we
can do 1.4M of outstanding requests. Nowadays that is not enough.

The problem in the past was addressed in two ways - but neither one went upstream.
The first solution to this proposed by Justin from Spectralogic was to negotiate
the segment size.  This means that the ‘struct blkif_sring_entry’ is now a variable size.
It can expand from 112 bytes (cover 11 pages of data - 44kB) to 1580 bytes
(256 pages of data - so 1MB). It is a simple extension by just making the array in the
request expand from 11 to a variable size negotiated. But it had limits: this extension
still limits the number of segments per request to 255 (as the total number must be
specified in the request, which only has an 8-bit field for that purpose).

The other solution (from Intel - Ronghui) was to create one extra ring that only has the
‘struct blkif_request_segment’ in them. The ‘struct blkif_request’ would be changed to have
an index in said ‘segment ring’. There is only one segment ring. This means that the size of
the initial ring is still the same. The requests would point to the segment and enumerate out
how many of the indexes it wants to use. The limit is of course the size of the segment.
If one assumes a one-page segment this means we can in one request cover ~4MB.

Those patches were posted as RFC and the author never followed up on the ideas on changing
it to be a bit more flexible.

There is yet another mechanism that could be employed  (which these patches implement) - and it
borrows from VirtIO protocol. And that is the ‘indirect descriptors’. This very similar to
what Intel suggests, but with a twist. The twist is to negotiate how many of these
'segment' pages (aka indirect descriptor pages) we want to support (in reality we negotiate
how many entries in the segment we want to cover, and we module the number if it is
bigger than the segment size).

This means that with the existing 36 slots in the ring (single page) we can cover:
32 slots * each blkif_request_indirect covers: 512 * 4096 ~= 64M. Since we ample space
in the blkif_request_indirect to span more than one indirect page, that number (64M)
can be also multiplied by eight = 512MB.

Roger Pau Monne took the idea and implemented them in these patches. They work
great and the corner cases (migration between backends with and without this extension)
work nicely. The backend has a limit right now off how many indirect entries
it can handle: one indirect page, and at maximum 256 entries (out of 512 - so  50% of the page
is used). That comes out to 32 slots * 256 entries in a indirect page * 1 indirect page
per request * 4096 = 32MB.

This is a conservative number that can change in the future. Right now it strikes
a good balance between giving excellent performance, memory usage in the backend, and
balancing the needs of many guests.

In the patchset there is also the split of the blkback structure to be per-VBD.
This means that the spinlock contention we had with many guests trying to do I/O and
all the blkback threads hitting the same lock has been eliminated.

Also there are bug-fixes to deal with oddly sized sectors, insane amounts on
th ring, and also a security fix (posted earlier).
2013-06-28 16:01:14 +02:00
Josh Durgin
d2d1f17a0d rbd: send snapshot context with writes
Sending the right snapshot context with each write is required for
snapshots to work. Due to the ordering of calls, the snapshot context
is never set for any requests. This causes writes to the current
version of the image to be reflected in all snapshots, which are
supposed to be read-only.

This happens because rbd_osd_req_format_write() sets the snapshot
context based on obj_request->img_request. At this point, however,
obj_request->img_request has not been set yet, to the snapshot context
is set to NULL. Fix this by moving rbd_img_obj_request_add(), which
sets obj_request->img_request, before the osd request formatting
calls.

This resolves:
    http://tracker.ceph.com/issues/5465

Reported-by: Karol Jurak <karol.jurak@gmail.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-06-27 05:55:29 -07:00
Linus Torvalds
78750f1908 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This fixes another problem with using v2 images on 3.10 due to the
  order in which fields are read from the image header.

  Hopefully this is the last one"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fetch object order before using it
2013-06-26 08:47:46 -10:00
Josh Durgin
1617e40c1e rbd: fetch object order before using it
rbd_dev_v2_header_onetime() fetches striping information, and
checks whether the image can be read by compariing the stripe unit
to the object size. It determines the object size by shifting
the object order, which is 0 at this point since it has not been
read yet. Move the call to get the image size and object order
before rbd_dev_v2_header_onetime() so it is set before use.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-25 12:27:31 -07:00
Roger Pau Monne
1e0f7a21b2 xen-blkback: check the number of iovecs before allocating a bios
With the introduction of indirect segments we can receive requests
with a number of segments bigger than the maximum number of allowed
iovecs in a bios, so make sure that blkback doesn't try to allocate a
bios with more iovecs than BIO_MAX_PAGES

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-25 10:00:58 -04:00
Matthew Wilcox
7d8224574c NVMe: Call nvme_process_cq from submission path
Since we have the queue locked, it makes sense to check if there are
any completion queue entries on the queue before we release the lock.
If there are, it may save an interrupt and reduce latency for the I/Os
that happened to complete.  This happens fairly often for some workloads.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 13:57:27 -04:00
Matthew Wilcox
bc57a0f7a4 NVMe: Remove "process_cq did something" message
I was originally intending to log the fact that the kthread had done
some work since it might help us find interrupt handling problems, but
that hasn't been done yet, and spamming the logs with this message is
just rude.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 13:57:08 -04:00
Joe Perches
957d6bf665 swim: Release memory region after incorrect return/goto
The code uses

	return foo;
	goto err_type;

when instead the form should have been

	ret = foo;
	goto err_type;

Here this causes a useful release_mem_region to be skipped.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Laurent Vivier <Laurent@Vivier.EU>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2013-06-24 19:44:19 +02:00
Matthew Wilcox
e9539f4752 NVMe: Return correct value from interrupt handler
The interrupt handler currently reports whether it found any new
completion queue entries.  If the completion queue is primarily being
processed by a method other than the interrupt handler, it may return
IRQ_NONE so often that Linux thinks that the interrupt is being falsely
triggered.

To solve this problem, report whether any completion queue entries have
been seen since the last interrupt was received for this queue.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 11:54:20 -04:00
Roger Pau Monne
294caaf29c xen-blkfront: set blk_queue_max_hw_sectors correctly
Now that indirect segments are enabled blk_queue_max_hw_sectors must
be set to match the maximum number of sectors we can handle in a
request.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-21 15:58:55 -04:00
Roger Pau Monne
2d9105433f xen-blkback: workaround compiler bug in gcc 4.1
The code generat with gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)
creates an unbound loop for the second foreach_grant_safe loop in
purge_persistent_gnt.

The workaround is to avoid having this second loop and instead
perform all the work inside the first loop by adding a new variable,
clean_used, that will be set when all the desired persistent grants
have been removed and we need to iterate over the remaining ones to
remove the WAS_ACTIVE flag.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Tom O'Neill <toneill@vmem.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-21 15:58:44 -04:00
Linus Torvalds
7ecba6f2f3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This fixes a problem preventing the kernel and userland librbd
  libraries from sharing data with the new format 2 images"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: use the correct length for format 2 object names
2013-06-21 06:27:40 -10:00
Keith Busch
6198221fa0 NVMe: Disk IO statistics
Add io stats accounting for bio requests so nvme block devices show
useful disk stats.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-20 12:06:35 -04:00
Matthew Wilcox
063a8096f3 NVMe: Restructure MSI / MSI-X setup
The current code copies 'nr_io_queues' into 'q_count', modifies
'nr_io_queues' during MSI-X setup, then resets 'nr_io_queues' for
MSI setup.  Instead, copy 'nr_io_queues' into 'vecs' and modify 'vecs'
during both MSI-X and MSI setup.

This lets us simplify the for-loops that set up MSI-X and MSI, and opens
the possibility of using more I/O queues than we have interrupt vectors,
should future benchmarking prove that to be a useful feature.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-20 11:09:23 -04:00
Tushar Behera
03ea83e9a3 NVMe: Use kzalloc instead of kmalloc+memset
Use kzalloc instead of kmalloc and a susbsequent memset.

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-19 13:24:27 -04:00
Philip J Kelleher
36f988e978 rsxx: Adding in debugfs entries.
Adding debugfs entries to help with debugging and testing and
testing code.

pci_regs:
       	This entry will spit out all of the data stored on the BAR.

stats:
       	This entry will display all of the driver stats for each
       	DMA channel.

cram:
	This will allow read/write ability to the CRAM address space
	on our adapter's CPU.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
62302508f2 rsxx: Fixes incorrect stats calculation.
Fixing incorrect stats calculation during read retries.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
b8b225da13 rsxx: Adding EEH check inside cregs timeout.
Unfortunaly, our CPU register path does not do any kind of
EEH error checking. So to fix this issue, an ioread32 was
added to the CPU register timeout code. This way, the
driver can check to see if the timeout was caused by an EEH
error or not. This is a dummy read.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
3eb8dcafb5 rsxx: Adapter address space sanity check.
Adding a sanity check to guarentee that DMAs outside of the device's
address space will be errored out right away.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
66bc600363 rsxx: Fixes DLPAR add kernel panic if partition still mounted.
A kernel panic would occur on a DLPAR add if there was a partition
still mounted during the DLPAR remove. This bug fix will allow the
user to unmount the partition and bring the driver back into a
good state after the DLPAR add.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
f730e3dc6d rsxx: Changing the adapter name to the official name.
Changing the adapter name from FlashSystem-80 to the official
name: Flash Adapter 900GB Full Height.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
fb065cd9e0 rsxx: Adding in sync_start module paramenter.
Before, the partition table would have to be reread because our
card was attached before it transistioned out of it's 'starting'
state.

This change will cause the driver to wait to attach the device
until the adapter is ready.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
7b379cc378 rsxx: Allow block size to be determined by configuration.
Previously, the block size was determined by whether or not
our Hardware could handle 512 byte accesses. Now, all of our
Hardware can handle 512 and 4096 block sizes.

This fix allows it to be user configurable.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
31a70bb444 rsxx: Fixes soft-lockup issues during DMAs.
The workqueue mechanism has been reworked to prevent soft
lockup issues from occuring by adding in mutex sychronization.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
0ab4743ebc rsxx: Restructured DMA cancel scheme.
Before, DMAs would never be cancelled if there was a data stall
or an EEH Permenant failure which would cause an unrecoverable
I/O hang.

The DMA cancellation mechanism has been modified to fix
these issues and allows DMAs to be cancelled during the
above mentioned events.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
a3299ab185 rsxx: Individual workqueues for interruptible events.
Giving all interrupt based events their own workqueue to complete
tasks on. This fixes a bug that would cause creg commands to timeout
if too many are issued at once.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Konrad Rzeszutek Wilk
8e3f875554 xen/blkback: Check for insane amounts of request on the ring (v6).
Check that the ring does not have an insane amount of requests
(more than there could fit on the ring).

If we detect this case we will stop processing the requests
and wait until the XenBus disconnects the ring.

The existing check RING_REQUEST_CONS_OVERFLOW which checks for how
many responses we have created in the past (rsp_prod_pvt) vs
requests consumed (req_cons) and whether said difference is greater or
equal to the size of the ring, does not catch this case.

Wha the condition does check if there is a need to process more
as we still have a backlog of responses to finish. Note that both
of those values (rsp_prod_pvt and req_cons) are not exposed on the
shared ring.

To understand this problem a mini crash course in ring protocol
response/request updates is in place.

There are four entries: req_prod and rsp_prod; req_event and rsp_event
to track the ring entries. We are only concerned about the first two -
which set the tone of this bug.

The req_prod is a value incremented by frontend for each request put
on the ring. Conversely the rsp_prod is a value incremented by the backend
for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when
pushing the responses on the ring).  Both values can
wrap and are modulo the size of the ring (in block case that is 32).
Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details.

The culprit here is that if the difference between the
req_prod and req_cons is greater than the ring size we have a problem.
Fortunately for us, the '__do_block_io_op' loop:

	rc = blk_rings->common.req_cons;
	rp = blk_rings->common.sring->req_prod;

	while (rc != rp) {

		..
		blk_rings->common.req_cons = ++rc; /* before make_response() */

	}

will loop up to the point when rc == rp. The macros inside of the
loop (RING_GET_REQUEST) is smart and is indexing based on the modulo
of the ring size. If the frontend has provided a bogus req_prod value
we will loop until the 'rc == rp' - which means we could be processing
already processed requests (or responses) often.

The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is
b/c it only tracks how many responses we have internally produced
and whether we would should process more. The astute reader will
notice that the macro RING_REQUEST_CONS_OVERFLOW provides two
arguments - more on this later.

For example, if we were to enter this function with these values:

       	blk_rings->common.sring->req_prod =  X+31415 (X is the value from
		the last time __do_block_io_op was called).
        blk_rings->common.req_cons = X
        blk_rings->common.rsp_prod_pvt = X

The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons)
is doing:

	req_cons - rsp_prod_pvt >= 32

Which is,
	X - X >= 32 or 0 >= 32

And that is false, so we continue on looping (this bug).

If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp
instead (sring->req_prod) of rc, the this macro can do the check:

     req_prod - rsp_prov_pvt >= 32

Which is,
       X + 31415 - X >= 32 , or 31415 >= 32

which is true, so we can error out and break out of the function.

Unfortunatly the difference between rsp_prov_pvt and req_prod can be
at 32 (which would error out in the macro). This condition exists when
the backend is lagging behind with the responses and still has not finished
responding to all of them (so make_response has not been called), and
the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able
to use said macro.

Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does
a simple check of:

    req_prod - rsp_prod_pvt > RING_SIZE

And with the X values from above:

   X + 31415 - X > 32

Returns true. Also not that if the ring is full (which is where
the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the
same condition:

   X + 32 - X > 32

Which is false.

Lets use that macro.
Note that in v5 of this patchset the macro was different - we used an
earlier version.

Cc: stable@vger.kernel.org
[v1: Move the check outside the loop]
[v2: Add a pr_warn as suggested by David]
[v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
[v4: Move wake_up after kthread_stop as suggested by Jan]
[v5: Use RING_REQUEST_PROD_OVERFLOW instead]
[v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

gadsa
2013-06-17 15:17:16 -04:00
Josh Durgin
3a96d5cd7b rbd: use the correct length for format 2 object names
Format 2 objects use 16 characters for the object name suffix to be
able to express the full 64-bit range of object numbers. Format 1
images only use 12 characters for this. Using 12-character names for
format 2 caused userspace and kernel rbd clients to read differently
named objects, which made an image written by one client look empty to
the other client.

CC: stable@vger.kernel.org  # 3.9+
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-13 08:46:15 -07:00
Linus Torvalds
b2cc9c19e4 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Outside of bcache (which really isn't super big), these are all
  few-liners.  There are a few important fixes in here:

   - Fix blk pm sleeping when holding the queue lock

   - A small collection of bcache fixes that have been done and tested
     since bcache was included in this merge window.

   - A fix for a raid5 regression introduced with the bio changes.

   - Two important fixes for mtip32xx, fixing an oops and potential data
     corruption (or hang) due to wrong bio iteration on stacked devices."

* 'for-linus' of git://git.kernel.dk/linux-block:
  scatterlist: sg_set_buf() argument must be in linear mapping
  raid5: Initialize bi_vcnt
  pktcdvd: silence static checker warning
  block: remove refs to XD disks from documentation
  blkpm: avoid sleep when holding queue lock
  mtip32xx: Correctly handle bio->bi_idx != 0 conditions
  mtip32xx: Fix NULL pointer dereference during module unload
  bcache: Fix error handling in init code
  bcache: clarify free/available/unused space
  bcache: drop "select CLOSURES"
  bcache: Fix incompatible pointer type warning
2013-06-12 16:42:39 -07:00
Linus Torvalds
a568fa1c91 Merge branch 'akpm' (updates from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "Bunch of fixes and one little addition to math64.h"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
  include/linux/math64.h: add div64_ul()
  mm: memcontrol: fix lockless reclaim hierarchy iterator
  frontswap: fix incorrect zeroing and allocation size for frontswap_map
  kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
  mm: migration: add migrate_entry_wait_huge()
  ocfs2: add missing lockres put in dlm_mig_lockres_handler
  mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
  drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
  aio: fix io_destroy() regression by using call_rcu()
  rtc-at91rm9200: use shadow IMR on at91sam9x5
  rtc-at91rm9200: add shadow interrupt mask
  rtc-at91rm9200: refactor interrupt-register handling
  rtc-at91rm9200: add configuration support
  rtc-at91rm9200: add match-table compile guard
  fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
  swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
  drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
  cciss: fix broken mutex usage in ioctl
  audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
  drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
  ...
2013-06-12 16:29:53 -07:00
Stephen M. Cameron
03f47e888d cciss: fix broken mutex usage in ioctl
If a new logical drive is added and the CCISS_REGNEWD ioctl is invoked
(as is normal with the Array Configuration Utility) the process will
hang as below.  It attempts to acquire the same mutex twice, once in
do_ioctl() and once in cciss_unlocked_open().  The BKL was recursive,
the mutex isn't.

  Linux version 3.10.0-rc2 (scameron@localhost.localdomain) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri May 24 14:32:12 CDT 2013
  [...]
  acu             D 0000000000000001     0  3246   3191 0x00000080
  Call Trace:
    schedule+0x29/0x70
    schedule_preempt_disabled+0xe/0x10
    __mutex_lock_slowpath+0x17b/0x220
    mutex_lock+0x2b/0x50
    cciss_unlocked_open+0x2f/0x110 [cciss]
    __blkdev_get+0xd3/0x470
    blkdev_get+0x5c/0x1e0
    register_disk+0x182/0x1a0
    add_disk+0x17c/0x310
    cciss_add_disk+0x13a/0x170 [cciss]
    cciss_update_drive_info+0x39b/0x480 [cciss]
    rebuild_lun_table+0x258/0x370 [cciss]
    cciss_ioctl+0x34f/0x470 [cciss]
    do_ioctl+0x49/0x70 [cciss]
    __blkdev_driver_ioctl+0x28/0x30
    blkdev_ioctl+0x200/0x7b0
    block_ioctl+0x3c/0x40
    do_vfs_ioctl+0x89/0x350
    SyS_ioctl+0xa1/0xb0
    system_call_fastpath+0x16/0x1b

This mutex usage was added into the ioctl path when the big kernel lock
was removed.  As it turns out, these paths are all thread safe anyway
(or can easily be made so) and we don't want ioctl() to be single
threaded in any case.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Miller <mike.miller@hp.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:45 -07:00
Linus Torvalds
8d7a8fe2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a pair of fixes for double-frees in the recent bundle for
  3.10, a couple of fixes for long-standing bugs (sleep while atomic and
  an endianness fix), and a locking fix that can be triggered when osds
  are going down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup in rbd_add()
  rbd: don't destroy ceph_opts in rbd_add()
  ceph: ceph_pagelist_append might sleep while atomic
  ceph: add cpu_to_le32() calls when encoding a reconnect capability
  libceph: must hold mutex for reset_changed_osds()
2013-06-12 08:28:19 -07:00
Linus Torvalds
77293e215e Merge branch 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvme
Pull NVMe fixes from Matthew Wilcox.

* 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Add MSI support
  NVMe: Use dma_set_mask() correctly
  Return the result from user admin command IOCTL even in case of failure
  NVMe: Do not cancel command multiple times
  NVMe: fix error return code in nvme_submit_bio_queue()
  NVMe: check for integer overflow in nvme_map_user_pages()
  MAINTAINERS: update NVM EXPRESS DRIVER file list
  NVMe: Fix a signedness bug in nvme_trans_modesel_get_mp
  NVMe: Remove redundant version.h header include
2013-06-11 23:07:21 -07:00
Konrad Rzeszutek Wilk
604c499cbb xen/blkback: Check device permissions before allowing OP_DISCARD
We need to make sure that the device is not RO or that
the request is not past the number of sectors we want to
issue the DISCARD operation for.

This fixes CVE-2013-2140.

Cc: stable@vger.kernel.org
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
[v1: Made it pr_warn instead of pr_debug]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-07 17:05:55 -04:00
Stefan Bader
7c4d7d710f xen/blkback: Use physical sector size for setup
Currently xen-blkback passes the logical sector size over xenbus and
xen-blkfront sets up the paravirt disk with that logical block size.
But newer drives usually have the logical sector size set to 512 for
compatibility reasons and would show the actual sector size only in
physical sector size.
This results in the device being partitioned and accessed in dom0 with
the correct sector size, but the guest thinks 512 bytes is the correct
block size. And that results in poor performance.

To fix this, blkback gets modified to pass also physical-sector-size
over xenbus and blkfront to use both values to set up the paravirt
disk. I did not just change the passed in sector-size because I am
not sure having a bigger logical sector size than the physical one
is valid (and that would happen if a newer dom0 kernel hits an older
domU kernel). Also this way a domU set up before should still be
accessible (just some tools might detect the unaligned setup).

[v2: Make xenbus write failure non-fatal]
[v3: Use xenbus_scanf instead of xenbus_gather]
[v4: Rebased against segment changes]

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-07 17:05:53 -04:00
Konrad Rzeszutek Wilk
2d5dc3ba85 xen-blkfront: Introduce a 'max' module parameter to alter the amount of indirect segments.
The max module parameter (by default 32) is the maximum number of
segments that the frontend will negotiate with the backend for indirect
descriptors.  Higher value means more potential throughput but more
memory usage. The backend picks the minimum of the frontend and its
default backend value.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-04 15:58:35 -04:00
Ramachandra Rao Gajula
fa08a39664 NVMe: Add MSI support
Some devices only have support for MSI, not MSI-X.  While MSI is more
limited, it still provides better performance than line-based interrupts.

Signed-off-by: Ramachandra Gajula <rama@fastorsystems.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-31 11:45:52 -04:00
Jens Axboe
b02383ea20 Merge branch 'for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-linus
Jiri writes:

please pull from

  git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block.git for-jens

to receive one pktcdvd fix. It fixes a highly theoretical issue with using.
pktcdvd to work with media that'd be larger than 2TB :) But it's a correct.
fix and makes static checkers shut up about improperly cleaning upper.
32bits.
2013-05-29 21:21:23 +02:00
Dan Carpenter
104529661b pktcdvd: silence static checker warning
Static checkers complain about widening the binary "not" operations here
because sectors are u64 and "(pd)->settings.size" is unsigned int.
It unintentionally clears the high 32 bits of the sector.  This means
that the driver won't work for devices with over 2TB of space.  Since
this is a DVD drive, we're unlikely to reach that limit, but we may as
well silence the warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-29 15:36:22 +02:00
Matthew Wilcox
cf9f123b38 NVMe: Use dma_set_mask() correctly
In some circumstances setting a 64-bit DMA mask can fail, as explained
in Documentation/DMA-API-HOWTO.txt.  Use the recommended code sequence
to set a 32-bit DMA mask if setting a 64-bit DMA mask fails.

Reported-by: Chayan Biswas <Chayan.Biswas@sandisk.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-28 16:46:46 -04:00
Brian Behlendorf
dfd20b2b17 drivers/block/brd.c: fix brd_lookup_page() race
The index on the page must be set before it is inserted in the radix
tree.  Otherwise there is a small race which can occur during lookup
where the page can be found with the incorrect index.  This will trigger
the BUG_ON() in brd_lookup_page().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Chris Wedgwood <cw@f00f.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:52 -07:00
Gernot Vormayr
585dc0c2f6 drivers/block/xsysace.c: fix id with missing port-number
If the port number is missing from the device-tree the device gets named
xs` instead of xsa.  This fixes the check for missing ids.

Tested on ml507 board.

Signed-off-by: Gernot Vormayr <gvormayr@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:50 -07:00
Chayan Biswas
cf90bc4830 Return the result from user admin command IOCTL even in case of failure
We copy the result to user if the command is completed from the
controller even if it completes with failure (non-zero) status.
A return status of < 0 indicates the command was not completed
by the controller. The user application may expect the error code
in the result field in case of failure.

Signed-off-by: Chayan Biswas <Chayan.Biswas@sandisk.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-23 13:38:59 -04:00
Jonghwan Choi
2a647bfe1b virtio_blk: Add missing 'static' qualifiers
Add missing 'static' qualifiers

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-05-20 12:09:23 +09:30
Alex Elder
3abef3b358 rbd: fix cleanup in rbd_add()
Bjorn Helgaas pointed out that a recent commit introduced a
use-after-free condition in an error path for rbd_add().
He correctly stated:

    I think b536f69a3a "rbd: set up devices only for mapped images"
    introduced a use-after-free error in rbd_add():
	...
    If rbd_dev_device_setup() returns an error, we call
    rbd_dev_image_release(), which ultimately kfrees rbd_dev.
    Then we call rbd_dev_destroy(), which references fields in
    the already-freed rbd_dev struct before kfreeing it again.

The simple fix is to return the error code after the call to
rbd_dev_image_release().

Closer examination revealed that there's no need to clean up
rbd_opts in that function, so fix that too.

Update some other comments that have also become out of date.

Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17 12:50:10 -05:00
Alex Elder
7262cfca43 rbd: don't destroy ceph_opts in rbd_add()
Whether rbd_client_create() successfully creates a new client or
not, it takes responsibility for getting the ceph_opts structure
it's passed destroyed.  If successful, the structure becomes
associated with the created client; if not, rbd_client_create()
will destroy it.

Previously, rbd_get_client() would call ceph_destroy_options()
if rbd_get_client() failed, and that meant it got called twice.
That led freeing various pointers more than once, which is never a
good idea.

This resolves:
    http://tracker.ceph.com/issues/4559

Cc: stable@vger.kernel.org # 3.8+
Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17 12:50:03 -05:00
Keith Busch
053ab702cc NVMe: Do not cancel command multiple times
Cancelling an already cancelled command does not do anything, so check
the command context before cancelling it, continuing if had already been
cancelled so we do not log the same problem every second if a device
stops responding.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:18:38 -04:00
Wei Yongjun
1287dabd34 NVMe: fix error return code in nvme_submit_bio_queue()
nvme_submit_flush_data() might overwrite the initialisation of the
return value with 0, so move the -ENOMEM setting close to the usage.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:13:18 -04:00
Dan Carpenter
5460fc0310 NVMe: check for integer overflow in nvme_map_user_pages()
You need to have CAP_SYS_ADMIN to trigger this overflow but it makes the
static checkers complain so we should fix it.  The worry is that
"length" comes from copy_from_user() so we need to check that "length +
offset" can't overflow.

I also changed the min_t() cast to be unsigned instead of signed.  Now
that we cap "length" to INT_MAX it doesn't make a difference, but it's a
little easier for reviewers to know that large values aren't cast to
negative.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:11:03 -04:00
Vishal Verma
710a143dd8 NVMe: Fix a signedness bug in nvme_trans_modesel_get_mp
nvme_trans_modesel_get_mp() was defined with a unsigned return
type, but can return signed values.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:10:38 -04:00
Sachin Kamat
76e0310c2d NVMe: Remove redundant version.h header include
version.h header inclusion is not necessary as detected by
checkversion.pl.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:10:05 -04:00
Linus Torvalds
109c3c0292 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "Yes, this is a much larger pull than I would like after -rc1.  There
  are a few things included:

   - a few fixes for leaks and incorrect assertions
   - a few patches fixing behavior when mapped images are resized
   - handling for cloned/layered images that are flattened out from
     underneath the client

  The last bit was non-trivial, and there is some code movement and
  associated cleanup mixed in.  This was ready and was meant to go in
  last week but I missed the boat on Friday.  My only excuse is that I
  was waiting for an all clear from the testing and there were many
  other shiny things to distract me.

  Strictly speaking, handling the flatten case isn't a regression and
  could wait, so if you like we can try to pull the series apart, but
  Alex and I would much prefer to have it all in as it is a case real
  users will hit with 3.10."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (33 commits)
  rbd: re-submit flattened write request (part 2)
  rbd: re-submit write request for flattened clone
  rbd: re-submit read request for flattened clone
  rbd: detect when clone image is flattened
  rbd: reference count parent requests
  rbd: define parent image request routines
  rbd: define rbd_dev_unparent()
  rbd: don't release write request until necessary
  rbd: get parent info on refresh
  rbd: ignore zero-overlap parent
  rbd: support reading parent page data for writes
  rbd: fix parent request size assumption
  libceph: init sent and completed when starting
  rbd: kill rbd_img_request_get()
  rbd: only set up watch for mapped images
  rbd: set mapping read-only flag in rbd_add()
  rbd: support reading parent page data
  rbd: fix an incorrect assertion condition
  rbd: define rbd_dev_v2_header_info()
  rbd: get rid of trivial v1 header wrappers
  ...
2013-05-15 13:36:19 -07:00
Sam Bradshaw
093c959307 mtip32xx: Correctly handle bio->bi_idx != 0 conditions
Stacking drivers may append bvecs to existing bio's, resulting
in non-zero bi_idx conditions.  This patch counts the loops of
bio_for_each_segment() rather than inheriting the bi_idx value
to pass as a segment count to the hardware submission routine.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-15 10:09:05 +02:00
Sam Bradshaw
974a51a245 mtip32xx: Fix NULL pointer dereference during module unload
An open file-handle to one or more of the driver exported debugfs
nodes causes raciness in recursive removal during module unload;
sometimes a stale parent dentry is dereferenced when more than 1
pci device is present.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-15 10:04:34 +02:00
Alex Elder
638f5abed3 rbd: re-submit flattened write request (part 2)
Add code to rbd_img_obj_exists_callback() to detect when a clone's
parent image has disappeared, and re-submit the original write
request in that case.

Kill off some redundant assertions.

This completes the resolution for:
    http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:46 -05:00
Alex Elder
bbea1c1a31 rbd: re-submit write request for flattened clone
Add code to rbd_img_parent_read_full_callback() to detect when a
clone's parent image has disappeared, and re-submit the original
write request in that case.  (See the previous commit for more
reasoning about why this is appropriate.)

Rename some variables in rbd_img_obj_parent_read_full_callback()
to match the convention used in the previous patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
02c74fbad9 rbd: re-submit read request for flattened clone
If a clone image gets flattened while a parent read request is
underway, the original rbd object request needs to be resubmitted.

The reason is that by the time we get the response to the parent
read request, the data read from the parent may be out of date.
In other words, we could see this sequence of events:

    rbd client                      parent image/osd
    ----------                      ----------------
    original object ENOENT;
        issue parent read
                                    respond to parent read
                                    child image flattened
    original image header refresh
             <--- original object written independently here
    parent read response received

Add code to rbd_img_parent_read_callback() to detect when a clone's
parent image has disappeared (as evidenced by its parent overlap
becoming 0), and re-submit the original read request in that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
392a9dad7e rbd: detect when clone image is flattened
A format 2 clone image can be the subject of a "flatten" operation,
during which all of its data gets "copied up" from its parent image,
leaving the image fully populated.  Once this is complete, the
clone's association with the parent is abolished.

Since this can occur when a clone is mapped, we need to detect when
it has occurred and handle it accordingly.  We know an image has
been flattened when we know it at one time had a parent, but we have
learned (via a "get_parent" object class method call) it no longer
has one.

There might be in-flight requests at the point we learn an image has
been flattened, so we can't simply clean up parent data structures
right away.  Instead, we'll drop the initial parent reference when
the parent has disappeared (rather than when the image gets
destroyed), which will allow the last in-flight reference to clean
things up when it's complete.

We leverage the fact that a zero parent overlap renders an image
effectively unlayered.  We set the overlap to 0 at the point we
detect the clone image has flattened, which allows the unlayered
behavior to take effect immediately, while keeping other parent
structures in place until in-flight requests to complete.

This and the next few patches resolve:
    http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
a2acd00e79 rbd: reference count parent requests
Keep a reference count for uses of the parent information for an rbd
device.

An initial reference is set in rbd_img_request_create() if the
target image has a parent (with non-zero overlap).  Each image
request for an image with a non-zero parent overlap gets another
reference when it's created, and that reference is dropped when the
request is destroyed.

The initial reference is dropped when the image gets torn down.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:44 -05:00
Alex Elder
e93f315235 rbd: define parent image request routines
Define rbd_parent_request_create() and rbd_parent_request_destroy()
to handle the creation of parent image requests submitted for
layered image objects.  For simplicity, let rbd_img_request_put()
handle dropping the reference to any image request (parent or not),
and call whichever destructor is appropriate on the last put.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:44 -05:00
Alex Elder
fb65d2284c rbd: define rbd_dev_unparent()
Define rbd_dev_unparent() to encapsulate cleaning up parent data
structures from a layered rbd image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:43 -05:00
Alex Elder
8785b1d487 rbd: don't release write request until necessary
Previously when a layered write was going to involve a copyup
request, the original osd request was released before submitting the
parent full-object read.  The osd request for the copyup would then
be allocated in rbd_img_obj_parent_read_full_callback().

Shortly we will be handling the event of mapped layered images
getting flattened, and when that occurs we need to resubmit the
original request.  We therefore don't want to release the osd
request until we really konw we're going to replace it--in the
callback function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:43 -05:00
Alex Elder
642a25375f rbd: get parent info on refresh
Get parent info for format 2 images on every refresh (rather than
just during the initial probe).  This will be needed to detect the
disappearance of the parent image in the event a mapped image
becomes unlayered (i.e., flattened).  Avoid leaking the previous
parent spec on the second and subsequent times this information is
requested by dropping the previous one (if any) before updating it.
(Also, extract the pool id into a local variable before assigning
it into the parent spec.)

Switch to using a non-zero parent overlap value rather than the
existence of a parent (a non-null parent_spec pointer) to determine
whether to mark a request layered.  It will soon be possible for
a layered image to become unlayered while a request is in flight.

This means that the layered flag for an image request indicates that
there was a non-zero parent overlap at the time the image request
was created.  The parent overlap can change thereafter, which may
lead to special handling at request submission or completion time.

This and the next several patches are related to:
    http://tracker.ceph.com/issues/3763

NOTE:
If an error occurs while refreshing the parent info (i.e.,
requesting it after initial probe), the old parent info will
persist.  This is not really correct, and is a scenario that needs
to be addressed.  For now we'll assert that the failure mode is
unlikely, but the issue has been documented in tracker issue 5040.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:33 -05:00
Alex Elder
70cf49cfc7 rbd: ignore zero-overlap parent
An rbd clone image that has an overlap with its parent of 0 is
effectively not a layered image at all.  Detect this case and treat
such an image as non-layered.  Issue a warning to be sure the user
knows what's going on.

This resolves:
    http://tracker.ceph.com/issues/5028

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:12:41 -05:00
Alex Elder
b91f09f17b rbd: support reading parent page data for writes
Currently, rbd_img_obj_parent_read_full() assumes the incoming
object request contains bio data.  But if a layered image is part of
a multi-layer stack of images it will result in read requests of
page data to parent images.

This is handling the same kind of issue as was resolved by this
commit:
    5b2ab72d  rbd: support reading parent page data

This resolves:
    http://tracker.ceph.com/issues/5027

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:12:40 -05:00
Alex Elder
ebda6408f2 rbd: fix parent request size assumption
The code that reads object data from the parent for a copyup on
write request currently assumes that the size of that request is the
size of a "full" object from the original target image.

That is not necessarily the case.  The parent overlap could reduce
the request size below that.  To fix that assumption we need to
record the number of pages in the copyup_pages array, for both an
image request and an object request.  Rename a local variable in
rbd_img_obj_parent_read_full_callback() to reflect we're recording
the length of the parent read request, not the size of the target
object.

This resolves:
    http://tracker.ceph.com/issues/5038

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:09:01 -05:00
Linus Torvalds
2d4fe27850 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver update from Matthew Wilcox:
 "Lots of exciting new features in the NVM Express driver this time,
  including support for emulating SCSI commands, discard support and the
  ability to submit per-sector metadata with I/Os.

  It's still mostly bugfixes though!"

* git://git.infradead.org/users/willy/linux-nvme: (27 commits)
  NVMe: Use user defined admin ioctl timeout
  NVMe: Simplify Firmware Activate code slightly
  NVMe: Only clear the enable bit when disabling controller
  NVMe: Wait for device to acknowledge shutdown
  NVMe: Schedule timeout for sync commands
  NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
  NVMe: Device specific stripe size handling
  NVMe: Split non-mergeable bio requests
  NVMe: Remove dead code in nvme_dev_add
  NVMe: Check for NULL memory in nvme_dev_add
  NVMe: Fix error clean-up on nvme_alloc_queue
  NVMe: Free admin queue on request_irq error
  NVMe: Add scsi unmap to SG_IO
  NVMe: queue usage fixes in nvme-scsi
  NVMe: Set TASK_INTERRUPTIBLE before processing queues
  NVMe: Add a character device for each nvme device
  NVMe: Fix endian-related problems in user I/O submission path
  NVMe: Fix I/O cancellation status on big-endian machines
  NVMe: Fix sparse warnings in scsi emulation
  NVMe: Don't fail initialisation unnecessarily
  ...
2013-05-09 16:35:00 -07:00
Keith Busch
94f370cab6 NVMe: Use user defined admin ioctl timeout
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-09 16:03:50 -04:00
Alex Elder
c48f3f86e2 rbd: kill rbd_img_request_get()
Get rid of rbd_img_request_get(), because it isn't used, and maybe
won't ever be needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:17:00 -05:00
Alex Elder
1f3ef78861 rbd: only set up watch for mapped images
Any changes to parent images are immaterial to any mapped clone.
So there is no need to have a watch event registered on header
objects except for the header object of an image that is mapped.
In fact, a watch request is a write operation, and we may only
have read access to a parent image.

We can't set up the watch request until we know the name of the
header object though.  So pass a flag to rbd_dev_image_probe() to
indicate whether this probe is for a mapping or for a parent image.

Change the second parameter to rbd_dev_header_watch_sync() be
Boolean while we're at it.

This resolves:
    http://tracker.ceph.com/issues/4941

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:55 -05:00
Alex Elder
7ce4eef7b5 rbd: set mapping read-only flag in rbd_add()
The rbd_dev->mapping field for a parent image is not meaningful.
Since rbd_image_probe() is used both for images being mapped and
their parents, it doesn't make sense to set that flag in that
function.

So move the setting of the mapping.read_only flag out of
rbd_dev_image_probe() and into rbd_add() instead.

This resolves:
    http://tracker.ceph.com/issues/4940

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:50 -05:00
Alex Elder
5b2ab72d36 rbd: support reading parent page data
Currently, rbd_img_parent_read() assumes the incoming object request
contains bio data.  But if a layered image is part of a multi-layer
stack of images it will result in read requests of page data to parent
images.

Fortunately, it's not hard to add support for page data.

This resolves:
    http://tracker.ceph.com/issues/4939

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:25 -05:00
Alex Elder
91c6febb38 rbd: fix an incorrect assertion condition
In rbd_img_obj_parent_read_full_callback() there is an assertion
intended to verify the size of the image request for a full parent
read was the size of the original request's target object.  But
assertion was looking at the parent image order rather than the
original one, and these values can differ.

Fix that.

This resolves:
    http://tracker.ceph.com/issues/4938

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:10 -05:00
Alex Elder
2df3fac758 rbd: define rbd_dev_v2_header_info()
This rearranges rbd_dev_v2_refresh() so it works more like
rbd_dev_v1_header_info().  While format 1 images need to read the
whole header object to get any information, format 2 can collect
almost all information selectively.  So the one-time initialization
will remain in a separate function--based on rbd_dev_v2_probe().

Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have
it call rbd_dev_v2_header_onetime() if it's being called for the
first time for the given rbd device.

Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and
remove the image size and snapshot context calls it held in
common with the refresh function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:52 -05:00
Alex Elder
99a41ebcee rbd: get rid of trivial v1 header wrappers
Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and
rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls
in their place.

Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to
be more generic (it will better reflect what happens with format 2
images).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:46 -05:00
Alex Elder
30d60ba2f2 rbd: simplify rbd_dev_v1_probe()
An rbd_dev structure's fields are all zero-filled for an initial
probe, so there's no need to explicitly zero the parent_spec
and parent_overlap fields in rbd_dev_v1_probe().  Removing these
assignments makes rbd_dev_v1_probe() *almost* trivial.

Move the dout() message that announces discovery of an image into
rbd_dev_image_probe(), generalize to support images in either format
and only show it if an image is fully discovered.

This highlights that are some unnecessary cleanups in the error
path for rbd_dev_v1_probe(), so they can be removed.

Now rbd_dev_v1_probe() *is* a trivial wrapper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:41 -05:00
Alex Elder
662518b128 rbd: update in-core header directly
Now that rbd_header_from_disk() only fills in one-time fields once,
we can extend it slightly so it releases the other fields before
replacing their values.  This way there's no need to pass a
temporary buffer and then copy all the results in.  Just use the rbd
device header structure in rbd_header_from_disk() so its values get
updated directly.

Note that this means we need to take the header semaphore at the
point we update things.  So pass the rbd_dev rather than the address
of its header as its first argument to rbd_header_from_disk(), and
have it return an error code.

As a result, rbd_dev_v1_header_read() does all the work,
rbd_read_header() becomes unnecessary, and rbd_dev_v1_refresh()
becomes a very simple wrapper.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:37 -05:00
Alex Elder
bb23e37acb rbd: refactor rbd_header_from_disk()
This rearranges rbd_header_from_disk so that it:
    - allocates the snapshot context right away
    - keeps results in local variables, not changing the passed-in
      header until it's known we'll succeed
    - does initialization of set-once fields in a header only if
      they have not already been set

The last point is moot at the moment, because rbd_read_header()
(the only caller) always supplies a zero-filled header buffer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:33 -05:00
Alex Elder
46578dcdca rbd: zero format 1 header structure earlier
The passed-in header structure is zeroed in rbd_header_from_disk().
Instead, have the caller do it.  Note that there are two callers,
rbd_dev_v1_refresh() and rbd_dev_v1_probe().  The latter already has
a zeroed header structure so zeroing it isn't necessary there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:28 -05:00
Alex Elder
f35a4dee14 rbd: set the mapping size and features later
Defer setting the size and features fields of a mapped image until
after the Linux disk structure is set up.  Set the capacity of the
disk after that.

Rearrange the definition of rbd_image_header, separating the fields
that are set only once from those that can be updated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:00 -05:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Matthew Wilcox
ab3ea5bf37 NVMe: Simplify Firmware Activate code slightly
Add definitions for the three Firmware Activate actions, and change the
SCSI translation code to construct the command into a temporary variable
instead of translating the endianness back-and-forth.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
2013-05-08 09:55:05 -04:00
Matthew Wilcox
44af146a84 NVMe: Only clear the enable bit when disabling controller
Many of the bits in the Controller Configuration register may only be
modified when the Enable bit is clear.  Clearing them at the same time
as the Enable bit might be OK, but let's play it safe and only touch the
Enable bit.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:54:31 -04:00
Matthew Wilcox
ba47e3865e NVMe: Wait for device to acknowledge shutdown
A recent update to the specification makes it clear that the host
is expected to wait for the device to acknowledge the Enable bit
transitioning to 0 as well as waiting for the device to acknowledge a
transition to 1.

Reported-by: Khosrow Panah <Khosrow.Panah@idt.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:53:49 -04:00
Alex Elder
51344a38ba rbd: always set read-only flag in rbd_add()
Hold off setting the read-only flag in rbd_add() for an image being
mapped until we have successfully probed the image.  At that point
we know whether it's a snapshot mapping or not, so we can set the
read-only flag in that one place rather than doing so (for
snapshots) in rbd_dev_mapping_set().  To do this, pass a flag to the
image probe routine indicating whether we want a read-only mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:12 -05:00
Alex Elder
6d80b130d5 rbd: kill rbd_dev_clear_mapping()
This function is a duplicate of rbd_dev_mapping_clear(), and was
added by mistake.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:12 -05:00
Alex Elder
8f4b7d9821 rbd: don't look up snapshot id in rbd_dev_mapping_set()
Currently rbd_dev_mapping_set() looks up the snapshot id for the
snapshot whose name is found in the rbd device's spec structure.

That function gets called by rbd_dev_device_setup(), which is
called by rbd_add() *after* rbd_dev_image_probe().  If the
image probe succeeds, the rbd device's spec will already have
been updated to include names and ids for all fields.

Therefore there's no need to look up the snapshot id in
rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:11 -05:00
Alex Elder
c734b79655 rbd: don't print warning if not mapping a parent
The presence of the LAYERING bit in an rbd image's feature mask does
not guarantee the image actually has a parent image.  Currently that
bit is set only when a clone (i.e., image with a parent) is created,
but it is (currently) not cleared if that clone gets flattened back
into a "normal" image.  A "parent_id" query will leave the
parent_spec for the image being mapped a null pointer, but will not
return an error.

Currently, whenever an image with the LAYERED feature gets mapped, a
warning about the use of layered images gets printed.  But we don't
want to do this for a flattened image, so print the warning only
if we find there is a parent spec after the probe.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:11 -05:00
Roger Pau Monne
b7649158a0 xen-blkfront: use a different scatterlist for each request
In blkif_queue_request blkfront iterates over the scatterlist in order
to set the segments of the request, and in blkif_completion blkfront
iterates over the raw request, which makes it hard to know the exact
position of the source and destination memory positions.

This can be solved by allocating a scatterlist for each request, that
will be keep until the request is finished, allowing us to copy the
data back to the original memory without having to iterate over the
raw request.

Oracle-Bug: 16660413 - LARGE ASYNCHRONOUS READS APPEAR BROKEN ON 2.6.39-400
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-and-Tested-by: Anne Milicia <anne.milicia@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-08 08:46:51 -04:00
Alex Elder
29334ba49c rbd: kill rbd_update_mapping_size()
Since rbd_update_mapping_size() is now a trivial wrapper, just open
code it in its two callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:45:39 -05:00
Alex Elder
00a653e216 rbd: update capacity in rbd_dev_refresh()
When a mapped image changes size, we change the capacity recorded
for the Linux disk associated with it, in rbd_update_mapping_size().
That function is called in two places--the format 1 and format 2
refresh routines.

There is no need to set the capacity while holding the header
semaphore.  Instead, do it in the common rbd_dev_refresh(), using
the logic that's already there to initiate disk revalidation.

Add handling in the request function, just in case a request
that exceeds the capacity of the device comes in (perhaps one
that was started before a refresh shrunk the device).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:45:30 -05:00
Alex Elder
e627db085e rbd: revalidate only for mapping size changes
This commit:
    d98df63e rbd: revalidate_disk upon rbd resize
instituted a call to revalidate_disk() to notify interested parties
that a mapped image has changed size.  This works well, as long as
the the rbd device doesn't map a snapshot.

A snapshot will never change size.  However, the base image the
snapshot is associated with can, and it can do so while the snapshot
is mapped.

The problem is that the test for the size is looking at the size of
the base image, not the size of the mapped snapshot.  This patch
corrects that.

Update the warning message shown in the event of error, and move
it into the callers.

This resolves:
    http://tracker.ceph.com/issues/4911

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:40:48 -05:00
Alex Elder
49ece55428 rbd: fix leak of format 2 snapshot context
When rbd_dev_v2_refresh() is called, the rbd device already has a
snapshot context associated with it.  But that never gets freed,
the pointer just gets overwritten.

Fix this by dropping the rbd device's reference to the snapshot
context before overwriting the pointer.

Because ceph_put_snap_context() already handles for a null pointer
we don't need to check for that (for the probe case, where no
context has yet been assigned).

This resolves:
    http://tracker.ceph.com/issues/4912

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:38:30 -05:00
Linus Torvalds
292088ee03 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "A couple of fixes + getting rid of __blkdev_put() return value"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  proc: Use PDE attribute setting accessor functions
  make blkdev_put() return void
  block_device_operations->release() should return void
  mtd_blktrans_ops->release() should return void
  hfs: SMP race on directory close()
2013-05-07 15:14:53 -07:00
Roger Pau Monne
bb642e8315 xen-blkback: allocate list of pending reqs in small chunks
Allocate pending requests in smaller chunks instead of allocating them
all at the same time.

This change also removes the global array of pending_reqs, it is no
longer necessay.

Variables related to the grant mapping have been grouped into a struct
called "grant_page", this allows to allocate them in smaller chunks,
and also improves memory locality.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-07 09:42:17 -04:00
Al Viro
db2a144bed block_device_operations->release() should return void
The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-07 02:16:21 -04:00
Linus Torvalds
91f8575685 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Alex Elder:
 "This is a big pull.

  Most of it is culmination of Alex's work to implement RBD image
  layering, which is now complete (yay!).

  There is also some work from Yan to fix i_mutex behavior surrounding
  writes in cephfs, a sync write fix, a fix for RBD images that get
  resized while they are mapped, and a few patches from me that resolve
  annoying auth warnings and fix several bugs in the ceph auth code."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (254 commits)
  rbd: fix image request leak on parent read
  libceph: use slab cache for osd client requests
  libceph: allocate ceph message data with a slab allocator
  libceph: allocate ceph messages with a slab allocator
  rbd: allocate image object names with a slab allocator
  rbd: allocate object requests with a slab allocator
  rbd: allocate name separate from obj_request
  rbd: allocate image requests with a slab allocator
  rbd: use binary search for snapshot lookup
  rbd: clear EXISTS flag if mapped snapshot disappears
  rbd: kill off the snapshot list
  rbd: define rbd_snap_size() and rbd_snap_features()
  rbd: use snap_id not index to look up snap info
  rbd: look up snapshot name in names buffer
  rbd: drop obj_request->version
  rbd: drop rbd_obj_method_sync() version parameter
  rbd: more version parameter removal
  rbd: get rid of some version parameters
  rbd: stop tracking header object version
  rbd: snap names are pointer to constant data
  ...
2013-05-06 13:11:19 -07:00
Linus Torvalds
736a2dd257 Lots of virtio work which wasn't quite ready for last merge window. Plus
I dived into lguest again, reworking the pagetable code so we can move
 the switcher page: our fixmaps sometimes take more than 2MB now...
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRga7lAAoJENkgDmzRrbjx/yIQAKpqIBtxOJeYH3SY+Uoe7Cfp
 toNYcpJEldvb0UcWN8M2cSZpHoxl1SUoq9djwcM29tcKa7EZAjHaGtb/Q1qMTDgv
 +B3WAfiGU2pmXFxLAkbrlLNGnysy24JspqJQ5hcYV84EiBxQdZp+nCYgOphd+GMK
 ww16vo9ya8jFjzt3GeRp/Heb3vEzV4Cp6BC3i0m8A3WNpEpbRb66pqXNk5o8ggJO
 SxQOKSXmUM+0m+jKSul5xn3e2Ls2LOrZZ8/DIHA+gW66N4Zab7n2/j1Q9VRxb4lh
 FqnR7KwgBX8OCh9IsBDqQYS7MohvMYge6eUdLtFrq84jvMleMEhrC8q9v2tucFUb
 5t18CLwvyK7Gdg6UCKiZ7YSPcuURAILO16al9bh5IseeBDsuX+43VsvQoBmFn9k6
 cLOVTZ6BlOmahK5PyRYFSvLa9Rxzr/05Mr7oYq9UgshD9io78dnqczFYIORF53rW
 zD7C4HuTZfYJFfNd0wAJ0RfVXnf8QvDlMdo7zPC26DSXNWqj8OexCY0qqSWUB+2F
 vcfJP6NkV4fZB8aawWIFUVwc64yqtt2uPVLa7ATZWqk16PgKrchGewmw3tiEwOgu
 1l7xgffTRRUIJsqaCZoXdgw3yezcKRjuUBcOxL09lDAAhc+NxWNvzZBsKp66DwDk
 yZQKn0OdXnuf0CeEOfFf
 =1tYL
 -----END PGP SIGNATURE-----

Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull virtio & lguest updates from Rusty Russell:
 "Lots of virtio work which wasn't quite ready for last merge window.

  Plus I dived into lguest again, reworking the pagetable code so we can
  move the switcher page: our fixmaps sometimes take more than 2MB now..."

Ugh.  Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
Hopefully correctly resolved.

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
  caif_virtio: Remove bouncing email addresses
  lguest: improve code readability in lg_cpu_start.
  virtio-net: fill only rx queues which are being used
  lguest: map Switcher below fixmap.
  lguest: cache last cpu we ran on.
  lguest: map Switcher text whenever we allocate a new pagetable.
  lguest: don't share Switcher PTE pages between guests.
  lguest: expost switcher_pages array (as lg_switcher_pages).
  lguest: extract shadow PTE walking / allocating.
  lguest: make check_gpte et. al return bool.
  lguest: assume Switcher text is a single page.
  lguest: rename switcher_page to switcher_pages.
  lguest: remove RESERVE_MEM constant.
  lguest: check vaddr not pgd for Switcher protection.
  lguest: prepare to make SWITCHER_ADDR a variable.
  virtio: console: replace EMFILE with EBUSY for already-open port
  virtio-scsi: reset virtqueue affinity when doing cpu hotplug
  virtio-scsi: introduce multiqueue support
  virtio-scsi: push vq lock/unlock into virtscsi_vq_done
  virtio-scsi: pass struct virtio_scsi to virtqueue completion function
  ...
2013-05-02 14:14:04 -07:00
Keith Busch
78f8d2577b NVMe: Schedule timeout for sync commands
Schedule a timeout on sync commands in case the command times out and
the device is not being polled for timeouts. This prevents device removal
from hanging forever if the device has stopped responding.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 15:36:02 -04:00
Keith Busch
f410c680b5 NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
This adds support for namespaces with separate meta-data formats in the
submit io ioctl. The meta-data buffer has to be a contiguous, so such
a buffer is allocated and the mapped user pages are copied to/from this
buffer for write/read commands.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 15:35:09 -04:00
Keith Busch
159b67d7ae NVMe: Device specific stripe size handling
We have an nvme device that has a concept of a stripe size. IO requests
that do not transfer data crossing a stripe boundary has greater
performance compared to IO that does cross it. This patch sets the
stripe size for the device if the device and vendor ids match one with
this feature and splits IO requests that cross the stripe boundary.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:41:05 -04:00
Keith Busch
427e970801 NVMe: Split non-mergeable bio requests
It is possible a bio request can not be submitted as a single NVMe IO
command if the bio_vec is not mergeable with the NVMe PRP alignement
constraints. This condition was handled by submitting an IO for the
mergeable portion then submitting a follow on IO for the remaining data
after the previous IO completes. The remainder to be sent was tracked
by manipulating the bio->bi_idx and bio->bi_sector. This patch splits
the request as many times as necessary and submits the bios together.

Since submitting the bio may cause it to be requeued on split,
nvme_resubmit_bios had to be modified to remove the wait queue when
the bio list is empty prior to submitting the bio since a split would
have added the wait queue a second time, corrupting the wait queue head
task list.

There are a few other benefits from doing this: it fixes a potential
issue with the previous handling of a non-mergeable bio as the requeuing
method could would use an unlocked nvme_queue if the callback isn't
invoked on the queue's associated cpu; it will be possible to retry a
failed bio if desired at some later time since it does not manipulate
the original bio; the bio integrity extensions require the bio to be in
its original condition for the checks to work correctly if we implement
the end-to-end data protection in the future.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:38:59 -04:00
Keith Busch
cbb6218fd4 NVMe: Remove dead code in nvme_dev_add
There is no situation that could occur where we could error out of this
function and require cleaning up allocated namespaces.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:36:45 -04:00
Keith Busch
a9ef4343af NVMe: Check for NULL memory in nvme_dev_add
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:35:44 -04:00
Keith Busch
68b8eca5f8 NVMe: Fix error clean-up on nvme_alloc_queue
The nvme_queue's depth is not set if we fail to allocate submission queue
entries, which was being used to determine how much coherent memory to
free on error. Use the depth variable instead.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:34:35 -04:00
Keith Busch
025c557a71 NVMe: Free admin queue on request_irq error
Fixes a potential memory leak if requesting the admin queue irq fails.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:33:53 -04:00
Keith Busch
ec50373350 NVMe: Add scsi unmap to SG_IO
Translates a scsi unmap request from SG_IO ioctl to NVMe
data-set-management deallocate.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:32:08 -04:00
Keith Busch
14385de117 NVMe: queue usage fixes in nvme-scsi
Fixes nvme queue usages in scsi-to-nvme translation code to not get
a queue more often than it is being put, and not use the queue in an
unsafe way without it being locked.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:30:53 -04:00
Alex Elder
b5b09be30c rbd: fix image request leak on parent read
When a read for a layered image object finds the target object
doesn't exist, a read image request for the parent image is created
and submitted.  When that completes, the callback routine was
not releasing that parent image request.  Fix that.

The slab allocation stuff just added has greatly simplified the
search for the source of this memory leak.

This resolves:
    http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 12:15:28 -05:00
Alex Elder
78c2a44aae rbd: allocate image object names with a slab allocator
The names of objects used for image object requests are always fixed
size.  So create a slab cache to manage them.  Define a new function
rbd_segment_name_free() to match rbd_segment_name() (which is what
supplies the dynamically-allocated name buffer).

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:30 -05:00
Alex Elder
868311b1eb rbd: allocate object requests with a slab allocator
Create a slab cache to manage rbd_obj_request allocation.  We aren't
using a constructor, and we'll zero-fill object request structures
when they're allocated.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:30 -05:00
Alex Elder
f907ad5596 rbd: allocate name separate from obj_request
The next patch will define a slab allocator for a object requests.
To use that we'll need to allocate the name of an object separate
from the request structure itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:29 -05:00
Alex Elder
1c2a9dfe21 rbd: allocate image requests with a slab allocator
Create a slab cache to manage rbd_img_request allocation.  Nothing
too fancy at this point--we'll still initialize everything at
allocation time (no constructor)

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:29 -05:00
Alex Elder
30d1cff817 rbd: use binary search for snapshot lookup
Use bsearch(3) to make snapshot lookup by id more efficient.  (There
could be thousands of snapshots, and conceivably many more.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:17 -05:00
Alex Elder
15228ede7d rbd: clear EXISTS flag if mapped snapshot disappears
This functionality inadvertently disappeared in the last patch.

Image snapshots can get removed at just about any time.  In
particular it can disappear even if it is in use by an rbd
client as a mapped image.

The rbd client deals with such a disappearance by responding to new
requests with ENXIO.  This is implemented by each rbd device
maintaining an EXISTS flag, which is normally set but cleared if a
snapshot disappears.

This patch (re-)implements the clearing of that flag.

Whenever mapped image header information is refreshed, if the
mapping is for a snapshot, verify the mapped snapshot is still
present in the updated snapshot context.  If it is not, clear the
flag.

It is not necessary to check this in the initial probe, because the
probe will not succeed if the snapshot doesn't exist.

This resolves:
    http://tracker.ceph.com/issues/4880

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:57:03 -05:00
Alex Elder
33dca39f5c rbd: kill off the snapshot list
We no longer use the snapshot list for anything.  When we need to
look up a snapshot name, id, size, or feature mask, we just do it
directly rather than relying on this list being updated with every
refresh.  The main reason it existed was for the benefit of the
device/sysfs entries that previously were associated with snapshots.

So get rid of the snapshot list, and struct rbd_snap, and the
hundreds of lines of code that supported them.

This resolves:
    http://tracker.ceph.com/issues/4868

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:22 -07:00
Alex Elder
2ad3d7167e rbd: define rbd_snap_size() and rbd_snap_features()
This patch defines a handful of new functions that will allow
us to get rid of the rbd device structure's list of snapshots.

Define rbd_snap_id_by_name() to look up a snapshot id given its
name.  This is efficient for format 1 images but not for format 2.
Fortunately it only gets called at mapping time so it's not that
critical.

Use rbd_snap_id_by_name() to find out the id for a snapshot getting
mapped, and pass that id to new functions rbd_snap_size() and
rbd_snap_features() to look up information about a given snapshot's
size and feature mask given its snapshot id.  All this gets done
in rbd_dev_mapping_set().

As a result, snap_by_name() is no longer needed, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:20 -07:00
Alex Elder
54cac61fb6 rbd: use snap_id not index to look up snap info
In order to align with what was needed for format 1 rbd images,
rbd_dev_v2_snap_info() was set up to take as argument an index into
the array of snapshot ids in a rbd device's snapshot context.

This switches that around, so we pass the snapshot id instead.
In doing this, rbd_snap_name() now returns a dynamically-allocated
string rather than a fixed one, so there's no need to make a
duplicate in its caller, rbd_dev_spec_update().

This means the following functions take a snapshot id where they
previously used an index value:
    rbd_dev_snap_info()
    rbd_dev_v1_snap_info()
    rbd_dev_v2_snap_info()

A new function, rbd_dev_snap_index(), determines the snap index for
format 1 images and uses it to look up the name.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:19 -07:00
Alex Elder
9682fc6d3a rbd: look up snapshot name in names buffer
Rather than scanning the list of snapshot structures for it, scan
the snapshot context buffer containing snapshot names in order to
determine for a format 1 image the name associated with a given
snapshot id.

Pull out the part of rbd_dev_v1_snap_info() that does this scan into
a new function, _rbd_dev_v1_snap_name().  Have that function return
a dynamically-allocated copy of the name, and don't duplicate it in
rbd_dev_v1_snap_info().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:18 -07:00
Alex Elder
dedc81ea84 rbd: drop obj_request->version
Nothing ever uses the version field maintained in the object request
structure any more, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:17 -07:00
Alex Elder
e2a58ee55b rbd: drop rbd_obj_method_sync() version parameter
Only NULL is passed as the version argument to rbd_obj_method_sync(),
so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:16 -07:00
Alex Elder
cc4a38bdd5 rbd: more version parameter removal
Continued from the last patch, more parameters that can go away
because we no longer have a need to track object versions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:15 -07:00
Alex Elder
7097f8df6e rbd: get rid of some version parameters
Several functions in rbd have parameters meant to allow the version
of an object to be passed in or out.  The purpose of those was to
allow the version of a header object to be maintained, but we no
longer do that.  As a result, these parameters are never actually
needed or used, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:14 -07:00
Alex Elder
b21ebdddeb rbd: stop tracking header object version
The rbd code takes care to maintain the version of the header
object.  This was done in hopes of using it to detect a change in
the object between reading it and setting up a watch request to
be notified of changes.

The mechanism was never fully implemented, however.  And we now
avoid the original problem by setting up the watch request before
ever reading the content of the header.

The osd doesn't interpret the object version supplied with a WATCH
osd op, nor does it use the version supplied with a NOTIFY_ACK op
(we can just supply 0 for both).  There is therefore no need to
maintain the header's object version any more, so stop doing so.

We'll be able to simplify some more rbd code in the next few patches
as a result of this.

This resolves:
    http://tracker.ceph.com/issues/3952

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:13 -07:00
Alex Elder
cb75223d2b rbd: snap names are pointer to constant data
Make explicit that snapshot names don't change by making functions
return and take parameters that that point to const qualified data.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:12 -07:00
Alex Elder
a3fbe5d447 rbd: don't revalidate so much
Whenever a header object event causes a mapped rbd image to refresh
its header information, revalidate_disk() is being called.  This was
done in rbd_dev_refresh() outside the control mutex in order to
avoid a lock inversion.  Although a an event like this *might*
indicate the image has changed size, most of the time it does not.

Record the image size before and after the refresh, and only
call revalidate_disk() if it changes.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:11 -07:00
Alex Elder
96882f55c4 rbd: fix up the layering warning message
A warning gets spewed for any image being probed, including parent
images.  Set up a condition such that the warning message only gets
printed for the image being mapped, not any of its parents.

Also, I didn't like the way the warning ended up being so long.
Make it a terse warning instead.  People experimenting with layering
will know what the message means.

This is part of:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:10 -07:00
Alex Elder
812164f8c3 ceph: use ceph_create_snap_context()
Now that we have a library routine to create snap contexts, use it.

This is part of:
    http://tracker.ceph.com/issues/4857

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:09 -07:00
Alex Elder
b536f69a3a rbd: set up devices only for mapped images
Stop setting up Linux devices during the image probe operation.
Instead, set up the devices as a separate step after the image
probe, in rbd_add().

A consequence of this is that only mapped images get devices
assigned to them, which is pretty sweet.

This resolves:
    http://tracker.ceph.com/issues/4774

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:07 -07:00
Alex Elder
8ad42cd0c0 rbd: don't have device release destroy rbd_dev
Currently an rbd_device structure gets destroyed from the release
routine for the device embedded within it.  Stop doing that, instead
calling rbd_dev_image_release() right after rbd_bus_del_dev()
wherever the latter is called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:05 -07:00
Alex Elder
6fd48b3be9 rbd: define rbd_dev_unprobe()
Define a new function rbd_dev_unprobe() which undoes state changes
that occur from calling rbd_dev_v1_probe() or rbd_dev_v2_probe().
Note that this is a superset of rbd_header_free(), which is now
getting removed (it seems to have been used improperly anyway).

Flesh out rbd_dev_image_release() so it undoes exactly what
rbd_dev_image_probe() does.

This means that:
    - rbd_dev_device_release() gets called when the last device
      reference gets dropped;
    - that undoes everything done by the rbd_dev_device_setup() call
      at the end of rbd_dev_image_probe() (and nothing more), ending
      by calling rbd_dev_image_release(); and
    - rbd_dev_image_release() undoes everything else done by
      rbd_dev_image_probe() (and this includes a call to
      rbd_dev_unprobe().

This means the image and device portions of an rbd device are fairly
cleanly separated now, so error paths should be a little easier to
verify than they used to be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:04 -07:00
Alex Elder
200a6a8be5 rbd: don't destroy rbd_dev in device release function
Rename rbd_dev_probe_finish() to be rbd_dev_device_setup().  Its
purpose is to set up the Linux side of an rbd device mapping.
Rename rbd_dev_release() to be rbd_dev_device_release(), making
it more obvious it serves as the inverse of the setup function
(or it will).

Encapsulate some of what was done in rbd_dev_release() into a new
function rbd_dev_image_release(), which serves as the inverse of
setting up the ceph side of the mapped rbd image.

Define a new helper rbd_dev_clear_mapping() to simply zero out the
fields of a mapping structure--the inverse of rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:03 -07:00
Alex Elder
79ab7558aa rbd: drop module later
Drop the module reference at the end of rbd_remove() for symmetry
with adding a reference at the top of rbd_add().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:02 -07:00
Alex Elder
b644de2ba0 rbd: set up watch in rbd_dev_image_probe()
Move setting up the watch request for an image so it's done in
rbd_dev_image_probe() rather than rbd_dev_probe_finish().  Move
it all the way up to before doing the initial probe.  This avoids
a potential race condition, in which we get (and use) the initial
snapshot context for an image, and it gets changed between that
time and the time we get the watch set up.

This resolves:
    http://tracker.ceph.com/issues/3871

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:01 -07:00
Alex Elder
96f03e08f9 rbd: don't bother checking whether order changes
When a format 2 image is refreshed, code is in place to verify that
the object order never changes from what it was originally.  This
relies on the fact that the refresh will occur *after* an initial
load of information about the image.

An upcoming patch makes it possible for the refresh to occur first,
so we can no longer make this order check.  The order really can't
ever change anyway--this was just a sanity check.  So get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:00 -07:00
Alex Elder
0d8189e175 rbd: don't clean up watch in device release function
Currently, a watch on an rbd device header object gets torn down
when its final Linux device reference gets dropped.  Instead, tear
it down when removing the device.  If an error occurs cleaning up
the watch event when unmapping, abort the unmap request.

All images (including parents) still get watch requests set up, so
tear these down also, in rbd_dev_remove_parent().  For now, ignore
any errors that occur in this case.

Get rid of local variable "rc" in rbd_remove(); use "ret" instead
(they both somehow ended up defined in the function and only one is
needed).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:59 -07:00
Alex Elder
332bb12db9 rbd: define rbd_header_name()
Define a new function rbd_header_name(), which allocates and formats
the name of the header object for the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:58 -07:00
Alex Elder
9bb81c9be9 rbd: move more initialization into rbd_dev_image_probe()
Move a block of initialization related to the "ceph-side" of an rbd
image out of rbd_dev_probe_finish() and into rbd_dev_image_probe().

Add appropriate error handling to clean things up in the event any
of these new functions return an error.

We know that rbd_dev_snaps_update(), rbd_dev_spec_update(), and
rbd_dev_probe_parent() all clean up after themselves before they
return an error, so no special cleanup is required except when an
earlier call succeeds.  Since rbd_dev_spec_update() only updates the
spec field (whose cleanup will be handled by dropping the last
reference to the spec) there is no cleanup action associatied with
that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:57 -07:00
Alex Elder
5de10f3b0c rbd: probe for the parent earlier
Probe for a parent device earlier in rbd_dev_probe_finish(), before
starting to set up the Linux side of the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:56 -07:00
Alex Elder
2e93bf9e46 rbd: remove parent devices on probe error
When an error occurs while finishing probing a device it is assumed
that parent devices get cleaned up when deleting a device.  They
don't.  Add a call to clean them up.  Note that this means the
parent spec will already be cleaned up so it doesn't have to be
in one of the rbd_add() error paths.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:55 -07:00
Alex Elder
ad945fc1da rbd: fix rbd_dev_remove_parent()
In certain error paths, it is possible for an rbd device to have a
parent spec but no parent rbd_dev.  In rbd_dev_remove_parent() use
the parent field rather than parent_spec in determining whether to
try to remove any parent devices.  Use assertions to indicate that
any non-null parent pointer has parent_spec associated with it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:54 -07:00
Alex Elder
b480815a17 rbd: kill __rbd_remove()
The function __rbd_remove() is used in two spots, and it's fairly
simple.  It combines cleanup of part of the ceph-side state as well
as cleaning up the Linux-side state.  Just open code it in the two
callers and eliminate the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:53 -07:00
Alex Elder
d1cf578845 rbd: set mapping info earlier
Set the mapping size and features earlier in rbd_dev_probe_finish().

Define rbd_dev_mapping_clear() as an inverse for setting those
fields, and use it both in error handling in rbd_dev_image_probe()
and in the final cleanup in rbd_dev_release().  Change the name
of rbd_dev_set_mapping() to of rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:51 -07:00
Alex Elder
05a46afdc7 rbd: encapsulate removing parent devices
Encapsulate the code that removes an rbd device's parent images into
a new function, rbd_dev_remove_parent().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:50 -07:00
Alex Elder
124afba25d rbd: encapsulate probing for parent devices
Encapsulate the code that probes for an rbd device's parent images
into a new function, rbd_dev_probe_parent().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:49 -07:00
Alex Elder
b5156e76da rbd: defer setting disk capacity
Don't set the disk capacity until right before we announce the
device as available for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:48 -07:00
Alex Elder
129b79d449 rbd: only set device exists flag when ready
Hold off setting the EXISTS rbd device flag until just before we
announce the disk as available for use.  There's no point in doing
so any earlier than that, and at that point the device truly is
fully set up and ready to use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:47 -07:00
Alex Elder
fc71d8330e rbd: fix up some sysfs stuff
This just tweaks a few things in the routines that implement
rbd sysfs files.

All of the entries for an rbd device in /sys/bus/rbd/devices/<id>/
will represent information whose valid values are known by the time
they are accessible.

Right now we get the size of the mapped image by a call to
get_capacity().  There's no need to do this, because that will
return what we last set the capacity to, which is just the size
recorded for the mapping.  So just show that value instead.

We also get this under protection of the header semaphore, in order
to provide a precisely correct value.  This isn't really necessary;
these files are really informational only and it's not necessary to
be so careful.

Finally, print a special value in case the major device number is
not recorded.  Right now that won't matter much but soon the parent
images won't have devices associated with them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:46 -07:00
Alex Elder
e28626a08b rbd: fix a bug in resizing a mapping
When a snapshot context update occurs, rbd_update_mapping_size() is
called to set the capacity of the disk to record the updated
size of the image in case it has changed.

There's a bug though.  The mapping size is in units of *bytes*.  The
code that updates the mapping size field is assigning a value that
has been scaled down to *sectors*.

Fix that.  Also, check to see if the size has actually changed, and
don't bother updating things (specifically, calling set_capacity())
if it has not.

This resolves:
    http://tracker.ceph.com/issues/4833

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:45 -07:00
Alex Elder
2e9f7f1c0d rbd: refactor rbd_dev_probe_update_spec()
Fairly straightforward refactoring of rbd_dev_probe_update_spec().
The name is changed to rbd_dev_spec_update().

Rearrange it so nothing gets assigned to the spec until all of the
names have been successfully acquired.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:44 -07:00
Alex Elder
71f293e26e rbd: rename rbd_dev_probe()
Rename rbd_dev_probe() to be rbd_dev_image_probe().  Its purpose
will eventually be to probe for the existence of a valid rbd image
for the rbd device--focusing only on the ceph side and not the Linux
device side of initialization.

For now the two "sides" are not fully separated, and this function
is still the entry point for initializing the full rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:43 -07:00
Alex Elder
9f5dffdc8f rbd: make rbd_dev_destroy() match rbd_dev_create()
Currently, rbd_dev_destroy() does more than just the inverse of what
rbd_dev_create() does.  Stop doing that, and move the two extra
things it does into the three call sites.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:42 -07:00
Alex Elder
468521c1b1 rbd: define rbd snap context routines
Encapsulate the creation of a snapshot context for rbd in a new
function rbd_snap_context_create().  Define rbd wrappers for getting
and dropping references to them once they're created.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:41 -07:00
Alex Elder
c0cd10db46 rbd: use rbd_warn(), not WARN_ON()
Change some calls to WARN_ON() so they use rbd_warn() instead, so we
get consistent messaging.  A few remain but they can probably just
go away eventually.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:40 -07:00
Alex Elder
500d0c0fbb rbd: move stripe_unit and stripe_count into header
This commit added fetching if fancy striping parameters:
    09186ddb rbd: get and check striping parameters

They are almost unused, but the two fields storing the information
really belonged in the rbd_image_header structure.

This patch moves them there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:39 -07:00
Alex Elder
ecb4dc2256 rbd: make rbd spec names pointer to const
Make the names and image id in an rbd_spec be pointers to constant
data.  This required the use of a local variable to hold the
snapshot name in rbd_add_parse_args() to avoid a warning.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:37 -07:00
Alex Elder
e1d4213f09 rbd: set snapshot id in rbd_dev_probe_update_spec()
Set the rbd spec's snapshot id for an image getting mapped in
rbd_dev_probe_update_spec() rather than rbd_dev_set_mapping().
This is the more logical place for that to happen (even though
it means we might look up the snapshot by name twice).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:36 -07:00
Alex Elder
8b0241f85a rbd: have snap_by_name() return a snapshot
A function called snap_by_name() ought to just look up a snapshot by
name.  It does that, but then it assigns some stuff to the rbd
device structure as well.

Change the function to do just the lookup, and have the caller do
the assignments that follow.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:35 -07:00
Alex Elder
5655c4d940 rbd: fix image id leak in initial probe
If a format 2 image id is found for an image being mapped, but the
subsequent probe of the image fails, rbd_dev_probe() quits without
freeing the image id.  Fix that.

Also drop a redundant hunk of code in rbd_dev_image_id().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:34 -07:00
Alex Elder
c0fba36880 rbd: have rbd_dev_image_id() set format 1 image id
Currently, rbd_dev_probe() assumes that any error returned by
rbd_dev_image_id() is most likely -ENOENT, and responds by
calling the format 1 probe routine, rbd_dev_v1_probe().  Then,
at the top of rbd_dev_v1_probe(), an empty string is allocated
for the image id.

This is sort of unbalanced.  Fix this by having rbd_dev_image_id()
look for -ENOENT from its "get_id" method call.  If that is seen,
have it allocate the empty string there rather than depending on
rbd_dev_v1_probe() to do it.

Given that this is effectively defining the format of the image,
set rbd_dev->image_format inside rbd_dev_image_id() rather than in
the format-specific probe routines.

Also drop a redundant hunk of code in rbd_dev_image_id().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:33 -07:00
Alex Elder
a0cab92432 rbd: avoid dropping extra reference in rbd_free_disk()
I found during some failure injection testing that the call to
rbd_free_disk() in the error path of rbd_dev_probe_finish() was
dropping an extra reference to the disk queue.  The problem
occurred when put_disk tried to drop a reference to the disk's
queue.  A call to blk_cleanup_queue() just prior to that will have
also dropped a reference to the queue.

The problem is that the reference dropped by put_disk() is assumed
to have been taken by add_disk().  Our code has error paths that can
occur after the disk and its queue are initialized, but before the
call to add_disk(), and in those paths we won't have that extra
reference.

The fix is easy though.  In rbd_free_disk() we're already checking
the disk's GENHD_FL_UP flag.  That flag is an indication that
add_disk() has been called, so just call blk_cleanup_queue()
conditional on that flag being set.

This resolves:
    http://tracker.ceph.com/issues/4800

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:32 -07:00
Alex Elder
f40eb349e0 rbd: use rbd_obj_method_sync() return value
Now that rbd_obj_method_sync() returns the number of bytes
returned by the method call, that value should be used by
callers to ensure we don't overrun the valid portion of the
buffer.

Fix the two spots that remained that weren't doing that,
rbd_dev_image_name() and rbd_dev_v2_snap_name().

Rearrange the error path slightly in rbd_dev_v2_snap_name().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:31 -07:00
Alex Elder
6e584f5244 rbd: fix leak of format 2 snapshot names
When the snapshot context for an rbd device gets updated (or the
initial one is recorded) a a list of snapshot structures is created
to represent them, one entry per snapshot.  Each entry includes a
dynamically-allocated copy of the snapshot name.

Currently the name is allocated in rbd_snap_create(), as a duplicate
of the passed-in name.

For format 1 images, the snapshot name provided is just a pointer to
an existing name.  But for format 2 images, the passed-in name is
already dynamically allocated, and in the the process of duplicating
it here we are leaking the passed-in name.

Fix this by dynamically allocating the name for format 1 snapshots
also, and then stop allocating a duplicate in rbd_snap_create().

Change rbd_dev_v1_snap_info() so none of its parameters is
side-effected unless it's going to return success.

This is part of:
    http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:30 -07:00
Alex Elder
6087b51b9e rbd: rename __rbd_add_snap_dev()
Rename __rbd_add_snap_dev() to be rbd_snap_create().  We no longer
have devices for non-mapped snapshots, and we're not actually
"adding" it to the list in this function, just creating it.

Rename rbd_remove_snap_dev() to be rbd_snap_destroy() for reasons
similar to the above.  Stop having this function delete the snapshot
from its list (to be symmetrical with its create counterpart) and do
that in the caller instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:29 -07:00
Alex Elder
acb1b6caf1 rbd: only update values on snap_info success
Change rbd_dev_v2_snap_info() so it only ever sets values of the
size and features parameters if looking up the snapshot name was
successful.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:28 -07:00
Alex Elder
c86f86e9e7 rbd: make snap_size order parameter optional
Only one of the two callers of _rbd_dev_v2_snap_size() needs the
order value returned.  So make that an optional argument--a null
pointer if the caller doesn't need it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:27 -07:00