linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 11:18:45 +07:00

Author	SHA1	Message	Date
Christoph Hellwig	37f19e57a0	block: merge __bio_map_user_iov into bio_map_user_iov And also remove the unused bdev argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-05 09:30:43 -07:00
Christoph Hellwig	75c72b8366	block: merge __bio_map_kern into bio_map_kern This saves a little code, and allow to simplify the error handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-05 09:30:42 -07:00
Kent Overstreet	26e49cfc7e	block: pass iov_iter to the BLOCK_PC mapping functions Make use of a new interface provided by iov_iter, backed by scatter-gather list of iovec, instead of the old interface based on sg_iovec. Also use iov_iter_advance() instead of manual iteration. This commit should contain only literal replacements, without functional changes. Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com> [hch: fixed to do a deep clone of the iov_iter, and to properly use the iov_iter direction] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-05 09:30:40 -07:00
Christoph Hellwig	1dfa0f68c0	block: add a helper to free bio bounce buffer pages The code sniplet to walk all bio_vecs and free their pages is opencoded in way to many places, so factor it into a helper. Also convert the slightly more complex cases in bio_kern_endio and __bio_copy_iov where we break the freeing from an existing loop into a separate one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-05 09:30:39 -07:00
Christoph Hellwig	ddad8dd0a1	block: use blk_rq_map_user_iov to implement blk_rq_map_user Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-05 09:30:37 -07:00
Christoph Hellwig	42d2683a27	block: simplify bio_map_kern Just open code the trivial mapping from a kernel virtual address to a bio instead of going through the complex user address mapping machinery. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-05 09:30:35 -07:00
Peter Zijlstra	2c56124652	block: Simplify bsg complete all It took me a few tries to figure out what this code did; lets rewrite it into a more regular form. The thing that makes this one 'special' is the BSG_F_BLOCK flag, if that is not set we're not supposed/allowed to block and should spin wait for completion. The (new) io_wait_event() will never see a false condition in case of the spinning and we will therefore not block. Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-04 09:57:52 -07:00
Ingo Molnar	3c01b74e81	* Move efivarfs from the misc filesystem section to pseudo filesystem, since that's a more logical and accurate place - Leif Lindholm * Update efibootmgr URL in Kconfig help - Peter Jones * Improve accuracy of EFI guid function names - Borislav Petkov * Expose firmware platform size in sysfs for the benefit of EFI boot loader installers and other utilities - Steve McIntyre * Cleanup __init annotations for arm64/efi code - Ard Biesheuvel * Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel * Fix memory leak in error code path of runtime map code - Dan Carpenter * Improve robustness of get_memory_map() by removing assumptions on the size of efi_memory_desc_t (which could change in future spec versions) and querying the firmware instead of guessing about the memmap size - Ard Biesheuvel * Remove superfluous guid unparse calls - Ivan Khoronzhuk * Delete unnecessary chosen@0 DT node FDT code since was duplicated from code in drivers/of and is entirely unnecessary - Leif Lindholm -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUv69oAAoJEC84WcCNIz1VEYgP/1b27WRfCXs4q/8FP+UheSDS nAFbGe9PjVPnxo5pA9VwPP6eNQ2zYiyNGEK1BlbQlFPZdSD1updIraA78CiF5iys iSYyG9xVIcTB23RZI8aJLnBXbosIUKPJZ3FORv1LPhI6Mz1rCpraEaaUlv67rUKr FLBG9cR7t9f/f+fJw6LOAAISGIG/4s0wQdA5/noaYkj5R5bICl2UTGtbwa0oNstb NUO93aKDgaG/VljpIEeG6XV96Ioz7cHjQsEaX8sTrvT0n7nPNIqSDjFJOqWKJOXl RsFrzyl8fFIbMuQatYv1f3efPvyH+iKOfHnHrvcjUNje0xhm7F0Bd86BkOw1a3JQ pNb0YUWecI0Z/8GSzN8X0JQ7cowa3wI15Z/Hfs03odTXiM6VqwFAhuz/s5DEUdKS U+rOPjU0ezt3G4oBB/VGgF9w5JWKfsMcsHgmLX9P+JYzKFrxggo1SXAtXUeRAqQp agKmUB+k6Y1baQO8efkoM7rKL2F0q1SR9QiK+16BHCCkevD23v7IFGrHm2r1xKil kvWlY4MkRVa4KGPxEFEDVty0HjXxImwYsxTaYVHTS7SMeoP41f6koHKB19NaB3No 5fqn/rT1KcJuhQj/I+vAixIX4WMJkX/MQVbtKfqSaKlAiRg3eRY6ONYr0jOglfF6 gaMuvmDd0HlV6UJvH/9L =iPpM -----END PGP SIGNATURE----- Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi Pull EFI updates from Matt Fleming: " - Move efivarfs from the misc filesystem section to pseudo filesystem, since that's a more logical and accurate place - Leif Lindholm - Update efibootmgr URL in Kconfig help - Peter Jones - Improve accuracy of EFI guid function names - Borislav Petkov - Expose firmware platform size in sysfs for the benefit of EFI boot loader installers and other utilities - Steve McIntyre - Cleanup __init annotations for arm64/efi code - Ard Biesheuvel - Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel - Fix memory leak in error code path of runtime map code - Dan Carpenter - Improve robustness of get_memory_map() by removing assumptions on the size of efi_memory_desc_t (which could change in future spec versions) and querying the firmware instead of guessing about the memmap size - Ard Biesheuvel - Remove superfluous guid unparse calls - Ivan Khoronzhuk - Delete unnecessary chosen@0 DT node FDT code since was duplicated from code in drivers/of and is entirely unnecessary - Leif Lindholm There's nothing super scary, mainly cleanups, and a merge from Ricardo who kindly picked up some patches from the linux-efi mailing list while I was out on annual leave in December. Perhaps the biggest risk is the get_memory_map() change from Ard, which changes the way that both the arm64 and x86 EFI boot stub build the early memory map. It would be good to have it bake in linux-next for a while. " Signed-off-by: Ingo Molnar <mingo@kernel.org>	2015-01-29 19:16:40 +01:00
Ming Lei	e09aae7ede	blk-mq: release mq's kobjects in blk_release_queue() The kobject memory inside blk-mq hctx/ctx shouldn't have been freed before the kobject is released because driver core can access it freely before its release. We can't do that in all ctx/hctx/mq_kobj's release handler because it can be run before blk_cleanup_queue(). Given mq_kobj shouldn't have been introduced, this patch simply moves mq's release into blk_release_queue(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-29 08:30:51 -08:00
Ming Lei	74170118b2	Revert "blk-mq: fix hctx/ctx kobject use-after-free" This reverts commit `76d697d107`. The commit `76d697d107` causes general protection fault reported from Bart Van Assche: https://lkml.org/lkml/2015/1/28/334 Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-29 08:30:49 -08:00
Keith Busch	77a0868901	block: keep established cmd_flags when cloning into a blk-mq request blk_mq_alloc_request() may establish REQ_MQ_INFLIGHT in addition to incrementing the hctx->nr_active count. Any cmd_flags that are established in the newly allocated clone request must be preserved in addition to the cmd_flags that are later copied over from the original request as part of blk_rq_prep_clone(). Otherwise, if REQ_MQ_INFLIGHT isn't set in the clone request the hctx->nr_active count won't get decremented via blk_mq_free_request(). The only consumer of blk_rq_prep_clone() is request-based DM, which uses blk_rq_init() prior to calling blk_rq_prep_clone() for the non-blk-mq case. Given the cloned request's cmd_flags will be 0 it is safe to OR them with the original request's cmd_flags for both the non-blk-mq and blk-mq cases. Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-28 09:44:15 -07:00
Keith Busch	7fb4898e0c	block: add blk-mq support to blk_insert_cloned_request() If the request passed to blk_insert_cloned_request() was allocated by a blk-mq device it must be submitted using blk_mq_insert_request(). Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-28 09:44:13 -07:00
Keith Busch	febf71588c	block: require blk_rq_prep_clone() be given an initialized clone request Prepare to allow blk_rq_prep_clone() to accept clone requests that were allocated from blk-mq request queues. As such the blk_rq_prep_clone() caller must first initialize the clone request. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-28 09:44:11 -07:00
Shaohua Li	24391c0dc5	blk-mq: add tag allocation policy This is the blk-mq part to support tag allocation policy. The default allocation policy isn't changed (though it's not a strict FIFO). The new policy is round-robin for libata. But it's a try-best implementation. If multiple tasks are competing, the tags returned will be mixed (which is unavoidable even with !mq, as requests from different tasks can be mixed in queue) Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-23 14:18:00 -07:00
Shaohua Li	ee1b6f7aff	block: support different tag allocation policy The libata tag allocation is using a round-robin policy. Next patch will make libata use block generic tag allocation, so let's add a policy to tag allocation. Currently two policies: FIFO (default) and round-robin. Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-23 14:15:46 -07:00
Boaz Harrosh	bb5c3cdda3	block: Remove annoying "unknown partition table" message As Christoph put it: Can we just get rid of the warnings? It's fairly annoying as devices without partitions are perfectly fine and very useful. Me too I see this message every VM boot for ages on all my devices. Would love to just remove it. For me a partition-table is only needed for a booting BIOS, grub, and stuff. CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-22 08:03:52 -07:00
Martin K. Petersen	d93ba7a5a9	block: Add discard flag to blkdev_issue_zeroout() function blkdev_issue_discard() will zero a given block range. This is done by way of explicit writing, thus provisioning or allocating the blocks on disk. There are use cases where the desired behavior is to zero the blocks but unprovision them if possible. The blocks must deterministically contain zeroes when they are subsequently read back. This patch adds a flag to blkdev_issue_zeroout() that provides this variant. If the discard flag is set and a block device guarantees discard_zeroes_data we will use REQ_DISCARD to clear the block range. If the device does not support discard_zeroes_data or if the discard request fails we will fall back to first REQ_WRITE_SAME and then a regular REQ_WRITE. Also update the callers of blkdev_issue_zero() to reflect the new flag and make sb_issue_zeroout() prefer the discard approach. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-21 10:41:46 -07:00
Jeff Moyer	c6ce194325	cfq-iosched: fix incorrect filing of rt async cfqq Hi, If you can manage to submit an async write as the first async I/O from the context of a process with realtime scheduling priority, then a cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It ends up in the best effort array, but actually has realtime I/O scheduling priority set in cfqq->ioprio. The reason is that cfq_get_queue assumes the default scheduling class and priority when there is no information present (i.e. when the async cfqq is created): static struct cfq_queue * cfq_get_queue(struct cfq_data cfqd, bool is_sync, struct cfq_io_cq cic, struct bio bio, gfp_t gfp_mask) { const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio); const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio); cic->ioprio starts out as 0, which is "invalid". So, class of 0 (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so: async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio); static struct cfq_queue * cfq_async_queue_prio(struct cfq_data cfqd, int ioprio_class, int ioprio) { switch (ioprio_class) { case IOPRIO_CLASS_RT: return &cfqd->async_cfqq[0][ioprio]; case IOPRIO_CLASS_NONE: ioprio = IOPRIO_NORM; / fall through / case IOPRIO_CLASS_BE: return &cfqd->async_cfqq[1][ioprio]; case IOPRIO_CLASS_IDLE: return &cfqd->async_idle_cfqq; default: BUG(); } } Here, instead of returning a class mapped from the process' scheduling priority, we get back the bucket associated with IOPRIO_CLASS_BE. Now, there is no queue allocated there yet, so we create it: cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask); That function ends up doing this: cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync); cfq_init_prio_data(cfqq, cic); cfq_init_cfqq marks the priority as having changed. Then, cfq_init_prio data does this: ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio); switch (ioprio_class) { default: printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class); case IOPRIO_CLASS_NONE: / * no prio set, inherit CPU scheduling settings */ cfqq->ioprio = task_nice_ioprio(tsk); cfqq->ioprio_class = task_nice_ioclass(tsk); break; So we basically have two code paths that treat IOPRIO_CLASS_NONE differently, which results in an RT async cfqq filed into a best effort bucket. Attached is a patch which fixes the problem. I'm not sure how to make it cleaner. Suggestions would be welcome. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-21 10:38:30 -07:00
Christoph Hellwig	b4caecd480	fs: introduce f_op->mmap_capabilities for nommu mmap support Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-20 14:02:58 -07:00
Ming Lei	76d697d107	blk-mq: fix hctx/ctx kobject use-after-free The kobject memory shouldn't have been freed before the kobject is released because driver core can access it freely before its release. This patch frees hctx in its release callback. For ctx, they share one single per-cpu variable which is associated with the request queue, so free ctx in q->mq_kobj's release handler. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> (fix ctx kobjects) Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-20 09:28:33 -07:00
Jens Axboe	0bf364984c	blk-mq: fix false negative out-of-tags condition The blk-mq tagging tries to maintain some locality between CPUs and the tags issued. The tags are split into groups of words, and the words may not be fully populated. When searching for a new free tag, blk-mq may look at partial words, hence it passes in an offset/size to find_next_zero_bit(). However, it does that wrong, the size must always be the full length of the number of tags in that word, otherwise we'll potentially miss some near the end. Another issue is when __bt_get() goes from one word set to the next. It bumps the index, but not the last_tag associated with the previous index. Bump that to be in the range of the new word. Finally, clean up __bt_get() and __bt_get_word() a bit and get rid of the goto in there, and the unnecessary 'wrap' variable. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-14 08:49:55 -07:00
Keith Busch	eb130dbfc4	blk-mq: End unstarted requests on a dying queue Requests that haven't been started prior to a queue dying can be ended in error without waiting for them to start and time out. Signed-off-by: Keith Busch <keith.busch@intel.com> Added code comment to explain why this is done. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 08:59:53 -07:00
Keith Busch	5b3f25fc34	blk-mq: Allow requests to never expire Some types of requests may be started that are not gauranteed to ever complete. This adds a request flag that a driver can use so mark the request as such. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 08:59:01 -07:00
Jens Axboe	1885b24d23	blk-mq: Add helper to abort requeued requests Adds a helper function a driver can use to abort requeued requests in case any are pending when h/w queues are being removed. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 08:55:53 -07:00
Keith Busch	c68ed59f53	blk-mq: Let drivers cancel requeue_work Kicking requeued requests will start h/w queues in a work_queue, which may alter the driver's requested state to temporarily stop them. This patch exports a method to cancel the q->requeue_work so a driver can be assured stopped h/w queues won't be started up before it is ready. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 08:55:40 -07:00
Keith Busch	973c01919b	blk-mq: Export if requests were started Drivers can iterate over all allocated request tags, but their callback needs a way to know if the driver started the request in the first place. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 08:55:27 -07:00
Keith Busch	3fd5940cb2	blk-mq: Wake tasks entering queue on dying When the queue is set to dying, wake up tasks that are waiting on frozen queue so they realize it is dying and abandon their request. Signed-off-by: Keith Busch <keith.busch@intel.com> Modified by me to add a code comment on the need for the wakeup. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 08:53:56 -07:00
Borislav Petkov	26e022727f	efi: Rename efi_guid_unparse to efi_guid_to_str Call it what it does - "unparse" is plain-misleading. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>	2015-01-07 19:07:44 -08:00
Jens Axboe	17ded32070	blk-mq: get rid of ->cmd_size in the hardware queue We store it in the tag set, we don't need it in the hardware queue. While removing cmd_size, place ->queue_num further down to avoid a hole on 64-bit archs. It's not used in any fast paths, so we can safely move it. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-07 10:44:04 -07:00
Jens Axboe	c761d96b07	blk-mq: export blk_mq_freeze_queue() Commit `b4c6a02877` exported the start and unfreeze, but we need the regular blk_mq_freeze_queue() for the loop conversion. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:05:12 -07:00
Jens Axboe	aed3ea94bd	block: wake up waiters when a queue is marked dying If it's dying, we can't expect new request to complete and come in an wake up other tasks waiting for requests. So after we have marked it as dying, wake up everybody currently waiting for a request. Once they wake, they will retry their allocation and fail appropriately due to the state of the queue. Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-31 09:39:16 -07:00
Keith Busch	b4c6a02877	blk-mq: Export freeze/unfreeze functions Let drivers prevent entering a queue that isn't available. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-20 10:34:15 -07:00
Keith Busch	c76541a932	blk-mq: Exit queue on alloc failure Fixes usage counter when a request could not be allocated. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-20 10:33:53 -07:00
Jens Axboe	35d37c6635	Revert "blk-mq: Micro-optimize bt_get()" This reverts commit `52f7eb945f`. The optimization is only really safe for a single queue, otherwise 'bs' and 'bt' can indeed change, and if we don't do a finish_wait() for each loop, we'll potentially change the wait structure and corrupt task wait list. Reported-by: Jan Kara <jack@suse.cz>	2014-12-15 08:30:26 -07:00
Linus Torvalds	caf292ae5b	Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block Pull block driver core update from Jens Axboe: "This is the pull request for the core block IO changes for 3.19. Not a huge round this time, mostly lots of little good fixes: - Fix a bug in sysfs blktrace interface causing a NULL pointer dereference, when enabled/disabled through that API. From Arianna Avanzini. - Various updates/fixes/improvements for blk-mq: - A set of updates from Bart, mostly fixing buts in the tag handling. - Cleanup/code consolidation from Christoph. - Extend queue_rq API to be able to handle batching issues of IO requests. NVMe will utilize this shortly. From me. - A few tag and request handling updates from me. - Cleanup of the preempt handling for running queues from Paolo. - Prevent running of unmapped hardware queues from Ming Lei. - Move the kdump memory limiting check to be in the correct location, from Shaohua. - Initialize all software queues at init time from Takashi. This prevents a kobject warning when CPUs are brought online that weren't online when a queue was registered. - Single writeback fix for I_DIRTY clearing from Tejun. Queued with the core IO changes, since it's just a single fix. - Version X of the __bio_add_page() segment addition retry from Maurizio. Hope the Xth time is the charm. - Documentation fixup for IO scheduler merging from Jan. - Introduce (and use) generic IO stat accounting helpers for non-rq drivers, from Gu Zheng. - Kill off artificial limiting of max sectors in a request from Christoph" * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits) bio: modify __bio_add_page() to accept pages that don't start a new segment blk-mq: Fix uninitialized kobject at CPU hotplugging blktrace: don't let the sysfs interface remove trace from running list blk-mq: Use all available hardware queues blk-mq: Micro-optimize bt_get() blk-mq: Fix a race between bt_clear_tag() and bt_get() blk-mq: Avoid that __bt_get_word() wraps multiple times blk-mq: Fix a use-after-free blk-mq: prevent unmapped hw queue from being scheduled blk-mq: re-check for available tags after running the hardware queue blk-mq: fix hang in bt_get() blk-mq: move the kdump check to blk_mq_alloc_tag_set blk-mq: cleanup tag free handling blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map blk: introduce generic io stat accounting help function blk-mq: handle the single queue case in blk_mq_hctx_next_cpu genhd: check for int overflow in disk_expand_part_tbl() blk-mq: add blk_mq_free_hctx_request() blk-mq: export blk_mq_free_request() blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable ...	2014-12-13 14:14:23 -08:00
Maurizio Lombardi	fcbf6a087a	bio: modify __bio_add_page() to accept pages that don't start a new segment The original behaviour is to refuse to add a new page if the maximum number of segments has been reached, regardless of the fact the page we are going to add can be merged into the last segment or not. Unfortunately, when the system runs under heavy memory fragmentation conditions, a driver may try to add multiple pages to the last segment. The original code won't accept them and EBUSY will be reported to userspace. This patch modifies the function so it refuses to add a page only in case the latter starts a new segment and the maximum number of segments has already been reached. The bug can be easily reproduced with the st driver: 1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16 2) modprobe st buffer_kbs=1024 3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10 dd: error writing `/dev/st0': Device or resource busy Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jet Chen <jet.chen@intel.com> Cc: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-11 09:11:52 -07:00
Linus Torvalds	92a578b064	ACPI and power management updates for 3.19-rc1 This time we have some more new material than we used to have during the last couple of development cycles. The most important part of it to me is the introduction of a unified interface for accessing device properties provided by platform firmware. It works with Device Trees and ACPI in a uniform way and drivers using it need not worry about where the properties come from as long as the platform firmware (either DT or ACPI) makes them available. It covers both devices and "bare" device node objects without struct device representation as that turns out to be necessary in some cases. This has been in the works for quite a few months (and development cycles) and has been approved by all of the relevant maintainers. On top of that, some drivers are switched over to the new interface (at25, leds-gpio, gpio_keys_polled) and some additional changes are made to the core GPIO subsystem to allow device drivers to manipulate GPIOs in the "canonical" way on platforms that provide GPIO information in their ACPI tables, but don't assign names to GPIO lines (in which case the driver needs to do that on the basis of what it knows about the device in question). That also has been approved by the GPIO core maintainers and the rfkill driver is now going to use it. Second is support for hardware P-states in the intel_pstate driver. It uses CPUID to detect whether or not the feature is supported by the processor in which case it will be enabled by default. However, it can be disabled entirely from the kernel command line if necessary. Next is support for a platform firmware interface based on ACPI operation regions used by the PMIC (Power Management Integrated Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms. That interface is used for manipulating power resources and for thermal management: sensor temperature reporting, trip point setting and so on. Also the ACPI core is now going to support the _DEP configuration information in a limited way. Basically, _DEP it supposed to reflect off-the-hierarchy dependencies between devices which may be very indirect, like when AML for one device accesses locations in an operation region handled by another device's driver (usually, the device depended on this way is a serial bus or GPIO controller). The support added this time is sufficient to make the ACPI battery driver work on Asus T100A, but it is general enough to be able to cover some other use cases in the future. Finally, we have a new cpufreq driver for the Loongson1B processor. In addition to the above, there are fixes and cleanups all over the place as usual and a traditional ACPICA update to a recent upstream release. As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for Intel platforms should be able to handle power management of the DMA engine correctly, the cpufreq-dt driver should interact with the thermal subsystem in a better way and the ACPI backlight driver should handle some more corner cases, among other things. On top of the ACPICA update there are fixes for race conditions in the ACPICA's interrupt handling code which might lead to some random and strange looking failures on some systems. In the cleanups department the most visible part is the series of commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration option. That was triggered by a discussion regarding the generic power domains code during which we realized that trying to support certain combinations of PM config options was painful and not really worth it, because nobody would use them in production anyway. For this reason, we decided to make CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that lead to the conclusion that the latter became redundant and CONFIG_PM could be used instead of it. The material here makes that replacement in a major part of the tree, but there will be at least one more batch of that in the second part of the merge window. Specifics: - Support for retrieving device properties information from ACPI _DSD device configuration objects and a unified device properties interface for device drivers (and subsystems) on top of that. As stated above, this works with Device Trees and ACPI and allows device drivers to be written in a platform firmware (DT or ACPI) agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are now going to use this new interface and the GPIO subsystem is additionally modified to allow device drivers to assign names to GPIO resources returned by ACPI _CRS objects (in case _DSD is not present or does not provide the expected data). The changes in this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam, Geert Uytterhoeven). - Support for Hardware Managed Performance States (HWP) as described in Volume 3, section 14.4, of the Intel SDM in the intel_pstate driver. CPUID is used to detect whether or not the feature is supported by the processor. If supported, it will be enabled automatically unless the intel_pstate=no_hwp switch is present in the kernel command line. From Dirk Brandewie. - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie). - Support for firmware interface based on ACPI operation regions used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR platforms for power resource control and thermal management (Aaron Lu). - Limited support for retrieving off-the-hierarchy dependencies between devices from ACPI _DEP device configuration objects and deferred probing support for the ACPI battery driver based on the _DEP information to make that driver work on Asus T100A (Lan Tianyu). - New cpufreq driver for the Loongson1B processor (Kelvin Cheung). - ACPICA update to upstream revision 20141107 which only affects tools (Bob Moore). - Fixes for race conditions in the ACPICA's interrupt handling code and in the ACPI code related to system suspend and resume (Lv Zheng and Rafael J Wysocki). - ACPI core fix for an RCU-related issue in the ioremap() regions management code that slowed down significantly after CPUs had been allowed to enter idle states even if they'd had RCU callbakcs queued and triggered some problems in certain proprietary graphics driver (and elsewhere). The fix replaces synchronize_rcu() in that code with synchronize_rcu_expedited() which makes the issue go away. From Konstantin Khlebnikov. - ACPI LPSS (Low-Power Subsystem) driver fix to handle power management of the DMA engine included into the LPSS correctly. The problem is that the DMA engine doesn't have ACPI PM support of its own and it simply is turned off when the last LPSS device having ACPI PM support goes into D3cold. To work around that, the PM domain used by the ACPI LPSS driver is redesigned so at least one device with ACPI PM support will be on as long as the DMA engine is in use. From Andy Shevchenko. - ACPI backlight driver fix to avoid using it on "Win8-compatible" systems where it doesn't work and where it was used by default by mistake (Aaron Lu). - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki, Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin Chaugule (mostly related to the upcoming ARM64 support). - Intel RAPL (Running Average Power Limit) power capping driver fixes and improvements including new processor IDs (Jacob Pan). - Generic power domains modification to power up domains after attaching devices to them to meet the expectations of device drivers and bus types assuming devices to be accessible at probe time (Ulf Hansson). - Preliminary support for controlling device clocks from the generic power domains core code and modifications of the ARM/shmobile platform to use that feature (Ulf Hansson). - Assorted minor fixes and cleanups of the generic power domains core code (Ulf Hansson, Geert Uytterhoeven). - Assorted minor fixes and cleanups of the device clocks control code in the PM core (Geert Uytterhoeven, Grygorii Strashko). - Consolidation of device power management Kconfig options by making CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter which is now redundant (Rafael J Wysocki and Kevin Hilman). That is the first batch of the changes needed for this purpose. - Core device runtime power management support code cleanup related to the execution of callbacks (Andrzej Hajda). - cpuidle ARM support improvements (Lorenzo Pieralisi). - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and Bartlomiej Zolnierkiewicz). - New cpufreq driver callback (->ready) to be executed when the cpufreq core is ready to use a given policy object and cpufreq-dt driver modification to use that callback for cooling device registration (Viresh Kumar). - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James Geboski, Tomeu Vizoso). - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate, cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao, Stefan Wahren, Petr Cvek). - OPP (Operating Performance Points) framework modification to allow OPPs to be removed too and update of a few cpufreq drivers (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added during initialization) on driver removal (Viresh Kumar). - Hibernation core fixes and cleanups (Tina Ruchandani and Markus Elfring). - PM Kconfig fix related to CPU power management (Pankaj Dubey). - cpupower tool fix (Prarit Bhargava). / -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABCAAGBQJUhj6JAAoJEILEb/54YlRxTM4P/j5g5SfqvY0QKsn7sR7MGZ6v nsgCBhJAqTw3ocNC7EAs8z9h2GWy1KbKpakKYWAh9Fs1yZoey7tFSlcv/Rgjlp70 uU5sDQHtpE9mHKiymdsowiQuWgpl962L4k+k8hUslhlvgk1PvVbpajR6OqG8G+pD asuIW9eh1APNkLyXmRJ3ZPomzs0VmRdZJ0NEs0lKX9mJskqEvxPIwdaxq3iaJq9B Fo0J345zUDcJnxWblDRdHlOigCimglElfN5qJwaC4KpwUKuBvLRKbp4f69+wfT0c kYFiR29X5KjJ2kLfP/wKsLyuDCYYXRq3tCia5M1tAqOjZ+UA89H/GDftx/5lntmv qUlBa35VfdS1SX4HyApZitOHiLgo+It/hl8Z9bJnhyVw66NxmMQ8JYN2imb8Lhqh XCLR7BxLTah82AapLJuQ0ZDHPzZqMPG2veC2vAzRMYzVijict/p4Y2+qBqONltER 4rs9uRVn+hamX33lCLg8BEN8zqlnT3rJFIgGaKjq/wXHAU/zpE9CjOrKMQcAg9+s t51XMNPwypHMAYyGVhEL89ImjXnXxBkLRuquhlmEpvQchIhR+mR3dLsarGn7da44 WPIQJXzcsojXczcwwfqsJCR4I1FTFyQIW+UNh02GkDRgRovQqo+Jk762U7vQwqH+ LBdhvVaS1VW4v+FWXEoZ =5dox -----END PGP SIGNATURE----- Merge tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "This time we have some more new material than we used to have during the last couple of development cycles. The most important part of it to me is the introduction of a unified interface for accessing device properties provided by platform firmware. It works with Device Trees and ACPI in a uniform way and drivers using it need not worry about where the properties come from as long as the platform firmware (either DT or ACPI) makes them available. It covers both devices and "bare" device node objects without struct device representation as that turns out to be necessary in some cases. This has been in the works for quite a few months (and development cycles) and has been approved by all of the relevant maintainers. On top of that, some drivers are switched over to the new interface (at25, leds-gpio, gpio_keys_polled) and some additional changes are made to the core GPIO subsystem to allow device drivers to manipulate GPIOs in the "canonical" way on platforms that provide GPIO information in their ACPI tables, but don't assign names to GPIO lines (in which case the driver needs to do that on the basis of what it knows about the device in question). That also has been approved by the GPIO core maintainers and the rfkill driver is now going to use it. Second is support for hardware P-states in the intel_pstate driver. It uses CPUID to detect whether or not the feature is supported by the processor in which case it will be enabled by default. However, it can be disabled entirely from the kernel command line if necessary. Next is support for a platform firmware interface based on ACPI operation regions used by the PMIC (Power Management Integrated Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms. That interface is used for manipulating power resources and for thermal management: sensor temperature reporting, trip point setting and so on. Also the ACPI core is now going to support the _DEP configuration information in a limited way. Basically, _DEP it supposed to reflect off-the-hierarchy dependencies between devices which may be very indirect, like when AML for one device accesses locations in an operation region handled by another device's driver (usually, the device depended on this way is a serial bus or GPIO controller). The support added this time is sufficient to make the ACPI battery driver work on Asus T100A, but it is general enough to be able to cover some other use cases in the future. Finally, we have a new cpufreq driver for the Loongson1B processor. In addition to the above, there are fixes and cleanups all over the place as usual and a traditional ACPICA update to a recent upstream release. As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for Intel platforms should be able to handle power management of the DMA engine correctly, the cpufreq-dt driver should interact with the thermal subsystem in a better way and the ACPI backlight driver should handle some more corner cases, among other things. On top of the ACPICA update there are fixes for race conditions in the ACPICA's interrupt handling code which might lead to some random and strange looking failures on some systems. In the cleanups department the most visible part is the series of commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration option. That was triggered by a discussion regarding the generic power domains code during which we realized that trying to support certain combinations of PM config options was painful and not really worth it, because nobody would use them in production anyway. For this reason, we decided to make CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that lead to the conclusion that the latter became redundant and CONFIG_PM could be used instead of it. The material here makes that replacement in a major part of the tree, but there will be at least one more batch of that in the second part of the merge window. Specifics: - Support for retrieving device properties information from ACPI _DSD device configuration objects and a unified device properties interface for device drivers (and subsystems) on top of that. As stated above, this works with Device Trees and ACPI and allows device drivers to be written in a platform firmware (DT or ACPI) agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are now going to use this new interface and the GPIO subsystem is additionally modified to allow device drivers to assign names to GPIO resources returned by ACPI _CRS objects (in case _DSD is not present or does not provide the expected data). The changes in this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam, Geert Uytterhoeven). - Support for Hardware Managed Performance States (HWP) as described in Volume 3, section 14.4, of the Intel SDM in the intel_pstate driver. CPUID is used to detect whether or not the feature is supported by the processor. If supported, it will be enabled automatically unless the intel_pstate=no_hwp switch is present in the kernel command line. From Dirk Brandewie. - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie). - Support for firmware interface based on ACPI operation regions used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR platforms for power resource control and thermal management (Aaron Lu). - Limited support for retrieving off-the-hierarchy dependencies between devices from ACPI _DEP device configuration objects and deferred probing support for the ACPI battery driver based on the _DEP information to make that driver work on Asus T100A (Lan Tianyu). - New cpufreq driver for the Loongson1B processor (Kelvin Cheung). - ACPICA update to upstream revision 20141107 which only affects tools (Bob Moore). - Fixes for race conditions in the ACPICA's interrupt handling code and in the ACPI code related to system suspend and resume (Lv Zheng and Rafael J Wysocki). - ACPI core fix for an RCU-related issue in the ioremap() regions management code that slowed down significantly after CPUs had been allowed to enter idle states even if they'd had RCU callbakcs queued and triggered some problems in certain proprietary graphics driver (and elsewhere). The fix replaces synchronize_rcu() in that code with synchronize_rcu_expedited() which makes the issue go away. From Konstantin Khlebnikov. - ACPI LPSS (Low-Power Subsystem) driver fix to handle power management of the DMA engine included into the LPSS correctly. The problem is that the DMA engine doesn't have ACPI PM support of its own and it simply is turned off when the last LPSS device having ACPI PM support goes into D3cold. To work around that, the PM domain used by the ACPI LPSS driver is redesigned so at least one device with ACPI PM support will be on as long as the DMA engine is in use. From Andy Shevchenko. - ACPI backlight driver fix to avoid using it on "Win8-compatible" systems where it doesn't work and where it was used by default by mistake (Aaron Lu). - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki, Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin Chaugule (mostly related to the upcoming ARM64 support). - Intel RAPL (Running Average Power Limit) power capping driver fixes and improvements including new processor IDs (Jacob Pan). - Generic power domains modification to power up domains after attaching devices to them to meet the expectations of device drivers and bus types assuming devices to be accessible at probe time (Ulf Hansson). - Preliminary support for controlling device clocks from the generic power domains core code and modifications of the ARM/shmobile platform to use that feature (Ulf Hansson). - Assorted minor fixes and cleanups of the generic power domains core code (Ulf Hansson, Geert Uytterhoeven). - Assorted minor fixes and cleanups of the device clocks control code in the PM core (Geert Uytterhoeven, Grygorii Strashko). - Consolidation of device power management Kconfig options by making CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter which is now redundant (Rafael J Wysocki and Kevin Hilman). That is the first batch of the changes needed for this purpose. - Core device runtime power management support code cleanup related to the execution of callbacks (Andrzej Hajda). - cpuidle ARM support improvements (Lorenzo Pieralisi). - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and Bartlomiej Zolnierkiewicz). - New cpufreq driver callback (->ready) to be executed when the cpufreq core is ready to use a given policy object and cpufreq-dt driver modification to use that callback for cooling device registration (Viresh Kumar). - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James Geboski, Tomeu Vizoso). - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate, cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao, Stefan Wahren, Petr Cvek). - OPP (Operating Performance Points) framework modification to allow OPPs to be removed too and update of a few cpufreq drivers (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added during initialization) on driver removal (Viresh Kumar). - Hibernation core fixes and cleanups (Tina Ruchandani and Markus Elfring). - PM Kconfig fix related to CPU power management (Pankaj Dubey). - cpupower tool fix (Prarit Bhargava)" * tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits) i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM tools: cpupower: fix return checks for sysfs_get_idlestate_count() drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM leds: leds-gpio: Fix multiple instances registration without 'label' property iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hwrandom / exynos / PM: Use CONFIG_PM in #ifdef block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM USB / PM: Drop CONFIG_PM_RUNTIME from the USB core PM: Merge the SET*_RUNTIME_PM_OPS() macros ...	2014-12-10 21:17:00 -08:00
Takashi Iwai	06a41a99d1	blk-mq: Fix uninitialized kobject at CPU hotplugging When a CPU is hotplugged, the current blk-mq spews a warning like: kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong. CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014 0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8 ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58 ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007 Call Trace: [<ffffffff81005306>] dump_trace+0x86/0x330 [<ffffffff81005644>] show_stack_log_lvl+0x94/0x170 [<ffffffff81006d21>] show_stack+0x21/0x50 [<ffffffff81605f07>] dump_stack+0x41/0x51 [<ffffffff8132c7a0>] kobject_add+0xa0/0xb0 [<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0 [<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60 [<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190 [<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70 [<ffffffff8105fd23>] cpu_notify+0x23/0x50 [<ffffffff81060037>] _cpu_up+0x157/0x170 [<ffffffff810600d9>] cpu_up+0x89/0xb0 [<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80 [<ffffffff814323cd>] device_online+0x5d/0xa0 [<ffffffff81432485>] online_store+0x75/0x80 [<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150 [<ffffffff811c5532>] vfs_write+0xb2/0x1f0 [<ffffffff811c5f42>] SyS_write+0x42/0xb0 [<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b [<00007f0132fb24e0>] 0x7f0132fb24e0 This is indeed because of an uninitialized kobject for blk_mq_ctx. The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it goes loop over hctx_for_each_ctx(), i.e. it initializes only for online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly onlined CPU is registered without initialization. This patch fixes the issue by initializing the all ctx kobjects belonging to each queue. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794 Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-10 08:57:31 -07:00
Bart Van Assche	959f5f5b2f	blk-mq: Use all available hardware queues Suppose that a system has two CPU sockets, three cores per socket, that it does not support hyperthreading and that four hardware queues are provided by a block driver. With the current algorithm this will lead to the following assignment of CPU cores to hardware queues: HWQ 0: 0 1 HWQ 1: 2 3 HWQ 2: 4 5 HWQ 3: (none) This patch changes the queue assignment into: HWQ 0: 0 1 HWQ 1: 2 HWQ 2: 3 4 HWQ 3: 5 In other words, this patch has the following three effects: - All four hardware queues are used instead of only three. - CPU cores are spread more evenly over hardware queues. For the above example the range of the number of CPU cores associated with a single HWQ is reduced from [0..2] to [1..2]. - If the number of HWQ's is a multiple of the number of CPU sockets it is now guaranteed that all CPU cores associated with a single HWQ reside on the same CPU socket. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:08:21 -07:00
Bart Van Assche	52f7eb945f	blk-mq: Micro-optimize bt_get() Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr() calls into a single call. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:28 -07:00
Bart Van Assche	c38d185d4a	blk-mq: Fix a race between bt_clear_tag() and bt_get() What we need is the following two guarantees: * Any thread that observes the effect of the test_and_set_bit() by __bt_get_word() also observes the preceding addition of 'current' to the appropriate wait list. This is guaranteed by the semantics of the spin_unlock() operation performed by prepare_and_wait(). Hence the conversion of test_and_set_bit_lock() into test_and_set_bit(). * The wait lists are examined by bt_clear() after the tag bit has been cleared. clear_bit_unlock() guarantees that any thread that observes that the bit has been cleared also observes the store operations preceding clear_bit_unlock(). However, clear_bit_unlock() does not prevent that the wait lists are examined before that the tag bit is cleared. Hence the addition of a memory barrier between clear_bit() and the wait list examination. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:16 -07:00
Bart Van Assche	9e98e9d7cf	blk-mq: Avoid that __bt_get_word() wraps multiple times If __bt_get_word() is called with last_tag != 0, if the first find_next_zero_bit() fails, if after wrap-around the test_and_set_bit() call fails and find_next_zero_bit() succeeds, if the next test_and_set_bit() call fails and subsequently find_next_zero_bit() does not find a zero bit, then another wrap-around will occur. Avoid this by introducing an additional local variable. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:14 -07:00
Bart Van Assche	45a9c9d909	blk-mq: Fix a use-after-free blk-mq users are allowed to free the memory request_queue.tag_set points at after blk_cleanup_queue() has finished but before blk_release_queue() has started. This can happen e.g. in the SCSI core. The SCSI core namely embeds the tag_set structure in a SCSI host structure. The SCSI host structure is freed by scsi_host_dev_release(). This function is called after blk_cleanup_queue() finished but can be called before blk_release_queue(). This means that it is not safe to access request_queue.tag_set from inside blk_release_queue(). Hence remove the blk_sync_queue() call from blk_release_queue(). This call is not necessary - outstanding requests must have finished before blk_release_queue() is called. Additionally, move the blk_mq_free_queue() call from blk_release_queue() to blk_cleanup_queue() to avoid that struct request_queue.tag_set gets accessed after it has been freed. This patch avoids that the following kernel oops can be triggered when deleting a SCSI host for which scsi-mq was enabled: Call Trace: [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270 [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380 [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180 [<ffffffff8124d654>] blk_release_queue+0x84/0xd0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff81245895>] blk_put_queue+0x15/0x20 [<ffffffff8125c409>] disk_release+0x99/0xd0 [<ffffffff8133d056>] device_release+0x36/0xb0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff8125a78a>] put_disk+0x1a/0x20 [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0 [<ffffffff811d56a0>] blkdev_put+0x50/0x160 [<ffffffff81199eb4>] kill_block_super+0x44/0x70 [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60 [<ffffffff8119a87e>] deactivate_super+0x4e/0x70 [<ffffffff811b9833>] cleanup_mnt+0x43/0x90 [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20 [<ffffffff8107252c>] task_work_run+0xac/0xe0 [<ffffffff81002c01>] do_notify_resume+0x61/0xa0 [<ffffffff814d2c58>] int_signal+0x12/0x17 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:13 -07:00
Linus Torvalds	f3f62a38ce	SCSI for-linus on 20141208 This patch is the usual mix of driver updates (srp, ipr, scsi_debug, NCR5380, fnic, 53c974, ses, wd719x, hpsa, megaraid_sas). Of those, wd7a9x is new and 53c974 is a rewrite of the old tmscsim driver and the extensive work by Finn Thain rewrites all the NCR5380 based drivers. There's also extensive infrastructure updates: a new logging infrastructure for sense information and a rewrite of the tagged command queue API and an assortment of minor updates. Signed-off-by: James Bottomley <JBottomley@Parallels.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABAgAGBQJUhgazAAoJEDeqqVYsXL0MxIYH/2wCs9sne5cfDTjfufLJDdu1 WLoJgNMxv+14OukknJfG2Kk1WlHgLRM5+TVWIGiG0mmjXuFyShzIqEOHKTDWqxnE tBH4wLi/+XYqZAmAeim4/2zhvf+cUVVlIu01VERR5uwaBWYM8BeLSwjdnAAvEEwb iV74p1WV6frXo4guADplgtkjD0YxI4MTUZ1figRMlLO6WLFFyQ+95UfY8jFs+eQv zk63y7Mm7dQNd57/Wl3i89lw0kqlaJSZNl8Ovj1axy4rDYzT1wXhY8mEwD8fI8Ym wahldjFE5vXgj0NpO3tB3Z+UDP2YmQduMyzTxkPNnrEPKiOCsnQo42XR6vb92cQ= =Y+DU -----END PGP SIGNATURE----- Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This patch is the usual mix of driver updates (srp, ipr, scsi_debug, NCR5380, fnic, 53c974, ses, wd719x, hpsa, megaraid_sas). Of those, wd7a9x is new and 53c974 is a rewrite of the old tmscsim driver and the extensive work by Finn Thain rewrites all the NCR5380 based drivers. There's also extensive infrastructure updates: a new logging infrastructure for sense information and a rewrite of the tagged command queue API and an assortment of minor updates" * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (183 commits) scsi: set fmt to NULL scsi_extd_sense_format() by default libsas: remove task_collector mode wd719x: remove dma_cache_sync call scsi_debug: add Report supported opcodes+tmfs; Compare and write scsi_debug: change SCSI command parser to table driven scsi_debug: add Capacity Changed Unit Attention scsi_debug: append inject error flags onto scsi_cmnd object scsi_debug: pinpoint invalid field in sense data wd719x: Add firmware documentation wd719x: Introduce Western Digital WD7193/7197/7296 PCI SCSI card driver eeprom-93cx6: Add (read-only) support for 8-bit mode esas2r: fix an oversight in setting return value esas2r: fix an error path in esas2r_ioctl_handler esas2r: fir error handling in do_fm_api scsi: add SPC-3 command definitions scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 scsi: remove scsi_driver owner field scsi: move scsi_dispatch_cmd to scsi_lib.c scsi: stop passing a gfp_mask argument down the command setup path scsi: remove scsi_next_command ...	2014-12-08 21:19:19 -08:00
Ming Lei	19c66e59ce	blk-mq: prevent unmapped hw queue from being scheduled When one hardware queue has no mapped software queues, it shouldn't have been scheduled. Otherwise WARNING or OOPS can triggered. blk_mq_hw_queue_mapped() helper is introduce for fixing the problem. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-08 21:37:08 -07:00
Rafael J. Wysocki	e3d857e1ae	Merge branch 'pm-runtime' * pm-runtime: (25 commits) i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hwrandom / exynos / PM: Use CONFIG_PM in #ifdef block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM USB / PM: Drop CONFIG_PM_RUNTIME from the USB core PM: Merge the SET*_RUNTIME_PM_OPS() macros PM / Kconfig: Do not select PM directly from Kconfig files PCI / PM: Drop CONFIG_PM_RUNTIME from the PCI core ...	2014-12-08 20:00:44 +01:00
Jens Axboe	080ff35114	blk-mq: re-check for available tags after running the hardware queue If we run out of tags and have to sleep, we run the hardware queue to kick pending IO into gear. During that run, we may have completed requests, so re-check if we have free tags before going to sleep. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-08 08:49:06 -07:00
Bart Van Assche	b32232073e	blk-mq: fix hang in bt_get() Avoid that if there are fewer hardware queues than CPU threads that bt_get() can hang. The symptoms of the hang were as follows: * All tags allocated for a particular hardware queue. * (nr_tags) pending commands for that hardware queue. * No pending commands for the software queues associated with that hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-08 08:46:34 -07:00
James Bottomley	dc843ef00e	Merge remote-tracking branch 'scsi-queue/core-for-3.19' into for-linus	2014-12-08 07:40:20 -08:00
Rafael J. Wysocki	47fafbc701	block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM After commit `b2b49ccbdd` (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks depending on CONFIG_PM_RUNTIME may now be changed to depend on CONFIG_PM. Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core. Reviewed-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2014-12-04 01:00:23 +01:00
Darrick J. Wong	594416a720	block: fix regression where bio_integrity_process uses wrong bio_vec iterator bio integrity handling is broken on a system with LVM layered atop a DIF/DIX SCSI drive because device mapper clones the bio, modifies the clone, and sends the clone to the lower layers for processing. However, the clone bio has bi_vcnt == 0, which means that when the sd driver calls bio_integrity_process to attach DIX data, the for_each_segment_all() call (which uses bi_vcnt) returns immediately and random garbage is sent to the disk on a disk write. The disk of course returns an error. Therefore, teach bio_integrity_process() to use bio_for_each_segment() to iterate the bio_vecs, since the per-bio iterator tracks which bio_vecs are associated with that particular bio. The integrity handling code is effectively part of the "driver" (it's not the bio owner), so it must use the correct iterator function. v2: Fix a compiler warning about abandoned local variables. This patch supersedes "block: bio_integrity_process uses wrong bio_vec iterator". Patch applies against 3.18-rc6. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-02 08:15:21 -07:00
Shaohua Li	6637fadf25	blk-mq: move the kdump check to blk_mq_alloc_tag_set We call blk_mq_alloc_tag_set() first then blk_mq_init_queue(). The requests are allocated in the former function. So the kdump check should be moved to there to really save memory. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-30 18:35:39 -07:00
Jens Axboe	70114c393c	blk-mq: cleanup tag free handling We only call __blk_mq_put_tag() and __blk_mq_put_reserved_tag() from blk_mq_put_tag(), so just inline the two calls instead of having them as separate functions. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 15:52:30 -07:00
Jens Axboe	a33c1ba291	blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map We currently use num_possible_cpus(), but that breaks on sparc64 where the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID instead, so we don't end up reading from invalid memory. Cc: stable@kernel.org # 3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 15:02:42 -07:00
Hannes Reinecke	eb846d9f14	scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 SPC-3 defines SERVICE ACTION IN(12) and SERVICE ACTION IN(16). So rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 to be consistent with SPC and to allow for better distinction. Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Robert Elliott <elliott@hp.com> Reviewed-by: Robert Elliott <elliott@hp.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2014-11-24 20:01:40 +01:00
Gu Zheng	394ffa503b	blk: introduce generic io stat accounting help function Many block drivers accounting io stat based on bio (e.g. NVMe...), the blk_account_io_start/end() which is based on request does not make sense to them, so here we introduce the similar help function named generic_start/end_io_acct base on raw sectors, and it can simplify some driver's open io accounting code. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:04:44 -07:00
Christoph Hellwig	b657d7e632	blk-mq: handle the single queue case in blk_mq_hctx_next_cpu Don't duplicate the code to handle the not cpu bounce case in the caller, do it inside blk_mq_hctx_next_cpu instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:02:07 -07:00
Jens Axboe	5fabcb4c33	genhd: check for int overflow in disk_expand_part_tbl() We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition() with a user passed in partno value. If we pass in 0x7fffffff, the new target in disk_expand_part_tbl() overflows the 'int' and we access beyond the end of ptbl->part[] and even write to it when we do the rcu_assign_pointer() to assign the new partition. Reported-by: David Ramos <daramos@stanford.edu> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-19 13:09:07 -07:00
Jens Axboe	7c7f2f2bc9	blk-mq: add blk_mq_free_hctx_request() It's silly to use blk_mq_free_request() which in turn maps the request to the hardware queue, for places where we already know what the hardware queue is. This saves us an extra mapping of a hardware queue on request completion, if the caller knows this information already. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-17 10:41:57 -07:00
Jens Axboe	1a3b595a28	blk-mq: export blk_mq_free_request() Drivers that know they are blk-mq should just use this function instead of calling through blk_put_request(). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-17 10:40:48 -07:00
Christoph Hellwig	125c99bc8b	scsi: add new scsi-command flag for tagged commands Currently scsi piggy backs on the block layer to define the concept of a tagged command. But we want to be able to have block-level host-wide tags assigned even for untagged commands like the initial INQUIRY, so add a new SCSI-level flag for commands that are tagged at the scsi level, so that even commands without that set can have tags assigned to them. Note that this alredy is the case for the blk-mq code path, and this just lets the old path catch up with it. We also set this flag based upon sdev->simple_tags instead of the block queue flag, so that it is entirely independent of the block layer tagging, and thus always correct even if a driver doesn't use block level tagging yet. Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and removing it forces them to look for the proper replacement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>	2014-11-12 11:19:40 +01:00
Bart Van Assche	205fb5f5ba	blk-mq: add blk_mq_unique_tag() The queuecommand() callback functions in SCSI low-level drivers need to know which hardware context has been selected by the block layer. Since this information is not available in the request structure, and since passing the hctx pointer directly to the queuecommand callback function would require modification of all SCSI LLDs, add a function to the block layer that allows to query the hardware context index. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2014-11-12 11:16:09 +01:00
Ming Lei	7f60dcaaf9	block: blk-merge: fix blk_recount_segments() For cloned bio, bio->bi_vcnt can't be used at all, and we have resort to bio_segments() to figure out how many segment there are in the bio. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-11 16:24:15 -07:00
Paolo Bonzini	2a90d4aae5	blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable blk-mq is using preempt_disable/enable in order to ensure that the queue runners are placed on the right CPU. This does not work with the RT patches, because __blk_mq_run_hw_queue takes a non-raw spinlock with the preemption-disabled region. If there is contention on the lock, this violates the rules for preemption-disabled regions. While this should be easily fixable within the RT patches just by doing migrate_disable/enable, we can do better and document _why_ this particular region runs with disabled preemption. After the previous patch, it is trivial to switch it to get/put_cpu; the RT patches then can change it to get_cpu_light, which lets virtio-blk run under RT kernels. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-11 11:04:49 -07:00
Paolo Bonzini	398205b839	blk_mq: call preempt_disable/enable in blk_mq_run_hw_queue, and only if needed preempt_disable/enable surrounds every call to blk_mq_run_hw_queue, except the one in blk-flush.c. In fact that one is always asynchronous, and it does not need smp_processor_id(). We can do the same for all other calls, avoiding preempt_disable when async is true. This avoids peppering blk-mq.c with preemption-disabled regions. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-11 11:04:47 -07:00
Tony Battersby	92697dc947	scsi: Fix more error handling in SCSI_IOCTL_SEND_COMMAND Fix an error path in SCSI_IOCTL_SEND_COMMAND that calls blk_put_request(rq) on an invalid IS_ERR(rq) pointer. Fixes: `a492f07545` ("block,scsi: fixup blk_get_request dead queue scenarios") Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 15:41:47 -07:00
Tejun Heo	f3af020b9a	blk-mq: make mq_queue_reinit_notify() freeze queues in parallel q->mq_usage_counter is a percpu_ref which is killed and drained when the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit() which involves freezing the queue is invoked on all existing queues. Because percpu_ref killing and draining involve a RCU grace period, doing the above on one queue after another may take a long time if there are many queues on the system. This patch splits out initiation of freezing and waiting for its completion, and updates blk_mq_queue_reinit_notify() so that the queues are frozen in parallel instead of one after another. Note that freezing and unfreezing are moved from blk_mq_queue_reinit() to blk_mq_queue_reinit_notify(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 14:49:31 -07:00
Jan Kara	ece9c72acc	block: Fix computation of merged request priority Priority of a merged request is computed by ioprio_best(). If one of the requests has undefined priority (IOPRIO_CLASS_NONE) and another request has priority from IOPRIO_CLASS_BE, the function will return the undefined priority which is wrong. Fix the function to properly return priority of a request with the defined priority. Fixes: `d58cdfb89c` CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-31 08:30:43 -06:00
Jens Axboe	e167dfb53c	blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag Drivers can now tell blk-mq if they take advantage of the deferred issue through 'last' or not. If they do, don't do queue-direct for sync IO. This is a preparation patch for the nvme conversion. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-29 11:18:26 -06:00
Jens Axboe	74c450521d	blk-mq: add a 'list' parameter to ->queue_rq() Since we have the notion of a 'last' request in a chain, we can use this to have the hardware optimize the issuing of requests. Add a list_head parameter to queue_rq that the driver can use to temporarily store hw commands for issue when 'last' is true. If we are doing a chain of requests, pass in a NULL list for the first request to force issue of that immediately, then batch the remainder for deferred issue until the last request has been sent. Instead of adding yet another argument to the hot ->queue_rq path, encapsulate the passed arguments in a blk_mq_queue_data structure. This is passed as a constant, and has been tested as faster than passing 4 (or even 3) args through ->queue_rq. Update drivers for the new ->queue_rq() prototype. There are no functional changes in this patch for drivers - if they don't use the passed in list, then they will just queue requests individually like before. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-29 11:14:52 -06:00
Sudip Mukherjee	d32f6b5752	block: fix wrong error return in elevator_init() while compiling integer err was showing as a set but unused variable. elevator_init_fn can be either cfq_init_queue or deadline_init_queue or noop_init_queue. all three of these functions are returning -ENOMEM if they fail to allocate the queue. so we should actually be returning the error code rather than returning 0 always. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-23 12:35:42 -06:00
Jan Kara	84ce0f0e94	scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND When sg_scsi_ioctl() fails to prepare request to submit in blk_rq_map_kern() we jump to a label where we just end up copying (luckily zeroed-out) kernel buffer to userspace instead of reporting error. Fix the problem by jumping to the right label. CC: Jens Axboe <axboe@kernel.dk> CC: linux-scsi@vger.kernel.org CC: stable@vger.kernel.org Coverity-id: 1226871 Signed-off-by: Jan Kara <jack@suse.cz> Fixed up the, now unused, out label. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-22 20:13:39 -06:00
Ming Lei	76d8137a31	blk-merge: recaculate segment if it isn't less than max segments The problem is introduced by commit 764f612c6c3c231b(blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio), and merge is needed if number of current segment isn't less than max segments. Strictly speaking, bio->bi_vcnt shouldn't be used here since it may not be accurate in cases of both cloned bio or bio cloned from, but bio_segments() is a bit expensive, and bi_vcnt is still the biggest number, so the approach should work. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-21 19:00:32 -06:00
Christoph Hellwig	34b48db66e	block: remove artifical max_hw_sectors cap Set max_sectors to the value the drivers provides as hardware limit by default. Linux had proper I/O throttling for a long time and doesn't rely on a artifically small maximum I/O size anymore. By not limiting the I/O size by default we remove an annoying tuning step required for most Linux installation. Note that both the user, and if absolutely required the driver can still impose a limit for FS requests below max_hw_sectors_kb. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-21 14:02:54 -06:00
Linus Torvalds	d3dc366bba	Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups. - blk-mq timeout updates and fixes from Christoph. - Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used. - blk integrity updates and fixes from Martin and Gu Zheng. - Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code should scale better for flush intensive workloads on blk-mq. - Improve the error printing, from Rob Elliott. - Backing device improvements and cleanups from Tejun. - Fixup of a misplaced rq_complete() tracepoint from Hannes. - Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence. - Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory. - Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me. - Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups). We now just use a single queue and limited depth for that" * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits) block: Remove REQ_KERNEL blk-mq: allocate cpumask on the home node bio-integrity: remove the needless fail handle of bip_slab creating block: include func name in __get_request prints block: make blk_update_request print prefix match ratelimited prefix blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio block: fix alignment_offset math that assumes io_min is a power-of-2 blk-mq: Make bt_clear_tag() easier to read blk-mq: fix potential hang if rolling wakeup depth is too high block: add bioset_create_nobvec() block: use bio_clone_fast() in blk_rq_prep_clone() block: misplaced rq_complete tracepoint sd: Honor block layer integrity handling flags block: Replace strnicmp with strncasecmp block: Add T10 Protection Information functions block: Don't merge requests if integrity flags differ block: Integrity checksum flag block: Relocate bio integrity flags block: Add a disk flag to block integrity profile block: Add prefix to block integrity profile flags ...	2014-10-18 11:53:51 -07:00
Jens Axboe	a86073e48a	blk-mq: allocate cpumask on the home node All other allocs are done on the specific node, somehow the cpumask for hw queue runs was missed. Fix that by using zalloc_cpumask_var_node() in blk_mq_init_queue(). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 15:41:54 -06:00
Gu Zheng	b65c7491cb	bio-integrity: remove the needless fail handle of bip_slab creating bip_slab is created with SLAB_PANIC, so the fail handler is unneeded. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 15:09:38 -06:00
Robert Elliott	7b2b10e0e2	block: include func name in __get_request prints In __get_request calls to printk_ratelimited, include the function name so the callbacks suppressed message matches the messages that are printed, and add "dev" before the device name so it matches other block layer messages. Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 08:34:23 -06:00
Robert Elliott	ef3ecb66bc	block: make blk_update_request print prefix match ratelimited prefix In blk_update_request, change the printk_ratelimited prefix from end_request to blk_update_request so it matches the name printed if rate limiting occurs. Old: [10234.933106] blk_update_request: 174 callbacks suppressed [10234.934940] end_request: critical target error, dev sdr, sector 16 [10234.949788] end_request: critical target error, dev sdr, sector 16 New: [16863.445173] blk_update_request: 398 callbacks suppressed [16863.447029] blk_update_request: critical target error, dev sdr, sector 1442066176 [16863.449383] blk_update_request: critical target error, dev sdr, sector 802802888 [16863.451680] blk_update_request: critical target error, dev sdr, sector 1609535456 Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 08:34:21 -06:00
Linus Torvalds	c798360cd1	Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "A lot of activities on percpu front. Notable changes are... - percpu allocator now can take @gfp. If @gfp doesn't contain GFP_KERNEL, it tries to allocate from what's already available to the allocator and a work item tries to keep the reserve around certain level so that these atomic allocations usually succeed. This will replace the ad-hoc percpu memory pool used by blk-throttle and also be used by the planned blkcg support for writeback IOs. Please note that I noticed a bug in how @gfp is interpreted while preparing this pull request and applied the fix `6ae833c7fe` ("percpu: fix how @gfp is interpreted by the percpu allocator") just now. - percpu_ref now uses longs for percpu and global counters instead of ints. It leads to more sparse packing of the percpu counters on 64bit machines but the overhead should be negligible and this allows using percpu_ref for refcnting pages and in-memory objects directly. - The switching between percpu and single counter modes of a percpu_ref is made independent of putting the base ref and a percpu_ref can now optionally be initialized in single or killed mode. This allows avoiding percpu shutdown latency for cases where the refcounted objects may be synchronously created and destroyed in rapid succession with only a fraction of them reaching fully operational status (SCSI probing does this when combined with blk-mq support). It's also planned to be used to implement forced single mode to detect underflow more timely for debugging. There's a separate branch percpu/for-3.18-consistent-ops which cleans up the duplicate percpu accessors. That branch causes a number of conflicts with s390 and other trees. I'll send a separate pull request w/ resolutions once other branches are merged" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits) percpu: fix how @gfp is interpreted by the percpu allocator blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky percpu_ref: add PERCPU_REF_INIT_* flags percpu_ref: decouple switching to percpu mode and reinit percpu_ref: decouple switching to atomic mode and killing percpu_ref: add PCPU_REF_DEAD percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch percpu_ref: replace pcpu_ prefix with percpu_ percpu_ref: minor code and comment updates percpu_ref: relocate percpu_ref_reinit() Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Revert "percpu: free percpu allocation info for uniprocessor system" percpu-refcount: make percpu_ref based on longs instead of ints percpu-refcount: improve WARN messages percpu: fix locking regression in the failure path of pcpu_alloc() percpu-refcount: add @gfp to percpu_ref_init() proportions: add @gfp to init functions percpu_counter: add @gfp to percpu_counter_init() percpu_counter: make percpu_counters_lock irq-safe ...	2014-10-10 07:26:02 -04:00
Ming Lei	764f612c6c	blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio It isn't correct to figure out req->bi_phys_segments from bio->bi_vcnt if the bio is cloned. Signed-off-by: Ming Lei <ming.lei@canonical.com> Tested-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-09 13:11:44 -06:00
Mike Snitzer	b8839b8c55	block: fix alignment_offset math that assumes io_min is a power-of-2 The math in both blk_stack_limits() and queue_limit_alignment_offset() assume that a block device's io_min (aka minimum_io_size) is always a power-of-2. Fix the math such that it works for non-power-of-2 io_min. This issue (of alignment_offset != 0) became apparent when testing dm-thinp with a thinp blocksize that matches a RAID6 stripesize of 1280K. Commit `fdfb4c8c1` ("dm thin: set minimum_io_size to pool's data block size") unlocked the potential for alignment_offset != 0 due to the dm-thin-pool's io_min possibly being a non-power-of-2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-09 09:41:40 -06:00
Linus Torvalds	28596c9722	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull "trivial tree" updates from Jiri Kosina: "Usual pile from trivial tree everyone is so eagerly waiting for" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Remove MN10300_PROC_MN2WS0038 mei: fix comments treewide: Fix typos in Kconfig kprobes: update jprobe_example.c for do_fork() change Documentation: change "&" to "and" in Documentation/applying-patches.txt Documentation: remove obsolete pcmcia-cs from Changes Documentation: update links in Changes Documentation: Docbook: Fix generated DocBook/kernel-api.xml score: Remove GENERIC_HAS_IOMAP gpio: fix 'CONFIG_GPIO_IRQCHIP' comments tty: doc: Fix grammar in serial/tty dma-debug: modify check_for_stack output treewide: fix errors in printk genirq: fix reference in devm_request_threaded_irq comment treewide: fix synchronize_rcu() in comments checkstack.pl: port to AArch64 doc: queue-sysfs: minor fixes init/do_mounts: better syntax description MIPS: fix comment spelling powerpc/simpleboot: fix comment ...	2014-10-07 21:16:26 -04:00
Bart Van Assche	9d8f0bcca6	blk-mq: Make bt_clear_tag() easier to read Eliminate a backwards goto statement from bt_clear_tag(). Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-07 08:45:21 -06:00
Jens Axboe	abab13b5c4	blk-mq: fix potential hang if rolling wakeup depth is too high We currently divide the queue depth by 4 as our batch wakeup count, but we split the wakeups over BT_WAIT_QUEUES number of wait queues. This defaults to 8. If the product of the resulting batch wake count and BT_WAIT_QUEUES is higher than the device queue depth, we can get into a situation where a task goes to sleep waiting for a request, but never gets woken up. Reported-by: Bart Van Assche <bvanassche@acm.org> Fixes: `4bb659b156` Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-07 08:39:20 -06:00
Junichi Nomura	d8f429e166	block: add bioset_create_nobvec() Users of bio_clone_fast() do not want bios with their own bvecs. Allocating a bvec mempool as part of the bioset intended for such users is a waste of memory. bioset_create_nobvec() creates a bioset that doesn't have the bvec mempool. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-03 15:28:18 -06:00
Junichi Nomura	11dfce509e	block: use bio_clone_fast() in blk_rq_prep_clone() Request cloning clones bios in the request to track the completion of each bio. For that purpose, we can use bio_clone_fast() instead of bio_clone() to avoid unnecessary allocation and copy of bvecs. This patch reduces memory footprint of request-based device-mapper (about 1-4KB for each request) and is a preparation for further reduction of memory usage by removing unused bvec mempool. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-03 15:28:16 -06:00
Hannes Reinecke	4a0efdc933	block: misplaced rq_complete tracepoint The rq_complete tracepoint was never issued for empty requests, causing the resulting blktrace information to never show any completion for those request. Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-01 08:17:42 -06:00
Rasmus Villemoes	582940508b	block: Replace strnicmp with strncasecmp The kernel used to contain two functions for length-delimited, case-insensitive string comparison, strnicmp with correct semantics and a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp was renamed to strncasecmp, and strnicmp made into a wrapper for the new strncasecmp to avoid breaking existing users. To allow the compat wrapper strnicmp to be removed at some point in the future, and to avoid the extra indirection cost, do s/strnicmp/strncasecmp/g. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 16:48:55 -06:00
Martin K. Petersen	2341c2f8c3	block: Add T10 Protection Information functions The T10 Protection Information format is also used by some devices that do not go through the SCSI layer (virtual block devices, NVMe). Relocate the relevant functions to a block layer library that can be used without involving SCSI. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:59 -06:00
Martin K. Petersen	4eaf99bead	block: Don't merge requests if integrity flags differ We'd occasionally merge requests with conflicting integrity flags. Introduce a merge helper which checks that the requests have compatible integrity payloads. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:57 -06:00
Martin K. Petersen	aae7df5019	block: Integrity checksum flag Make the choice of checksum a per-I/O property by introducing a flag that can be inspected by the SCSI layer. There are several reasons for this: 1. It allows us to switch choice of checksum without unloading and reloading the HBA driver. 2. During error recovery we need to be able to tell the HBA that checksums read from disk should not be verified and converted to IP checksums. 3. For error injection purposes we need to be able to write a bad guard tag to storage. Since the storage device only supports T10 CRC we need to be able to disable IP checksum conversion on the HBA. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:55 -06:00
Martin K. Petersen	b1f0138857	block: Relocate bio integrity flags Move flags affecting the integrity code out of the bio bi_flags and into the block integrity payload. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:54 -06:00
Martin K. Petersen	3aec2f41a8	block: Add a disk flag to block integrity profile So far we have relied on the app tag size to determine whether a disk has been formatted with T10 protection information or not. However, not all target devices provide application tag storage. Add a flag to the block integrity profile that indicates whether the disk has been formatted with protection information. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:51 -06:00
Martin K. Petersen	8288f496eb	block: Add prefix to block integrity profile flags Add a BLK_ prefix to the integrity profile flags. Also rename the flags to be more consistent with the generate/verify terminology in the rest of the integrity code. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:51 -06:00
Martin K. Petersen	1859308853	block: Clean up the code used to generate and verify integrity metadata Instead of the "operate" parameter we pass in a seed value and a pointer to a function that can be used to process the integrity metadata. The generation function is changed to have a return value to fit into this scheme. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:51 -06:00
Martin K. Petersen	5a2aa87305	block: Make protection interval calculation generic Now that the protection interval has been detached from the sector size we need to be able to handle sizes that are different from 4K and 512. Make the interval calculation generic. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	3be91c4a3d	block: Deprecate the use of the term sector in the context of block integrity The protection interval is not necessarily tied to the logical block size of a block device. Stop using the terms "sector" and "sectors". Going forward we will use the term "seed" to describe the initial reference tag value for a given I/O. "Interval" will be used to describe the portion of the data buffer that a given piece of protection information is associated with. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	5f9378fa9c	block: Remove bip_buf bip_buf is not really needed so we can remove it. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	8492b68bc4	block: Remove integrity tagging functions None of the filesystems appear interested in using the integrity tagging feature. Potentially because very few storage devices actually permit using the application tag space. Remove the tagging functions. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	180b2f95dd	block: Replace bi_integrity with bi_special For commands like REQ_COPY we need a way to pass extra information along with each bio. Like integrity metadata this information must be available at the bottom of the stack so bi_private does not suffice. Rename the existing bi_integrity field to bi_special and make it a union so we can have different bio extensions for each class of command. We previously used bi_integrity != NULL as a way to identify whether a bio had integrity metadata or not. Introduce a REQ_INTEGRITY to be the indicator now that bi_special can contain different things. In addition, bio_integrity(bio) will now return a pointer to the integrity payload (when applicable). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:46 -06:00
Martin K. Petersen	e7258c1a26	block: Get rid of bdev_integrity_enabled() bdev_integrity_enabled() is only used by bio_integrity_enabled(). Combine these two functions. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:44 -06:00
Ming Lei	f70ced0917	blk-mq: support per-distpatch_queue flush machinery This patch supports to run one single flush machinery for each blk-mq dispatch queue, so that: - current init_request and exit_request callbacks can cover flush request too, then the buggy copying way of initializing flush request's pdu can be fixed - flushing performance gets improved in case of multi hw-queue In fio sync write test over virtio-blk(4 hw queues, ioengine=sync, iodepth=64, numjobs=4, bs=4K), it is observed that througput gets increased a lot over my test environment: - throughput: +70% in case of virtio-blk over null_blk - throughput: +30% in case of virtio-blk over SSD image The multi virtqueue feature isn't merged to QEMU yet, and patches for the feature can be found in below tree: git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4 And simply passing 'num_queues=4 vectors=5' should be enough to enable multi queue(quad queue) feature for QEMU virtio-blk. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:45 -06:00
Ming Lei	e97c293cdf	block: introduce 'blk_mq_ctx' parameter to blk_get_flush_queue This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(), so that this function can find the corresponding blk_flush_queue bound with current mq context since the flush queue will become per hw-queue. For legacy queue, the parameter can be simply 'NULL'. For multiqueue case, the parameter should be set as the context from which the related request is originated. With this context info, the hw queue and related flush queue can be found easily. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:44 -06:00
Ming Lei	0bae352da5	block: flush: avoid to figure out flush queue unnecessarily Just figuring out flush queue at the entry of kicking off flush machinery and request's completion handler, then pass it through. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:42 -06:00
Ming Lei	ba483388e3	block: remove blk_init_flush() and its pair Now mission of the two helpers is over, and just call blk_alloc_flush_queue() and blk_free_flush_queue() directly. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:41 -06:00
Ming Lei	7c94e1c157	block: introduce blk_flush_queue to drive flush machinery This patch introduces 'struct blk_flush_queue' and puts all flush machinery related fields into this structure, so that - flush implementation details aren't exposed to driver - it is easy to convert to per dispatch-queue flush machinery This patch is basically a mechanical replacement. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:40 -06:00
Ming Lei	7ddab5de5b	block: avoid to use q->flush_rq directly This patch trys to use local variable to access flush request, so that we can convert to per-queue flush machinery a bit easier. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:38 -06:00
Ming Lei	3c09676c12	block: move flush initialization to blk_flush_init These fields are always used with the flush request, so initialize them together. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:37 -06:00
Ming Lei	f355265571	block: introduce blk_init_flush and its pair These two temporary functions are introduced for holding flush initialization and de-initialization, so that we can introduce 'flush queue' easier in the following patch. And once 'flush queue' and its allocation/free functions are ready, they will be removed for sake of code readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:35 -06:00
Ming Lei	1bcb1eada4	blk-mq: allocate flush_rq in blk_mq_init_flush() It is reasonable to allocate flush req in blk_mq_init_flush(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:34 -06:00
Ming Lei	08e98fc601	blk-mq: handle failure path for initializing hctx Failure of initializing one hctx isn't handled, so this patch introduces blk_mq_init_hctx() and its pair to handle it explicitly. Also this patch makes code cleaner. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:32 -06:00
Tejun Heo	17497acbdc	blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take above ten seconds when scsi-mq is used. [ 0.949892] scsi host0: Virtio SCSI HBA [ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz <stall> [ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0 [ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0 [ 16.194099] osd: LOADED open-osd 0.2.1 [ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.208478] sd 0:0:0:0: [sda] Write Protect is off [ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off [ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA This is also the reason why request_queues start in bypass mode which is ended on blk_register_queue() as shutting down a fully functional queue also involves a RCU grace period and the queues for non-existent SCSI devices never reach registration. blk-mq basically needs to do the same thing - start the mq in a degraded mode which is faster to shut down and then make it fully functional only after the queue reaches registration. percpu_ref recently grew facilities to force atomic operation until explicitly switched to percpu mode, which can be used for this purpose. This patch makes blk-mq initialize q->mq_usage_counter in atomic mode and switch it to percpu mode only once blk_register_queue() is reached. Note that this issue was previously worked around by `0a30288da1` ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") for v3.17. The temp fix was reverted in preparation of adding persistent atomic mode to percpu_ref by `9eca80461a` ("Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe""). This patch and the prerequisite percpu_ref changes will be merged during v3.18 devel cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: `add703fda9` ("blk-mq: use percpu_ref for mq usage count") Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-09-24 13:37:21 -04:00
Tejun Heo	2aad2a86f6	percpu_ref: add PERCPU_REF_INIT_* flags With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-09-24 13:31:50 -04:00
Tejun Heo	9eca80461a	Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" This reverts commit `0a30288da1`, which was a temporary fix for SCSI blk-mq stall issue. The following patches will fix the issue properly by introducing atomic mode to percpu_ref. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>	2014-09-24 13:07:33 -04:00
Tejun Heo	d06efebf0c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 This is to receive `0a30288da1` ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>	2014-09-24 13:00:21 -04:00
Tejun Heo	0a30288da1	blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take more than ten seconds when scsi-mq is used. This will be properly fixed by implementing a mechanism to keep q->mq_usage_counter in atomic mode till genhd registration; however, that involves rather big updates to percpu_ref which is difficult to apply late in the devel cycle (v3.17-rc6 at the moment). As a stop-gap measure till the proper fix can be implemented in the next cycle, this patch introduces __percpu_ref_kill_expedited() and makes blk_mq_freeze_queue() use it. This is heavy-handed but should work for testing the experimental SCSI blk-mq implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: `add703fda9` ("blk-mq: use percpu_ref for mq usage count") Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-24 08:29:36 -06:00
Jens Axboe	46f341ffcf	genhd: fix leftover might_sleep() in blk_free_devt() Commit `2da78092` changed the locking from a mutex to a spinlock, so we now longer sleep in this context. But there was a leftover might_sleep() in there, which now triggers since we do the final free from an RCU callback. Get rid of it. Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 14:45:45 -06:00
Christoph Hellwig	9041583765	block: fix blk_abort_request on blk-mq Signed-off-by: Christoph Hellwig <hch@lst.de> Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:08 -06:00
Ming Lei	5e940aaa59	blk-timeout: fix blk_add_timer Commit 8cb34819cdd5d(blk-mq: unshared timeout handler) introduces blk-mq's own timeout handler, and removes following line: blk_queue_rq_timed_out(q, blk_mq_rq_timed_out); which then causes blk_add_timer() to bypass adding the timer, since blk-mq no longer has q->rq_timed_out_fn defined. This patch fixes the problem by bypassing the check for blk-mq, so that both request deadlines are still set and the rolling timer updated. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:08 -06:00
Jens Axboe	aedcd72f6c	blk-mq: limit memory consumption if a crash dump is active It's not uncommon for crash dump kernels to be limited to 128MB or something low in that area. This is normally not a problem for devices as we don't use that much memory, but for some shared SCSI setups with huge queue depths, it can potentially fill most of memory with tons of request allocations. blk-mq does scale back when it fails to allocate memory, but it scales back just enough so that blk-mq succeeds. This could still leave the system with not enough memory to make any real progress. Check if we are in a kdump environment and limit the hardware queues and tag depth. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:08 -06:00
Ming Lei	2edd2c740b	blk-mq: remove unnecessary blk_clear_rq_complete() This patch removes two unnecessary blk_clear_rq_complete(), the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(), so: - The blk_clear_rq_complete() in blk_flush_restore_request() needn't because the request will be freed later, and clearing it here may open a small race window with timeout. - The blk_clear_rq_complete() in blk_mq_requeue_request() isn't necessary too, even though REQ_ATOM_STARTED is cleared in __blk_mq_requeue_request(), in theory it still may cause a small race window with timeout since the two clear_bit() may be reordered. Signed-off-by: Ming Lei <ming.lei@canoical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	0152fb6b57	blk-mq: pass a reserved argument to the timeout handler Allow blk-mq to pass an argument to the timeout handler to indicate if we're timing out a reserved or regular command. For many drivers those need to be handled different. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	46f92d42ee	blk-mq: unshared timeout handler Duplicate the (small) timeout handler in blk-mq so that we can pass arguments more easily to the driver timeout handler. This enables the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	81481eb423	blk-mq: fix and simplify tag iteration for the timeout handler Don't do a kmalloc from timer to handle timeouts, chances are we could be under heavy load or similar and thus just miss out on the timeouts. Fortunately it is very easy to just iterate over all in use tags, and doing this properly actually cleans up the blk_mq_busy_iter API as well, and prepares us for the next patch by passing a reserved argument to the iterator. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	c8a446ad69	blk-mq: rename blk_mq_end_io to blk_mq_end_request Now that we've changed the driver API on the submission side use the opportunity to fix up the name on the completion side to fit into the general scheme. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	e2490073cd	blk-mq: call blk_mq_start_request from ->queue_rq When we call blk_mq_start_request from the core blk-mq code before calling into ->queue_rq there is a racy window where the timeout handler can hit before we've fully set up the driver specific part of the command. Move the call to blk_mq_start_request into the driver so the driver can start the request only once it is fully set up. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	bf57229745	blk-mq: remove REQ_END Pass an explicit parameter for the last request in a batch to ->queue_rq instead of using a request flag. Besides being a cleaner and non-stateful interface this is also required for the next patch, which fixes the blk-mq I/O submission code to not start a time too early. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Jens Axboe	6d11fb454b	Merge branch 'for-linus' into for-3.18/core Moving patches from for-linus to 3.18 instead, pull in this changes that will go to Linus today.	2014-09-22 11:57:32 -06:00
Jens Axboe	8b95741569	blk-mq: use blk_mq_start_hw_queues() when running requeue work When requests are retried due to hw or sw resource shortages, we often stop the associated hardware queue. So ensure that we restart the queues when running the requeue work, otherwise the queue run will be a no-op. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:56 -06:00
Jens Axboe	6b55e1f2d0	blk-mq: fix potential oops on out-of-memory in __blk_mq_alloc_rq_maps() __blk_mq_alloc_rq_maps() can be invoked multiple times, if we scale back the queue depth if we are low on memory. So don't clear set->tags when we fail, this is handled directly in the parent function, blk_mq_alloc_tag_set(). Reported-by: Robert Elliott <Elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:23 -06:00
Christoph Hellwig	a57a178a49	blk-mq: avoid infinite recursion with the FUA flag We should not insert requests into the flush state machine from blk_mq_insert_request. All incoming flush requests come through blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait should only be called for BLOCK_PC requests. All other callers deal with requests that already went through the flush statemchine and shouldn't be reinserted into it. Reported-by: Robert Elliott <Elliott@hp.com> Debugged-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:19 -06:00
David Hildenbrand	683d0e1262	blk-mq: Avoid race condition with uninitialized requests This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request. Also move the reset of cmd_flags for the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check(). Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reported-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com> Tested-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:14 -06:00
Jens Axboe	538b753418	blk-mq: request deadline must be visible before marking rq as started When we start the request, we set the deadline and flip the bits marking the request as started and non-complete. However, it's important that the deadline store is ordered before flipping the bits, otherwise we could have a small window where the request is marked started but with an invalid deadline. This can confuse the timeout handling. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:54:04 -06:00
Jens Axboe	b207892b06	Merge branch 'for-linus' into for-3.18/core A bit of churn on the for-linus side that would be nice to have in the core bits for 3.18, so pull it in to catch us up and make forward progress easier. Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/scsi_ioctl.c	2014-09-11 09:31:18 -06:00
Jens Axboe	a516440542	blk-mq: scale depth and rq map appropriate if low on memory If we are running in a kdump environment, resources are scarce. For some SCSI setups with a huge set of shared tags, we run out of memory allocating what the drivers is asking for. So implement a scale back logic to reduce the tag depth for those cases, allowing the driver to successfully load. We should extend this to detect low memory situations, and implement a sane fallback for those (1 queue, 64 tags, or something like that). Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-10 09:02:03 -06:00
Alan Stern	df35c7c912	Block: fix unbalanced bypass-disable in blk_register_queue When a queue is registered, the block layer turns off the bypass setting (because bypass is enabled when the queue is created). This doesn't work well for queues that are unregistered and then registered again; we get a WARNING because of the unbalanced calls to blk_queue_bypass_end(). This patch fixes the problem by making blk_register_queue() call blk_queue_bypass_end() only the first time the queue is registered. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: Tejun Heo <tj@kernel.org> CC: James Bottomley <James.Bottomley@HansenPartnership.com> CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-09 10:44:24 -06:00
Masanari Iida	da3dae54e4	Documentation: Docbook: Fix generated DocBook/kernel-api.xml This patch fix spelling typo found in DocBook/kernel-api.xml. It is because the file is generated from the source comments, I have to fix the comments in source codes. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-09-09 10:34:56 +02:00
Tejun Heo	ff9ea32381	block, bdi: an active gendisk always has a request_queue associated with it bdev_get_queue() returns the request_queue associated with the specified block_device. blk_get_backing_dev_info() makes use of bdev_get_queue() to determine the associated bdi given a block_device. All the callers of bdev_get_queue() including blk_get_backing_dev_info() assume that bdev_get_queue() may return NULL and implement NULL handling; however, bdev_get_queue() requires the passed in block_device is opened and attached to its gendisk. Because an active gendisk always has a valid request_queue associated with it, bdev_get_queue() can never return NULL and neither can blk_get_backing_dev_info(). Make it clear that neither of the two functions can return NULL and remove NULL handling from all the callers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-08 10:00:35 -06:00
Tejun Heo	f4da80727c	blkcg: remove blkcg->id blkcg->id is a unique id given to each blkcg; however, the cgroup_subsys_state which each blkcg embeds already has ->serial_nr which can be used for the same purpose. Drop blkcg->id and replace its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-08 09:55:37 -06:00
Tejun Heo	a34375ef9e	percpu-refcount: add @gfp to percpu_ref_init() Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>	2014-09-08 09:51:30 +09:00
Keith Busch	2da78092dd	block: Fix dev_t minor allocation lifetime Releases the dev_t minor when all references are closed to prevent another device from acquiring the same major/minor. Since the partition's release may be invoked from call_rcu's soft-irq context, the ext_dev_idr's mutex had to be replaced with a spinlock so as not so sleep. Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-03 15:01:02 -06:00
Robert Elliott	5676e7b6db	blk-mq: cleanup after blk_mq_init_rq_map failures In blk-mq.c blk_mq_alloc_tag_set, if: set->tags = kmalloc_node() succeeds, but one of the blk_mq_init_rq_map() calls fails, goto out_unwind; needs to free set->tags so the caller is not obligated to do so. None of the current callers (null_blk, virtio_blk, virtio_blk, or the forthcoming scsi-mq) do so. set->tags needs to be set to NULL after doing so, so other tag cleanup logic doesn't try to free a stale pointer later. Also set it to NULL in blk_mq_free_tag_set. Tested with error injection on the forthcoming scsi-mq + hpsa combination. Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-03 10:44:15 -06:00
Ming Lei	0738854939	blk-merge: fix blk_recount_segments QUEUE_FLAG_NO_SG_MERGE is set at default for blk-mq devices, so bio->bi_phys_segment computed may be bigger than queue_max_segments(q) for blk-mq devices, then drivers will fail to handle the case, for example, BUG_ON() in virtio_queue_rq() can be triggerd for virtio-blk: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146 This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE flag if the computed bio->bi_phys_segment is bigger than queue_max_segments(q), and the regression is caused by commit 05f1dd53152173(block: add queue flag for disabling SG merging). Reported-by: Kick In <pierre-andre.morey@canonical.com> Tested-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-02 10:25:12 -06:00
Jens Axboe	55872c5a3c	bsg: fix potential error pointer dereference Dan writes: block/bsg.c:327 bsg_map_hdr() error: 'next_rq' dereferencing possible ERR_PTR(). Fix this by setting next_rq to NULL, for the case where it can be != NULL but an error pointer. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-29 08:34:14 -06:00
Joe Lawrence	a492f07545	block,scsi: fixup blk_get_request dead queue scenarios The blk_get_request function may fail in low-memory conditions or during device removal (even if __GFP_WAIT is set). To distinguish between these errors, modify the blk_get_request call stack to return the appropriate ERR_PTR. Verify that all callers check the return status and consider IS_ERR instead of a simple NULL pointer check. For consistency, make a similar change to the blk_mq_alloc_request leg of blk_get_request. It may fail if the queue is dead, or the caller was unwilling to wait. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-28 10:03:46 -06:00
Toshiaki Makita	7b5af5cffc	cfq-iosched: Add comments on update timing of weight Explain that weight has to be updated on activation. This complements previous fix `e15693ef18` ("cfq-iosched: Fix wrong children_weight calculation"). Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-28 08:16:29 -06:00
Joe Lawrence	eb571eeade	block,scsi: verify return pointer from blk_get_request The blk-core dead queue checks introduce an error scenario to blk_get_request that returns NULL if the request queue has been shutdown. This affects the behavior for __GFP_WAIT callers, who should verify the return value before dereferencing. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-26 15:20:23 -06:00
Toshiaki Makita	e15693ef18	cfq-iosched: Fix wrong children_weight calculation cfq_group_service_tree_add() is applying new_weight at the beginning of the function via cfq_update_group_weight(). This actually allows weight to change between adding it to and subtracting it from children_weight, and triggers WARN_ON_ONCE() in cfq_group_service_tree_del(), or even causes oops by divide error during vfr calculation in cfq_group_service_tree_add(). The detailed scenario is as follows: 1. Create blkio cgroups X and Y as a child of X. Set X's weight to 500 and perform some I/O to apply new_weight. This X's I/O completes before starting Y's I/O. 2. Y starts I/O and cfq_group_service_tree_add() is called with Y. 3. cfq_group_service_tree_add() walks up the tree during children_weight calculation and adds parent X's weight (500) to children_weight of root. children_weight becomes 500. 4. Set X's weight to 1000. 5. X starts I/O and cfq_group_service_tree_add() is called with X. 6. cfq_group_service_tree_add() applies its new_weight (1000). 7. I/O of Y completes and cfq_group_service_tree_del() is called with Y. 8. I/O of X completes and cfq_group_service_tree_del() is called with X. 9. cfq_group_service_tree_del() subtracts X's weight (1000) from children_weight of root. children_weight becomes -500. This triggers WARN_ON_ONCE(). 10. Set X's weight to 500. 11. X starts I/O and cfq_group_service_tree_add() is called with X. 12. cfq_group_service_tree_add() applies its new_weight (500) and adds it to children_weight of root. children_weight becomes 0. Calcularion of vfr triggers oops by divide error. weight should be updated right before adding it to children_weight. Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-26 10:17:30 -06:00
Sabrina Dubroca	d19d744685	block: fix error handling in sg_io Before commit `2cada584b2` ("block: cleanup error handling in sg_io"), we had ret = 0 before entering the last big if block of sg_io. Since `2cada584b2`, ret = -EFAULT, which breaks hdparm: /dev/sda: setting Advanced Power Management level to 0xc8 (200) HDIO_DRIVE_CMD failed: Bad address APM_level = 128 Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Fixes: `2cada584b2` ("block: cleanup error handling in sg_io") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-26 08:20:01 -06:00
Tony Battersby	2ba136daa3	fix regression in SCSI_IOCTL_SEND_COMMAND blk_rq_set_block_pc() memsets rq->cmd to 0, so it should come immediately after blk_get_request() to avoid overwriting the user-supplied CDB. Also check for failure to allocate rq. Fixes: `f27b087b81` ("block: add blk_rq_set_block_pc()") Cc: <stable@vger.kernel.org> # 3.16.x Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-22 15:04:33 -05:00
Tony Battersby	6f4a16266f	scsi-mq: fix requests that use a separate CDB buffer This patch fixes code such as the following with scsi-mq enabled: rq = blk_get_request(...); blk_rq_set_block_pc(rq); rq->cmd = my_cmd_buffer; /* separate CDB buffer */ blk_execute_rq_nowait(...); Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for large CDBs only). Without this patch, scsi_mq_prep_fn() will set rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-22 15:04:31 -05:00
Christoph Hellwig	a57821cac6	block: support > 16 byte CDBs for SG_IO Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:42:01 -05:00
Christoph Hellwig	2cada584b2	block: cleanup error handling in sg_io Make sure we always clean up through the out label and just have a single place to put the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:42:01 -05:00
Tejun Heo	cddd5d1764	blk-mq: blk_mq_freeze_queue() should allow nesting While converting to percpu_ref for freezing, `add703fda9` ("blk-mq: use percpu_ref for mq usage count") incorrectly made blk_mq_freeze_queue() misbehave when freezing is nested due to percpu_ref_kill() being invoked on an already killed ref. Fix it by making blk_mq_freeze_queue() kill and kick the queue only for the outermost freeze attempt. All the nested ones can simply wait for the ref to reach zero. While at it, remove unnecessary @wake initialization from blk_mq_unfreeze_queue(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:51 -05:00
Jens Axboe	a68aafa5b2	blk-mq: correct a few wrong/bad comments Just grammar or spelling errors, nothing major. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:49 -05:00
Sagi Grimberg	16f408dc6b	block: Fix BUG_ON when pi errors occur When getting a pi error we get to bio_integrity_end_io with bi_remaining already decremented to 0 where we will eventually need to call bio_endio with restored original bio completion handler. Calling bio_endio invokes a BUG_ON(). We should call bio_endio_nodec instead, like what is done in bio_integrity_verify_fn. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:47 -05:00
Jens Axboe	274a5843ff	blk-mq: don't allow merges if turned off for the queue blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time, to determine whether it should merge IO or not. However, this could also be disabled by the admin, if merging is switched off through sysfs. So check the general queue state as well before attempting to merge IO. Reported-by: Rob Elliott <Elliott@hp.com> Tested-by: Rob Elliott <Elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:45 -05:00
Ming Lei	dd84008708	blk-mq: fix WARNING "percpu_ref_kill() called more than once!" Before doing queue release, the queue has been freezed already by blk_cleanup_queue(), so needn't to freeze queue for deleting tag set. This patch fixes the WARNING of "percpu_ref_kill() called more than once!" which is triggered during unloading block driver. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-15 12:38:20 -06:00
Linus Torvalds	ba368991f6	. Allow the thin target to paired with any size external origin; also allow thin snapshots to be larger than the external origin. . Add support for quickly loading a repetitive pattern into the dm-switch target. . Use per-bio data in the dm-crypt target instead of always using a mempool for each allocation. Required switching to kmalloc alignment for the bio slab. . Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag . Fix the dm-cache and dm-thin targets' export of the minimum_io_size to match the data block size -- this fixes an issue where mkfs.xfs would improperly infer raid striping was in place on the underlying storage. . Small cleanups in dm-io, dm-mpath and dm-cache -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJT64yEAAoJEMUj8QotnQNatjQH/2mqm8EtPuZas70zHVDzjMlE ZyV8xgHpU0MBmiBi+JhUBv9iKX4sVa+C25559WkKtxRVMnZmI1WDry4TagiqrhnK 9o/uvdWigJMR+uwahwe4UErEtKscOQJD30a8taN/suJ6Z2C7XJJRUZPsyL4a3Vov w+UIi7aYDEGp/2VQ8mvTTxjdF5x5km4wKsjBTs03uTrrkEJ+bIUndl2I1X+X4bsw kiWYOQwmcnD8GwYkSrthJYLsS3Hjur/J/My7KZwXc00ANLOexqHdKfRDwH8b36+m olKXv3swCd8vi+jJYEYzuW9213ACsSEGP7h8NFVZ/+2FeDsSzB/C7zjW9okIUIw= =y/3r -----END PGP SIGNATURE----- Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - Allow the thin target to paired with any size external origin; also allow thin snapshots to be larger than the external origin. - Add support for quickly loading a repetitive pattern into the dm-switch target. - Use per-bio data in the dm-crypt target instead of always using a mempool for each allocation. Required switching to kmalloc alignment for the bio slab. - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag - Fix the dm-cache and dm-thin targets' export of the minimum_io_size to match the data block size -- this fixes an issue where mkfs.xfs would improperly infer raid striping was in place on the underlying storage. - Small cleanups in dm-io, dm-mpath and dm-cache * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: propagate QUEUE_FLAG_NO_SG_MERGE dm switch: efficiently support repetitive patterns dm switch: factor out switch_region_table_read dm cache: set minimum_io_size to cache's data block size dm thin: set minimum_io_size to pool's data block size dm crypt: use per-bio data block: use kmalloc alignment for bio slab dm table: make dm_table_supports_discards static dm cache metadata: use dm-space-map-metadata.h defined size limits dm cache: fail migrations in the do_worker error path dm cache: simplify deferred set reference count increments dm thin: relax external origin size constraints dm thin: switch to an atomic_t for tracking pending new block preparations dm mpath: eliminate pg_ready() wrapper dm io: simplify dec_count and sync_io	2014-08-14 09:17:56 -06:00
Linus Torvalds	d429a3639c	Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: "Nothing out of the ordinary here, this pull request contains: - A big round of fixes for bcache from Kent Overstreet, Slava Pestov, and Surbhi Palande. No new features, just a lot of fixes. - The usual round of drbd updates from Andreas Gruenbacher, Lars Ellenberg, and Philipp Reisner. - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei has taken it one step further and added support for actually using more than one queue. - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to compliment the the default behavior of adding to the tail of the queue. From Douglas Gilbert" * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits) bcache: Drop unneeded blk_sync_queue() calls bcache: add mutex lock for bch_is_open bcache: Correct printing of btree_gc_max_duration_ms bcache: try to set b->parent properly bcache: fix memory corruption in init error path bcache: fix crash with incomplete cache set bcache: Fix more early shutdown bugs bcache: fix use-after-free in btree_gc_coalesce() bcache: Fix an infinite loop in journal replay bcache: fix crash in bcache_btree_node_alloc_fail tracepoint bcache: bcache_write tracepoint was crashing bcache: fix typo in bch_bkey_equal_header bcache: Allocate bounce buffers with GFP_NOWAIT bcache: Make sure to pass GFP_WAIT to mempool_alloc() bcache: fix uninterruptible sleep in writeback thread bcache: wait for buckets when allocating new btree root bcache: fix crash on shutdown in passthrough mode bcache: fix lockdep warnings on shutdown bcache allocator: send discards with correct size bcache: Fix to remove the rcu_sched stalls. ...	2014-08-14 09:10:21 -06:00
Linus Torvalds	4a319a490c	Merge branch 'for-3.17/core' of git://git.kernel.dk/linux-block Pull block core bits from Jens Axboe: "Small round this time, after the massive blk-mq dump for 3.16. This pull request contains: - Fixes for max_sectors overflow in ioctls from Akinoby Mita. - Partition off-by-one bug fix in aix partitions from Dan Carpenter. - Various small partition cleanups from Fabian Frederick. - Fix for the block integrity code sometimes returning the wrong vector count from Gu Zheng. - Cleanup an re-org of the blk-mq queue enter/exit percpu counters from Tejun. Dependent on the percpu pull for 3.17 (which was in the block tree too), that you have already pulled in. - A blkcg oops fix, also from Tejun" * 'for-3.17/core' of git://git.kernel.dk/linux-block: partitions: aix.c: off by one bug blkcg: don't call into policy draining if root_blkg is already gone Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment" bio: modify __bio_add_page() to accept pages that don't start a new segment block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX block/partitions/efi.c: kerneldoc fixing block/partitions/msdos.c: code clean-up block/partitions/amiga.c: replace nolevel printk by pr_err block/partitions/aix.c: replace count*size kzalloc by kcalloc bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload blk-mq: use percpu_ref for mq usage count blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue() blk-mq: decouble blk-mq freezing from generic bypassing block, blk-mq: draining can't be skipped even if bypass_depth was non-zero blk-mq: fix a memory ordering bug in blk_mq_queue_enter()	2014-08-14 09:07:02 -06:00
Dan Carpenter	d97a86c170	partitions: aix.c: off by one bug The lvip[] array has "state->limit" elements so the condition here should be >= instead of >. Fixes: `6ceea22bbb` ('partitions: add aix lvm partition support files') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Philippe De Muyter <phdm@macqel.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-05 13:13:24 -06:00
Linus Torvalds	47dfe4037e	Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Mostly changes to get the v2 interface ready. The core features are mostly ready now and I think it's reasonable to expect to drop the devel mask in one or two devel cycles at least for a subset of controllers. - cgroup added a controller dependency mechanism so that block cgroup can depend on memory cgroup. This will be used to finally support IO provisioning on the writeback traffic, which is currently being implemented. - The v2 interface now uses a separate table so that the interface files for the new interface are explicitly declared in one place. Each controller will explicitly review and add the files for the new interface. - cpuset is getting ready for the hierarchical behavior which is in the similar style with other controllers so that an ancestor's configuration change doesn't change the descendants' configurations irreversibly and processes aren't silently migrated when a CPU or node goes down. All the changes are to the new interface and no behavior changed for the multiple hierarchies" * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits) cpuset: fix the WARN_ON() in update_nodemasks_hier() cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup: distinguish the default and legacy hierarchies when handling cftypes cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes cgroup: split cgroup_base_files[] into cgroup_{dfl\|legacy}_base_files[] cpuset: export effective masks to userspace cpuset: allow writing offlined masks to cpuset.cpus/mems cpuset: enable onlined cpu/node in effective masks cpuset: refactor cpuset_hotplug_update_tasks() cpuset: make cs->{cpus, mems}_allowed as user-configured masks cpuset: apply cs->effective_{cpus,mems} cpuset: initialize top_cpuset's configured masks at mount cpuset: use effective cpumask to build sched domains cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty cpuset: update cs->effective_{cpus, mems} when config changes cpuset: update cpuset->effective_{cpus,mems} at hotplug cpuset: add cs->effective_cpus and cs->effective_mems cgroup: clean up sane_behavior handling ...	2014-08-04 10:11:28 -07:00
Mikulas Patocka	6a24148361	block: use kmalloc alignment for bio slab Various subsystems can ask the bio subsystem to create a bio slab cache with some free space before the bio. This free space can be used for any purpose. Device mapper uses this per-bio-data feature to place some target-specific and device-mapper specific data before the bio, so that the target-specific data doesn't have to be allocated separately. This per-bio-data mechanism is used in place of kmalloc, so we need the allocated slab to have the same memory alignment as memory allocated with kmalloc. Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN alignment when creating the slab cache. This is needed so that dm-crypt can use per-bio-data for encryption - the crypto subsystem assumes this data will have the same alignment as kmalloc'ed memory. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Jens Axboe <axboe@fb.com>	2014-08-01 12:30:34 -04:00
Tejun Heo	2a1b4cf233	blkcg: don't call into policy draining if root_blkg is already gone While a queue is being destroyed, all the blkgs are destroyed and its ->root_blkg pointer is set to NULL. If someone else starts to drain while the queue is in this state, the following oops happens. NULL pointer dereference at 0000000000000028 IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 PGD e4a1067 PUD b773067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched] CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000 RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001 R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450 R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28 FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0 Stack: ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80 ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58 ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450 Call Trace: [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60 [<ffffffff81427641>] __blk_drain_queue+0x71/0x180 [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0 [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120 [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50 [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40 [<ffffffff8142d476>] blk_release_queue+0x26/0xd0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff81427505>] blk_put_queue+0x15/0x20 [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0 [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0 [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20 [<ffffffff817930e2>] device_release+0x32/0xa0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff817934d7>] put_device+0x17/0x20 [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0 [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40 [<ffffffff817d1257>] sdev_store_delete+0x27/0x30 [<ffffffff81792ca8>] dev_attr_store+0x18/0x30 [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50 [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0 [<ffffffff811f69bd>] SyS_write+0x4d/0xc0 [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b `776687bce4` ("block, blk-mq: draining can't be skipped even if bypass_depth was non-zero") made it easier to trigger this bug by making blk_queue_bypass_start() drain even when it loses the first bypass test to blk_cleanup_queue(); however, the bug has always been there even before the commit as blk_queue_bypass_start() could race against queue destruction, win the initial bypass test but perform the actual draining after blk_cleanup_queue() already destroyed all blkgs. Fix it by skippping calling into policy draining if all the blkgs are already gone. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Shirish Pargaonkar <spargaonkar@suse.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jet Chen <jet.chen@intel.com> Cc: stable@vger.kernel.org Tested-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-16 09:52:03 +02:00
Tejun Heo	2cf669a58d	cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() Currently, cftypes added by cgroup_add_cftypes() are used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype addition functions and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch adds cgroup_add_legacy_cftypes() which currently is a simple wrapper around cgroup_add_cftypes() and replaces all cgroup_add_cftypes() usages with it. While at it, this patch drops a completely spurious return from __hugetlb_cgroup_file_init(). This patch doesn't introduce any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	5577964e64	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Jens Axboe	26a337944e	Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment" This reverts commit `254c4407cb`. It causes crashes with cryptsetup, even after a few iterations and updates. Drop it for now.	2014-07-14 22:04:47 +02:00
Mikulas Patocka	3b3a1814d1	block: provide compat ioctl for BLKZEROOUT This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer to two uint64_t values, so there is no need to translate it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # 3.7+ Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-14 12:27:20 +02:00
Tejun Heo	0b462c89e3	blkcg: don't call into policy draining if root_blkg is already gone While a queue is being destroyed, all the blkgs are destroyed and its ->root_blkg pointer is set to NULL. If someone else starts to drain while the queue is in this state, the following oops happens. NULL pointer dereference at 0000000000000028 IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 PGD e4a1067 PUD b773067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched] CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000 RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001 R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450 R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28 FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0 Stack: ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80 ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58 ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450 Call Trace: [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60 [<ffffffff81427641>] __blk_drain_queue+0x71/0x180 [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0 [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120 [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50 [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40 [<ffffffff8142d476>] blk_release_queue+0x26/0xd0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff81427505>] blk_put_queue+0x15/0x20 [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0 [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0 [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20 [<ffffffff817930e2>] device_release+0x32/0xa0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff817934d7>] put_device+0x17/0x20 [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0 [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40 [<ffffffff817d1257>] sdev_store_delete+0x27/0x30 [<ffffffff81792ca8>] dev_attr_store+0x18/0x30 [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50 [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0 [<ffffffff811f69bd>] SyS_write+0x4d/0xc0 [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b `776687bce4` ("block, blk-mq: draining can't be skipped even if bypass_depth was non-zero") made it easier to trigger this bug by making blk_queue_bypass_start() drain even when it loses the first bypass test to blk_cleanup_queue(); however, the bug has always been there even before the commit as blk_queue_bypass_start() could race against queue destruction, win the initial bypass test but perform the actual draining after blk_cleanup_queue() already destroyed all blkgs. Fix it by skippping calling into policy draining if all the blkgs are already gone. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Shirish Pargaonkar <spargaonkar@suse.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jet Chen <jet.chen@intel.com> Cc: stable@vger.kernel.org Tested-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-12 17:55:10 +02:00
Tejun Heo	aa6ec29bee	cgroup: remove sane_behavior support on non-default hierarchies sane_behavior has been used as a development vehicle for the default unified hierarchy. Now that the default hierarchy is in place, the flag became redundant and confusing as its usage is allowed on all hierarchies. There are gonna be either the default hierarchy or legacy ones. Let's make that clear by removing sane_behavior support on non-default hierarchies. This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of cgroup_on_dfl() with sane_behavior specific part dropped. On the default and legacy hierarchies w/o sane_behavior, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>	2014-07-09 10:08:08 -04:00
Tejun Heo	1ced953b17	blkcg, memcg: make blkcg depend on memcg on the default hierarchy Currently, the blkio subsystem attributes all of writeback IOs to the root. One of the issues is that there's no way to tell who originated a writeback IO from block layer. Those IOs are usually issued asynchronously from a task which didn't have anything to do with actually generating the dirty pages. The memory subsystem, when enabled, already keeps track of the ownership of each dirty page and it's desirable for blkio to piggyback instead of adding its own per-page tag. cgroup now has a mechanism to express such dependency - cgroup_subsys->depends_on. This patch declares that blkcg depends on memcg so that memcg is enabled automatically on the default hierarchy when available. Future changes will make blkcg map the memcg tag to find out the cgroup to blame for writeback IOs. As this means that a memcg may be made invisible, this patch also implements css_reset() for memcg which resets its basic configurations. This implementation will probably need to be expanded to cover other states which are used in the default hierarchy. v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid build failure. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk>	2014-07-08 18:02:57 -04:00
Christoph Hellwig	d45b3279a5	block: don't assume last put of shared tags is for the host There is no inherent reason why the last put of a tag structure must be the one for the Scsi_Host, as device model objects can be held for arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single funtion that just release a references and get rid of the BUG() when the host reference wasn't the last. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-08 12:25:28 +02:00
Maurizio Lombardi	254c4407cb	bio: modify __bio_add_page() to accept pages that don't start a new segment The original behaviour is to refuse to add a new page if the maximum number of segments has been reached, regardless of the fact the page we are going to add can be merged into the last segment or not. Unfortunately, when the system runs under heavy memory fragmentation conditions, a driver may try to add multiple pages to the last segment. The original code won't accept them and EBUSY will be reported to userspace. This patch modifies the function so it refuses to add a page only in case the latter starts a new segment and the maximum number of segments has already been reached. The bug can be easily reproduced with the st driver: 1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16 2) modprobe st buffer_kbs=1024 3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10 dd: error writing `/dev/st0': Device or resource busy [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments] Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Tested-by: Dongsu Park <dongsu.park@profitbricks.com> Tested-by: Jet Chen <jet.chen@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:55:15 -06:00
Douglas Gilbert	d15156138d	block SG_IO: add SG_FLAG_Q_AT_HEAD flag After the SG_IO ioctl was copied into the block layer and later into the bsg driver, subtle differences emerged. One difference is the way injected commands are queued through the block layer (i.e. this is not SCSI device queueing nor SATA NCQ). Summarizing: - SG_IO on block layer device: blk_exec*(at_head=false) - sg device SG_IO: at_head=true - bsg device SG_IO: at_head=true Some time ago Boaz Harrosh introduced a sg v4 flag called BSG_FLAG_Q_AT_TAIL to override the bsg driver default. A recent patch titled: "sg: add SG_FLAG_Q_AT_TAIL flag" allowed the sg driver default to be overridden. This patch allows a SG_IO ioctl sent to a block layer device to have its default overridden. ChangeLog: - introduce SG_FLAG_Q_AT_HEAD flag in sg.h to cause commands that are injected via a block layer device SG_IO ioctl to set at_head=true - make comments clearer about queueing in sg.h since the header is used both by the sg device and block layer device implementations of the SG_IO ioctl. - introduce BSG_FLAG_Q_AT_HEAD in bsg.h for compatibility (it does nothing) and update comments. Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:48:05 -06:00
Akinobu Mita	9b4231bf99	block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved buffer in bytes as int type. The value needs to be capped at the request queue's max_sectors. But integer overflow is not correctly handled in the calculation when converting max_sectors from sectors to bytes. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:43:09 -06:00
Akinobu Mita	63f2649659	block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX BLKSECTGET ioctl loads the request queue's max_sectors as unsigned short value to the argument pointer. So if the max_sector is greater than USHRT_MAX, the upper 16 bits of that is just discarded. In such case, USHRT_MAX is more preferable than the lower 16 bits of max_sectors. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:43:07 -06:00
Fabian Frederick	16e1556526	block/partitions/efi.c: kerneldoc fixing Adding function documentation and fixing kerneldoc warnings ('field: description' uniformization). Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:05 -06:00
Fabian Frederick	dce14c239a	block/partitions/msdos.c: code clean-up checkpatch fixing: WARNING: Missing a blank line after declarations WARNING: space prohibited between function name and open parenthesis '(' ERROR: spaces required around that '<' (ctx:VxV) Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:03 -06:00
Fabian Frederick	600ffc5ead	block/partitions/amiga.c: replace nolevel printk by pr_err Also add no prefix pr_fmt to avoid any future default format update Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:02 -06:00
Fabian Frederick	472d5e2af2	block/partitions/aix.c: replace countsize kzalloc by kcalloc kcalloc manages countsizeof overflow. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:00 -06:00
Gu Zheng	cbcd1054a1	bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload Commit `08778795` ("block: Fix nr_vecs for inline integrity vectors") from Martin introduces the function bip_integrity_vecs(get the useful vectors) to fix the issue about nr_vecs for inline integrity vectors that reported by David Milburn. But it seems that bip_integrity_vecs() will return the wrong number if the bio is not based on any bio_set for some reason(bio->bi_pool == NULL), because in that case, the bip_inline_vecs[0] is malloced directly. So here we add the bip_max_vcnt to record the count of vector slots, and cleanup the function bip_integrity_vecs(). Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:36:47 -06:00
Tejun Heo	add703fda9	blk-mq: use percpu_ref for mq usage count Currently, blk-mq uses a percpu_counter to keep track of how many usages are in flight. The percpu_counter is drained while freezing to ensure that no usage is left in-flight after freezing is complete. blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this per-cpu gating mechanism. This type of code has relatively high chance of subtle bugs which are extremely difficult to trigger and it's way too hairy to be open coded in blk-mq. percpu_ref can serve the same purpose after the recent changes. This patch replaces the open-coded per-cpu usage counting and draining mechanism with percpu_ref. blk_mq_queue_enter() performs tryget_live on the ref and exit() performs put. blk_mq_freeze_queue() kills the ref and waits until the reference count reaches zero. blk_mq_unfreeze_queue() revives the ref and wakes up the waiters. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:34:38 -06:00
Tejun Heo	72d6f02a8d	blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue() Keeping __blk_mq_drain_queue() as a separate function doesn't buy us anything and it's gonna be further simplified. Let's flatten it into its caller. This patch doesn't make any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:33:02 -06:00
Tejun Heo	780db2071a	blk-mq: decouble blk-mq freezing from generic bypassing blk_mq freezing is entangled with generic bypassing which bypasses blkcg and io scheduler and lets IO requests fall through the block layer to the drivers in FIFO order. This allows forward progress on IOs with the advanced features disabled so that those features can be configured or altered without worrying about stalling IO which may lead to deadlock through memory allocation. However, generic bypassing doesn't quite fit blk-mq. blk-mq currently doesn't make use of blkcg or ioscheds and it maps bypssing to freezing, which blocks request processing and drains all the in-flight ones. This causes problems as bypassing assumes that request processing is online. blk-mq works around this by conditionally allowing request processing for the problem case - during queue initialization. Another weirdity is that except for during queue cleanup, bypassing started on the generic side prevents blk-mq from processing new requests but doesn't drain the in-flight ones. This shouldn't break anything but again highlights that something isn't quite right here. The root cause is conflating blk-mq freezing and generic bypassing which are two different mechanisms. The only intersecting purpose that they serve is during queue cleanup. Let's properly separate blk-mq freezing from generic bypassing and simply use it where necessary. * request_queue->mq_freeze_depth is added and blk_mq_[un]freeze_queue() now operate on this counter instead of ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added but the counter is tested directly. This will be further updated by later changes. * blk_mq_drain_queue() is dropped and "__" prefix is dropped from blk_mq_freeze_queue(). Queue cleanup path now calls blk_mq_freeze_queue() directly. * blk_queue_enter()'s fast path condition is simplified to simply check @q->mq_freeze_depth. Previously, the condition was !blk_queue_dying(q) && (!blk_queue_bypass(q) \|\| !blk_queue_init_done(q)) mq_freeze_depth is incremented right after dying is set and blk_queue_init_done() exception isn't necessary as blk-mq doesn't start frozen, which only leaves the blk_queue_bypass() test which can be replaced by @q->mq_freeze_depth test. This change simplifies the code and reduces confusion in the area. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:31:13 -06:00
Tejun Heo	776687bce4	block, blk-mq: draining can't be skipped even if bypass_depth was non-zero Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue() skip queue draining if bypass_depth was already above zero. The assumption is that the one which bumped the bypass_depth should have performed draining already; however, there's nothing which prevents a new instance of bypassing/freezing from starting before the previous one finishes draining. The current code may allow the later bypassing/freezing instances to complete while there still are in-flight requests which haven't finished draining. Fix it by draining regardless of bypass_depth. We still skip draining from blk_queue_bypass_start() while the queue is initializing to avoid introducing excessive delays during boot. INIT_DONE setting is moved above the initial blk_queue_bypass_end() so that bypassing attempts can't slip inbetween. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:29:17 -06:00
Tejun Heo	531ed6261e	blk-mq: fix a memory ordering bug in blk_mq_queue_enter() blk-mq uses a percpu_counter to keep track of how many usages are in flight. The percpu_counter is drained while freezing to ensure that no usage is left in-flight after freezing is complete. blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this per-cpu gating mechanism; unfortunately, it contains a subtle bug - smp_wmb() in blk_mq_queue_enter() doesn't prevent prevent the cpu from fetching @q->bypass_depth before incrementing @q->mq_usage_counter and if freezing happens inbetween the caller can slip through and freezing can be complete while there are active users. Use smp_mb() instead so that bypass_depth and mq_usage_counter modifications and tests are properly interlocked. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:27:06 -06:00
Linus Torvalds	3493860c76	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A small collection of fixes/changes for the current series. This contains: - Removal of dead code from Gu Zheng. - Revert of two bad fixes that went in earlier in this round, marking things as __init that were not purely used from init. - A fix for blk_mq_start_hw_queue() using the __blk_mq_run_hw_queue(), which could place us wrongly. Make it use the non __ variant, which handles cases where we are called from the wrong CPU set. From me. - A fix for drbd, which allocates discard requests without room for the SCSI payload. From Lars Ellenberg. - A fix for user-after-free in the blkcg code from Tejun. - Addition of limiting gaps in SG lists, if the hardware needs it. This is the last pre-req patch for blk-mq to enable the full NVMe conversion. Could wait until 3.17, but it's simple enough so would be nice to have everything we need for the NVMe port in the 3.17 release. From me" * 'for-linus' of git://git.kernel.dk/linux-block: drbd: fix NULL pointer deref in blk_add_request_payload blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue() block: add support for limiting gaps in SG lists bio: remove unused macro bip_vec_idx() Revert "block: add __init to elv_register" Revert "block: add __init to blkcg_policy_register" blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t floppy: format block0 read error message properly	2014-06-26 13:06:13 -07:00
Jens Axboe	0ffbce80c2	blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue() Currently it calls __blk_mq_run_hw_queue(), which depends on the CPU placement being correct. This means it's not possible to call blk_mq_start_hw_queues(q) from a context that is correct for all queues, leading to triggering the WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)); in __blk_mq_run_hw_queue(). Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-25 08:22:34 -06:00
Jens Axboe	66cb45aa41	block: add support for limiting gaps in SG lists Another restriction inherited for NVMe - those devices don't support SG lists that have "gaps" in them. Gaps refers to cases where the previous SG entry doesn't end on a page boundary. For NVMe, all SG entries must start at offset 0 (except the first) and end on a page boundary (except the last). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-24 16:22:24 -06:00
Jens Axboe	e567bf7112	Revert "block: add __init to elv_register" This reverts commit `b5097e956a`. The original commit is buggy, we do use the registration functions at runtime, for instance when loading IO schedulers through sysfs. Reported-by: Damien Wyart <damien.wyart@gmail.com>	2014-06-22 16:34:11 -06:00
Jens Axboe	d5bf02914e	Revert "block: add __init to blkcg_policy_register" This reverts commit `a2d445d440`. The original commit is buggy, we do use the registration functions at runtime for modular builds.	2014-06-22 16:34:11 -06:00
Tejun Heo	a5049a8ae3	blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Hello, So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch? Jens, the original thread can be read at http://thread.gmane.org/gmane.linux.kernel/1720729 The fix converts blkg->refcnt from int to atomic_t. It does some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a perpcu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway. Thanks. ------- 8< ------- __blkg_release_rcu() may be invoked after the associated request_queue is released with a RCU grace period inbetween. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function. Unfortunately, while trying to fix a different issue, `2a4fd070ee` ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed. general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-22 16:30:52 -06:00
Linus Torvalds	f1d702487b	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A smaller collection of fixes for the block core that would be nice to have in -rc2. This pull request contains: - Fixes for races in the wait/wakeup logic used in blk-mq from Alexander. No issues have been observed, but it is definitely a bit flakey currently. Alternatively, we may drop the cyclic wakeups going forward, but that needs more testing. - Some cleanups from Christoph. - Fix for an oops in null_blk if queue_mode=1 and softirq completions are used. From me. - A fix for a regression caused by the chunk size setting. It inadvertently used max_hw_sectors instead of max_sectors, which is incorrect, and causes hangs on btrfs multi-disk setups (where hw sectors apparently isn't set). From me. - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a recent addition as well, but it actually breaks blk-mq which relies on strict scheduling. If the workqueue power_efficient mode is turned on, this breaks blk-mq. From Matias. - null_blk module parameter description fix from Mike" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: bitmap tag: fix races in bt_get() function blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt blk-mq: bitmap tag: fix races on shared ::wake_index fields block: blk_max_size_offset() should check ->max_sectors null_blk: fix softirq completions for queue_mode == 1 blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue blk-mq: properly drain stopped queues block: remove WQ_POWER_EFFICIENT from kblockd null_blk: fix name and description of 'queue_mode' module parameter block: remove elv_abort_queue and blk_abort_flushes	2014-06-19 17:56:43 -10:00
Alexander Gordeev	86fb5c56cf	blk-mq: bitmap tag: fix races in bt_get() function This update fixes few issues in bt_get() function: - list_empty(&wait.task_list) check is not protected; - was_empty check is always true which results in every thread entering the loop resets bt_wait_state::wait_cnt counter rather than every bt->wake_cnt'th thread; - 'bt_wait_state::wait_cnt' counter update is redundant, since it also gets reset in bt_clear_tag() function; Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <tom.leiming@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-17 22:13:08 -07:00
Alexander Gordeev	2971c35f35	blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt This piece of code in bt_clear_tag() function is racy: bs = bt_wake_ptr(bt); if (bs && atomic_dec_and_test(&bs->wait_cnt)) { atomic_set(&bs->wait_cnt, bt->wake_cnt); wake_up(&bs->wait); } Since nothing prevents bt_wake_ptr() from returning the very same 'bs' address on multiple CPUs, the following scenario is possible: CPU1 CPU2 ---- ---- 0. bs = bt_wake_ptr(bt); bs = bt_wake_ptr(bt); 1. atomic_dec_and_test(&bs->wait_cnt) 2. atomic_dec_and_test(&bs->wait_cnt) 3. atomic_set(&bs->wait_cnt, bt->wake_cnt); If the decrement in [1] yields zero then for some amount of time the decrement in [2] results in a negative/overflow value, which is not expected. The follow-up assignment in [3] overwrites the invalid value with the batch value (and likely prevents the issue from being severe) which is still incorrect and should be a lesser. Cc: Ming Lei <tom.leiming@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-17 22:13:05 -07:00
Alexander Gordeev	8537b12034	blk-mq: bitmap tag: fix races on shared ::wake_index fields Fix racy updates of shared blk_mq_bitmap_tags::wake_index and blk_mq_hw_ctx::wake_index fields. Cc: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-17 22:12:35 -07:00
Linus Torvalds	b55b390202	Merge git://git.infradead.org/users/willy/linux-nvme Pull NVMe update from Matthew Wilcox: "Mostly bugfixes again for the NVMe driver. I'd like to call out the exported tracepoint in the block layer; I believe Keith has cleared this with Jens. We've had a few reports from people who're really pounding on NVMe devices at scale, hence the timeout changes (and new module parameters), hotplug cpu deadlock, tracepoints, and minor performance tweaks" [ Jens hadn't seen that tracepoint thing, but is ok with it - it will end up going away when mq conversion happens ] * git://git.infradead.org/users/willy/linux-nvme: (22 commits) NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. NVMe: Use Log Page constants in SCSI emulation NVMe: Define Log Page constants NVMe: Fix hot cpu notification dead lock NVMe: Rename io_timeout to nvme_io_timeout NVMe: Use last bytes of f/w rev SCSI Inquiry NVMe: Adhere to request queue block accounting enable/disable NVMe: Fix nvme get/put queue semantics NVMe: Delete NVME_GET_FEAT_TEMP_THRESH NVMe: Make admin timeout a module parameter NVMe: Make iod bio timeout a parameter NVMe: Prevent possible NULL pointer dereference NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) NVMe: Update data structures for NVMe 1.2 NVMe: Enable BUILD_BUG_ON checks NVMe: Update namespace and controller identify structures to the 1.1a spec NVMe: Flush with data support NVMe: Configure support for block flush NVMe: Add tracepoints NVMe: Protect against badly formatted CQEs ...	2014-06-15 15:58:03 -10:00
Christoph Hellwig	95ed068165	blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-13 12:17:40 -06:00
Christoph Hellwig	8f5280f4ee	blk-mq: properly drain stopped queues If we need to drain a queue we need to run all queues, even if they are marked stopped to make sure the driver has a chance to error out on all queued requests. This fixes surprise removal with scsi-mq. Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-13 12:17:38 -06:00
Matias Bjørling	28747fcd22	block: remove WQ_POWER_EFFICIENT from kblockd blk-mq issues async requests through kblockd. To issue a work request on a specific CPU, kblockd_schedule_delayed_work_on is used. However, the specific CPU choice may not be honored, if the power_efficient option for workqueues is set. blk-mq requires that we have strict per-cpu scheduling, so it wont work properly if kblockd is marked POWER_EFFICIENT and power_efficient is set. Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior. This essentially reverts part of commit `695588f945`, which added the WQ_POWER_EFFICIENT marker to kblockd. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-11 15:53:39 -06:00
Christoph Hellwig	2940474af7	block: remove elv_abort_queue and blk_abort_flushes elv_abort_queue has no callers, and blk_abort_flushes is only called by elv_abort_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-11 15:31:21 -06:00
Linus Torvalds	23d4ed53b7	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "Final small batch of fixes to be included before -rc1. Some general cleanups in here as well, but some of the blk-mq fixes we need for the NVMe conversion and/or scsi-mq. The pull request contains: - Support for not merging across a specified "chunk size", if set by the driver. Some NVMe devices perform poorly for IO that crosses such a chunk, so we need to support it generically as part of request merging avoid having to do complicated split logic. From me. - Bump max tag depth to 10Ki tags. Some scsi devices have a huge shared tag space. Before we failed with EINVAL if a too large tag depth was specified, now we truncate it and pass back the actual value. From me. - Various blk-mq rq init fixes from me and others. - A fix for enter on a dying queue for blk-mq from Keith. This is needed to prevent oopsing on hot device removal. - Fixup for blk-mq timer addition from Ming Lei. - Small round of performance fixes for mtip32xx from Sam Bradshaw. - Minor stack leak fix from Rickard Strandqvist. - Two __init annotations from Fabian Frederick" * 'for-linus' of git://git.kernel.dk/linux-block: block: add __init to blkcg_policy_register block: add __init to elv_register block: ensure that bio_add_page() always accepts a page for an empty bio blk-mq: add timer in blk_mq_start_request blk-mq: always initialize request->start_time block: blk-exec.c: Cleaning up local variable address returnd mtip32xx: minor performance enhancements blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init() blk-mq: don't allow queue entering for a dying queue blk-mq: bump max tag depth to 10K tags block: add blk_rq_set_block_pc() block: add notion of a chunk size for request merging	2014-06-11 08:41:17 -07:00
Fabian Frederick	a2d445d440	block: add __init to blkcg_policy_register blkcg_policy_register is only called by __init functions: __init cfq_init __init throtl_init Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-10 13:13:12 -06:00
Fabian Frederick	b5097e956a	block: add __init to elv_register elv_register is only called by elevator init functions: __init cfq_init __init deadline_init __init noop_init Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-10 13:13:11 -06:00
Jens Axboe	58a4915ad2	block: ensure that bio_add_page() always accepts a page for an empty bio With commit `762380ad93` added support for chunk sizes and no merging across them, it broke the rule of always allowing adding of a single page to an empty bio. So relax the restriction a bit to allow for that, similarly to what we have always done. This fixes a crash with mkfs.xfs and 512b sector sizes on NVMe. Reported-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-10 12:53:56 -06:00
Linus Torvalds	14208b0ec5	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...	2014-06-09 15:03:33 -07:00
Ming Lei	2b8393b43e	blk-mq: add timer in blk_mq_start_request This way will become consistent with non-mq case, also avoid to update rq->deadline twice for mq. The comment said: "We do this early, to ensure we are on the right CPU.", but no percpu stuff is used in blk_add_timer(), so it isn't necessary. Even when inserting from plug list, there is no such guarantee at all. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-09 10:20:06 -06:00
Jens Axboe	3ee3237239	blk-mq: always initialize request->start_time The blk-mq core only initializes this if io stats are enabled, since blk-mq only reads the field in that case. But drivers could potentially use it internally, so ensure that we always set it to the current time when the request is allocated. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-09 09:36:53 -06:00
Rickard Strandqvist	de83953f9d	block: blk-exec.c: Cleaning up local variable address returnd Address of local variable assigned to a function parameter This was partly found using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-08 19:51:31 -06:00
Mitchel Humpherys	b1de0d139c	mm: convert some level-less printks to pr_* printk is meant to be used with an associated log level. There are some instances of printk scattered around the mm code where the log level is missing. Add a log level and adhere to suggestions by scripts/checkpatch.pl by moving to the pr_* macros. Also add the typical pr_fmt definition so that print statements can be easily traced back to the modules where they occur, correlated one with another, etc. This will require the removal of some (now redundant) prefixes on a few print statements. Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-06 16:08:18 -07:00
Jens Axboe	f6be4fb4bc	blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init() It'll be used in blk_mq_start_request() to set a potential timeout for the request, so clear it to zero at alloc time to ensure that we know if someone has set it or not. Fixes random early timeouts on NVMe testing. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 11:05:25 -06:00
Keith Busch	3b632cf0ea	blk-mq: don't allow queue entering for a dying queue If the queue is going away, don't let new allocs or queueing happen on it. Go through the normal wait process, and exit with ENODEV in that case. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 10:40:03 -06:00
Jens Axboe	a4391c6465	blk-mq: bump max tag depth to 10K tags For some scsi-mq cases, the tag map can be huge. So increase the max number of tags we support. Additionally, don't fail with EINVAL if a user requests too many tags. Warn that the tag depth has been adjusted down, and store the new value inside the tag_set passed in. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 08:04:46 -06:00
Jens Axboe	f27b087b81	block: add blk_rq_set_block_pc() With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC, and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly. Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 07:57:37 -06:00
Jens Axboe	762380ad93	block: add notion of a chunk size for request merging Some drivers have different limits on what size a request should optimally be, depending on the offset of the request. Similar to dividing a device into chunks. Add a setting that allows the driver to inform the block layer of such a chunk size. The block layer will then prevent merging across the chunks. This is needed to optimally support NVMe with a non-zero stripe size. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-05 13:38:39 -06:00
Ming Lei	14b83e172f	block: mq flush: clear flush_rq's tag in flush_end_io() blk_mq_tag_to_rq() needs to be able to tell if it should return the original request, or the flush request if we are doing a flush sequence. Clear the flush tag when IO completes for a flush, since that is what we are comparing against. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-04 10:40:16 -06:00
Jens Axboe	0e62f51f87	blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameter We currently pass in the hardware queue, and get the tags from there. But from scsi-mq, with a shared tag space, it's a lot more convenient to pass in the blk_mq_tags instead as the hardware queue isn't always directly available. So instead of having to re-map to a given hardware queue from rq->mq_ctx, just pass in the tags structure. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-04 10:23:49 -06:00
Jens Axboe	f899fed442	blk-mq: fix regression from commit `624dbe4754` When the code was collapsed to avoid duplication, the recent patch for ensuring that a queue is idled before free was dropped, which was added by commit `19c5d84f14`. Add back the blk_mq_tag_idle(), to ensure we don't leak a reference to an active queue when it is freed. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-04 09:11:53 -06:00
Jens Axboe	ff87bcec19	blk-mq: handle NULL req return from blk_map_request in single queue mode blk_mq_map_request() can return NULL if we fail entering the queue (dying, or removed), in which case it has already ended IO on the bio. So nothing more to do, except just return. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:39 -06:00
Ming Lei	e6cdb0929f	blk-mq: fix sparse warning on missed __percpu annotation 'struct blk_mq_ctx' is __percpu, so add the annotation and fix the sparse warning reported from Fengguang: [block:for-linus 2/3] block/blk-mq.h:75:16: sparse: incorrect type in initializer (different address spaces) Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:39 -06:00
Ming Lei	cb96a42cc1	blk-mq: fix schedule from atomic context blk_mq_put_ctx() has to be called before io_schedule() in bt_get(). This patch fixes the problem by taking similar approach from percpu_ida allocation for the situation. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:39 -06:00
Ming Lei	1aecfe4887	blk-mq: move blk_mq_get_ctx/blk_mq_put_ctx to mq private header The blk-mq tag code need these helpers. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:38 -06:00
Linus Torvalds	776edb5931	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__() arch,doc: Convert smp_mb__() arch,xtensa: Convert smp_mb__() arch,x86: Convert smp_mb__() arch,tile: Convert smp_mb__() arch,sparc: Convert smp_mb__() arch,sh: Convert smp_mb__() arch,score: Convert smp_mb__() arch,s390: Convert smp_mb__() arch,powerpc: Convert smp_mb__() arch,parisc: Convert smp_mb__() arch,openrisc: Convert smp_mb__() arch,mn10300: Convert smp_mb__() arch,mips: Convert smp_mb__() arch,metag: Convert smp_mb__() arch,m68k: Convert smp_mb__() arch,m32r: Convert smp_mb__() arch,ia64: Convert smp_mb__() ...	2014-06-03 12:57:53 -07:00
Linus Torvalds	80081ec309	Merge branch 'for-3.16/drivers' of git://git.kernel.dk/linux-block into next Pull block driver changes from Jens Axboe: "Now that the core bits are in, here's the pull request for the driver related changes for 3.16. Nothing out of the ordinary here, mostly business as usual. There are a few pulls of for-3.16/core into this branch, which were done when the blk-mq was modified after the mtip32xx conversion was put in. The pull request contains: - skd and cciss converted to use pci_enable_msix_exact(). From Alexander Gordeev. - A few mtip32xx fixes from Asai @ Micron. - The conversion of mtip32xx from make_request_fn to blk-mq, and a later small fix for that conversion on quiescing for non-queued IO. From me. - A fix for bsg to use an exported function to check whether this driver is request based or not. Needed updating for blk-mq, which is request based, but does not have a request_fn hook. From me. - Small floppy bug fix from Jiri. - A series of cleanups for the cdrom uniform layer from Joe Perches. Gets rid of various old ugly macros, making the code conform more to the modern coding style. - A series of patches for drbd from the drbd crew (Lars Ellenberg and Philipp Reisner). - A use-after-free fix for null_blk from Ming Lei. - Also from Ming Lei is a performance patch for virtio-blk, which can net us a 3x win on kvm platforms where world notification is expensive. - Ming Lei also fixed a stall issue in virtio-blk, due to a race between queue start/stop and resource limits. - A small batch of fixes for xen-blk{back,front} from Olaf Hering and Valentin Priescu" * 'for-3.16/drivers' of git://git.kernel.dk/linux-block: (54 commits) block: virtio_blk: don't hold spin lock during world switch xen-blkback: defer freeing blkif to avoid blocking xenwatch xen blkif.h: fix comment typo in discard-alignment xen/blkback: disable discard feature if requested by toolstack xen-blkfront: remove type check from blkfront_setup_discard floppy: do not corrupt bio.bi_flags when reading block 0 mtip32xx: move error handling to service thread virtio_blk: fix race between start and stop queue mtip32xx: stop block hardware queues before quiescing IO mtip32xx: blk_mq_init_queue() returns an ERR_PTR mtip32xx: convert to use blk-mq cdrom: Remove unnecessary prototype for cdrom_get_disc_info cdrom: Remove unnecessary prototype for cdrom_mrw_exit cdrom: Remove cdrom_count_tracks prototype cdrom: Remove cdrom_get_next_writeable prototype cdrom: Remove cdrom_get_last_written prototype cdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype cdrom: Remove unnecessary sanitize_format prototype cdrom: Remove unnecessary check_for_audio_disc prototype cdrom: Remove prototype for open_for_data ...	2014-06-02 13:57:01 -07:00
Linus Torvalds	681a289548	Merge branch 'for-3.16/core' of git://git.kernel.dk/linux-block into next Pull block core updates from Jens Axboe: "It's a big(ish) round this time, lots of development effort has gone into blk-mq in the last 3 months. Generally we're heading to where 3.16 will be a feature complete and performant blk-mq. scsi-mq is progressing nicely and will hopefully be in 3.17. A nvme port is in progress, and the Micron pci-e flash driver, mtip32xx, is converted and will be sent in with the driver pull request for 3.16. This pull request contains: - Lots of prep and support patches for scsi-mq have been integrated. All from Christoph. - API and code cleanups for blk-mq from Christoph. - Lots of good corner case and error handling cleanup fixes for blk-mq from Ming Lei. - A flew of blk-mq updates from me: * Provide strict mappings so that the driver can rely on the CPU to queue mapping. This enables optimizations in the driver. * Provided a bitmap tagging instead of percpu_ida, which never really worked well for blk-mq. percpu_ida relies on the fact that we have a lot more tags available than we really need, it fails miserably for cases where we exhaust (or are close to exhausting) the tag space. * Provide sane support for shared tag maps, as utilized by scsi-mq * Various fixes for IO timeouts. * API cleanups, and lots of perf tweaks and optimizations. - Remove 'buffer' from struct request. This is ancient code, from when requests were always virtually mapped. Kill it, to reclaim some space in struct request. From me. - Remove 'magic' from blk_plug. Since we store these on the stack and since we've never caught any actual bugs with this, lets just get rid of it. From me. - Only call part_in_flight() once for IO completion, as includes two atomic reads. Hopefully we'll get a better implementation soon, as the part IO stats are now one of the more expensive parts of doing IO on blk-mq. From me. - File migration of block code from {mm,fs}/ to block/. This includes bio.c, bio-integrity.c, bounce.c, and ioprio.c. From me, from a discussion on lkml. That should describe the meat of the pull request. Also has various little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong, Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam Bradshaw" * 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits) blk-mq: push IPI or local end_io decision to __blk_mq_complete_request() blk-mq: remember to start timeout handler for direct queue block: ensure that the timer is always added blk-mq: blk_mq_unregister_hctx() can be static blk-mq: make the sysfs mq/ layout reflect current mappings blk-mq: blk_mq_tag_to_rq should handle flush request block: remove dead code in scsi_ioctl:blk_verify_command blk-mq: request initialization optimizations block: add queue flag for disabling SG merging block: remove 'magic' from struct blk_plug blk-mq: remove alloc_hctx and free_hctx methods blk-mq: add file comments and update copyright notices blk-mq: remove blk_mq_alloc_request_pinned blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request blk-mq: remove blk_mq_wait_for_tags blk-mq: initialize request in __blk_mq_alloc_request blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request blk-mq: add helper to insert requests from irq context blk-mq: remove stale comment for blk_mq_complete_request() blk-mq: allow non-softirq completions ...	2014-06-02 09:29:34 -07:00
Jens Axboe	ed851860b4	blk-mq: push IPI or local end_io decision to __blk_mq_complete_request() We have callers outside of the blk-mq proper (like timeouts) that want to call __blk_mq_complete_request(), so rename the function and put the decision code for whether to use ->softirq_done_fn or blk_mq_endio() into __blk_mq_complete_request(). This also makes the interface more logical again. blk_mq_complete_request() attempts to atomically mark the request completed, and calls __blk_mq_complete_request() if successful. __blk_mq_complete_request() then just ends the request. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 21:20:50 -06:00
Jens Axboe	feff689412	blk-mq: remember to start timeout handler for direct queue Commit `07068d5b8e` added a direct-to-hw-queue mode, but this mode needs to remember to add the request timeout handler as well. Without it, we don't track timeouts for these requests. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 15:42:56 -06:00
Jens Axboe	c7bca4183f	block: ensure that the timer is always added Commit `f793aa5378` relaxed the timer addition a little too much. If the timer isn't pending, we always need to add it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 15:41:39 -06:00
Fengguang Wu	ee3c5db089	blk-mq: blk_mq_unregister_hctx() can be static CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 10:31:13 -06:00
Jens Axboe	67aec14ce8	blk-mq: make the sysfs mq/ layout reflect current mappings Currently blk-mq registers all the hardware queues in sysfs, regardless of whether it uses them (e.g. they have CPU mappings) or not. The unused hardware queues lack the cpux/ directories, and the other sysfs entries (like active, pending, etc) are all zeroes. Change this so that sysfs correctly reflects the current mappings of the hardware queues. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:25:36 -06:00
Jens Axboe	f89ca16646	Merge branch 'for-3.16/core' into for-3.16/drivers Pulled in for the blk_mq_tag_to_rq() change, which impacts mtip32xx. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:11:50 -06:00
Shaohua Li	2230237500	blk-mq: blk_mq_tag_to_rq should handle flush request flush request is special, which borrows the tag from the parent request. Hence blk_mq_tag_to_rq needs special handling to return the flush request from the tag. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:06:42 -06:00
Dave Jones	da52f22fa9	block: remove dead code in scsi_ioctl:blk_verify_command filter gets assigned the address of blk_default_cmd_filter on entry to this function, so the !filter condition can never be true. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 13:38:50 -06:00
Jens Axboe	4b570521be	blk-mq: request initialization optimizations We currently clear a lot more than we need to, so make that a bit more clever. Make some of the init dependent on features, like only setting start_time if we are going to use it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 11:00:11 -06:00
Jens Axboe	05f1dd5315	block: add queue flag for disabling SG merging If devices are not SG starved, we waste a lot of time potentially collapsing SG segments. Enough that 1.5% of the CPU time goes to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE, which just returns the number of vectors in a bio instead of looping over all segments and checking for collapsible ones. Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg merging, if they so desire. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 09:53:32 -06:00
Jens Axboe	4d92a9beb3	block: remove 'magic' from struct blk_plug I don't think we've ever caught any bugs with this, and there's the list poisoning for the plug lists to catch uninitialized cases. So remove the magic member and save 8 bytes in the struct. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 08:09:00 -06:00
Jens Axboe	0fb662e225	Merge branch 'for-3.16/core' into for-3.16/drivers Pull in core changes (again), since we got rid of the alloc/free hctx mq_ops hooks and mtip32xx then needed updating again. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:18:51 -06:00
Christoph Hellwig	cdef54dd85	blk-mq: remove alloc_hctx and free_hctx methods There is no need for drivers to control hardware context allocation now that we do the context to node mapping in common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:18:31 -06:00
Jens Axboe	75bb4625bb	blk-mq: add file comments and update copyright notices None of the blk-mq files have an explanatory comment at the top for what that particular file does. Add that and add appropriate copyright notices as well. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:15:41 -06:00
Jens Axboe	6178976500	Merge branch 'for-3.16/core' into for-3.16/drivers mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the core changes so we have a properly merged end result. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:50:26 -06:00
Christoph Hellwig	d852564f8c	blk-mq: remove blk_mq_alloc_request_pinned We now only have one caller left and can open code it there in a cleaner way. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:27 -06:00
Christoph Hellwig	793597a6a9	blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request We already do a non-blocking allocation in blk_mq_map_request, no need to repeat it. Just call __blk_mq_alloc_request to wait directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:25 -06:00
Christoph Hellwig	a3bd77567c	blk-mq: remove blk_mq_wait_for_tags The current logic for blocking tag allocation is rather confusing, as we first allocated and then free again a tag in blk_mq_wait_for_tags, just to attempt a non-blocking allocation and then repeat if someone else managed to grab the tag before us. Instead change blk_mq_alloc_request_pinned to simply do a blocking tag allocation itself and use the request we get back from it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:23 -06:00
Christoph Hellwig	5dee857720	blk-mq: initialize request in __blk_mq_alloc_request Both callers if __blk_mq_alloc_request want to initialize the request, so lift it into the common path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:21 -06:00
Christoph Hellwig	4ce01dd1a0	blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request Instead of having two almost identical copies of the same code just let the callers pass in the reserved flag directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:19 -06:00
Christoph Hellwig	6fca6a611c	blk-mq: add helper to insert requests from irq context Both the cache flush state machine and the SCSI midlayer want to submit requests from irq context, and the current per-request requeue_work unfortunately causes corruption due to sharing with the csd field for flushes. Replace them with a per-request_queue list of requests to be requeued. Based on an earlier test by Ming Lei. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 08:08:02 -06:00
Jens Axboe	95f0968499	blk-mq: allow non-softirq completions Right now we export two ways of completing a request: 1) blk_mq_complete_request(). This uses an IPI (if needed) and completes through q->softirq_done_fn(). It also works with timeouts. 2) blk_mq_end_io(). This completes inline, and ignores any timeout state of the request. Let blk_mq_complete_request() handle non-softirq_done_fn completions as well, by just completing inline. If a driver has enough completion ports to place completions correctly, it need not define a mq_ops->complete() and we can avoid an indirect function call by doing the completion inline. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 17:46:48 -06:00
Jens Axboe	f14bbe77a9	blk-mq: pass in suggested NUMA node to ->alloc_hctx() Drivers currently have to figure this out on their own, and they are missing information to do it properly. The ones that did attempt to do it, do it wrong. So just pass in the suggested node directly to the alloc function. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 12:06:53 -06:00
Ming Lei	3d2936f457	block: only allocate/free mq_usage_counter in blk-mq The percpu counter is only used for blk-mq, so move its allocation and free inside blk-mq, and don't allocate it for legacy queue device. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 09:37:08 -06:00
Ming Lei	624dbe4754	blk-mq: avoid code duplication blk_mq_exit_hw_queues() and blk_mq_free_hw_queues() are introduced to avoid code duplication. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 09:37:06 -06:00
Ming Lei	1f9f07e917	blk-mq: fix leak of hctx->ctx_map hctx->ctx_map should have been freed inside blk_mq_free_queue(). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 08:34:45 -06:00
Fabian Frederick	35086784ca	block/blk-lib.c: make __blkdev_issue_zeroout static __blkdev_issue_zeroout is only used in blk-lib.c Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-26 17:39:09 -06:00
Christoph Hellwig	19c5d84f14	blk-mq: idle all hardware contexts before freeing a queue Without this we can leak the active_queues reference if a command is freed while it is considered active. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-26 17:21:51 -06:00
Jens Axboe	c22d9d8a60	blk-mq: allow setting of per-request timeouts Currently blk-mq uses the queue timeout for all requests. But for some commands, drivers may want to set a specific timeout for special requests. Allow this to be passed in through request->timeout, and use it if set. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-23 14:14:57 -06:00
Sam Bradshaw	edf866b380	blk-mq: export blk_mq_tag_busy_iter Export the blk-mq in-flight tag iterator for driver consumption. This is particularly useful in exception paths or SRSI where in-flight IOs need to be cancelled and/or reissued. The NVMe driver conversion will use this. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-23 13:30:16 -06:00
Jens Axboe	07068d5b8e	blk-mq: split make request handler for multi and single queue We want slightly different behavior from them: - On single queue devices, we currently use the per-process plug for deferred IO and for merging. - On multi queue devices, we don't use the per-process plug, but we want to go straight to hardware for SYNC IO. Split blk_mq_make_request() into a blk_sq_make_request() for single queue devices, and retain blk_mq_make_request() for multi queue devices. Then we don't need multiple checks for q->nr_hw_queues in the request mapping. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-22 10:43:07 -06:00
Jens Axboe	484b4061e6	blk-mq: save memory by freeing requests on unused hardware queues Depending on the topology of the machine and the number of queues exposed by a device, we can end up in a situation where some of the hardware queues are unused (as in, they don't map to any software queues). For this case, free up the memory used by the request map, as we will not use it. This can be a substantial amount of memory, depending on the number of queues vs CPUs and the queue depth of the device. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-21 14:01:15 -06:00
Jens Axboe	e814e71ba4	blk-mq: allow the hctx cpu hotplug notifier to return errors Prepare this for the next patch which adds more smarts in the plugging logic, so that we can save some memory. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-21 13:59:08 -06:00
Robert Elliott	da41a589f5	blk-mq: Micro-optimize blk_queue_nomerges() check In blk_mq_make_request(), do the blk_queue_nomerges() check outside the call to blk_attempt_plug_merge() to eliminate function call overhead when nomerges=2 (disabled) Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 15:49:03 -06:00
Jens Axboe	eba7176826	blk-mq: initialize q->nr_requests after calling blk_queue_make_request() blk_queue_make_requests() overwrites our set value for q->nr_requests, turning it into the default of 128. Set this appropriately after initializing queue values in blk_queue_make_request(). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 15:17:27 -06:00
Jens Axboe	e3a2b3f931	blk-mq: allow changing of queue depth through sysfs For request_fn based devices, the block layer exports a 'nr_requests' file through sysfs to allow adjusting of queue depth on the fly. Currently this returns -EINVAL for blk-mq, since it's not wired up. Wire this up for blk-mq, so that it now also always dynamic adjustments of the allowed queue depth for any given block device managed by blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 11:49:02 -06:00
Jens Axboe	719c555f44	block: move mm/bounce.c to block/ Continue moving some of the block files that are scattered around. bounce.c contains only code for bouncing the contents of a bio. It's block proper code, not mm code. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 20:01:52 -06:00
Jens Axboe	39a9f97e5e	Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/blk-mq-tag.c	2014-05-19 11:52:35 -06:00
Jens Axboe	1429d7c946	blk-mq: switch ctx pending map to the sparser blk_align_bitmap Each hardware queue has a bitmap of software queues with pending requests. When new IO is queued on a software queue, the bit is set, and when IO is pruned on a hardware queue run, the bit is cleared. This causes a lot of traffic. Switch this from the regular BITS_PER_LONG bitmap to a sparser layout, similarly to what was done for blk-mq tagging. 20% performance increase was observed for single threaded IO, and about 15% performanc increase on multiple threads driving the same device. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:47 -06:00
Jens Axboe	e93ecf602b	blk-mq: move the cache friendly bitmap type of out blk-mq-tag We will use it for the pending list in blk-mq core as well. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:47 -06:00
Jens Axboe	2667bcbbd5	block: move ioprio.c from fs/ to block/ Like commit `f9c78b2b`, move this block related file outside of fs/ and into the core block directory, block/. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:18 -06:00
Jens Axboe	f9c78b2be2	block: move bio.c and bio-integrity.c from fs/ to block/ They really belong in block/, especially now since it's not in drivers/block/ anymore. Additionally, the get_maintainer script gets it wrong when in fs/. Suggested-by: Christoph Hellwig <hch@infradead.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 08:34:46 -06:00
Tejun Heo	5c9d535b89	cgroup: remove css_parent() cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-05-16 13:22:48 -04:00
Jens Axboe	0d2602ca30	blk-mq: improve support for shared tags maps This adds support for active queue tracking, meaning that the blk-mq tagging maintains a count of active users of a tag set. This allows us to maintain a notion of fairness between users, so that we can distribute the tag depth evenly without starving some users while allowing others to try unfair deep queues. If sharing of a tag set is detected, each hardware queue will track the depth of its own queue. And if this exceeds the total depth divided by the number of active queues, the user is actively throttled down. The active queue count is done lazily to avoid bouncing that data between submitter and completer. Each hardware queue gets marked active when it allocates its first tag, and gets marked inactive when 1) the last tag is cleared, and 2) the queue timeout grace period has passed. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-13 15:10:52 -06:00
Tejun Heo	451af504df	cgroup: replace cftype->write_string() with cftype->write() Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>	2014-05-13 12:16:21 -04:00
Tejun Heo	ec903c0c85	cgroup: rename css_tryget() to css_tryget_online() Unlike the more usual refcnting, what css_tryget() provides is the distinction between online and offline csses instead of protection against upping a refcnt which already reached zero. cgroup is planning to provide actual tryget which fails if the refcnt already reached zero. Let's rename the existing trygets so that they clearly indicate that they're onliness. I thought about keeping the existing names as-are and introducing new names for the planned actual tryget; however, given that each controller participates in the synchronization of the online state, it seems worthwhile to make it explicit that these functions are about on/offline state. Rename css_tryget() to css_tryget_online() and css_tryget_from_dir() to css_tryget_online_from_dir(). This is pure rename. v2: cgroup_freezer grew new usages of css_tryget(). Update accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>	2014-05-13 12:11:01 -04:00
Jens Axboe	acb12e0a9c	Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core	2014-05-10 15:44:42 -06:00
Ming Lei	1f236ab22c	blk-mq: bitmap tag: cleanup blk_mq_init_tags Both nr_cache and nr_tags arn't needed for bitmap tag anymore. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:44:00 -06:00
Ming Lei	9d3d21aeb4	blk-mq: bitmap tag: select random tag betweet 0 and (depth - 1) The selected tag should be selected at random between 0 and (depth - 1) with probability 1/depth, instead between 0 and (depth - 2) with probability 1/(depth - 1). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:43:14 -06:00
Ming Lei	60f2df8a29	blk-mq: bitmap tag: remove barrier in bt_clear_tag() The barrier isn't necessary because both atomic_dec_and_test() and wake_up() implicate one barrier. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:42:13 -06:00
Ming Lei	0289b2e110	blk-mq: bitmap tag: use clear_bit_unlock in bt_clear_tag() The unlock memory barrier need to order access to req in free path and clearing tag bit, otherwise either request free path may see a allocated request, or initialized request in allocate path might be modified by the ongoing free path. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:41:42 -06:00
Jens Axboe	7276d02e24	block: only calculate part_in_flight() once We first check if we have inflight IO, then retrieve that same number again. Usually this isn't that costly since the chance of having the data dirtied in between is small, but there's no reason for calling part_in_flight() twice. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 15:48:23 -06:00
Jens Axboe	cf4b50afc2	blk-mq: fix race in IO start accounting Commit `c6d600c6` opened up a small race where we could attempt to account IO completion on a request, racing with IO start accounting. Fix this up by ensuring that we've accounted for IO start before inserting the request. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 14:54:08 -06:00
Jens Axboe	59d13bf5f5	blk-mq: use sparser tag layout for lower queue depth For best performance, spreading tags over multiple cachelines makes the tagging more efficient on multicore systems. But since we have 8 * sizeof(unsigned long) tags per cacheline, we don't always get a nice spread. Attempt to spread the tags over at least 4 cachelines, using fewer number of bits per unsigned long if we have to. This improves tagging performance in setups with 32-128 tags. For higher depths, the spread is the same as before (BITS_PER_LONG tags per cacheline). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 13:41:15 -06:00
Jens Axboe	4bb659b156	blk-mq: implement new and more efficient tagging scheme blk-mq currently uses percpu_ida for tag allocation. But that only works well if the ratio between tag space and number of CPUs is sufficiently high. For most devices and systems, that is not the case. The end result if that we either only utilize the tag space partially, or we end up attempting to fully exhaust it and run into lots of lock contention with stealing between CPUs. This is not optimal. This new tagging scheme is a hybrid bitmap allocator. It uses two tricks to both be SMP friendly and allow full exhaustion of the space: 1) We cache the last allocated (or freed) tag on a per blk-mq software context basis. This allows us to limit the space we have to search. The key element here is not caching it in the shared tag structure, otherwise we end up dirtying more shared cache lines on each allocate/free operation. 2) The tag space is split into cache line sized groups, and each context will start off randomly in that space. Even up to full utilization of the space, this divides the tag users efficiently into cache line groups, avoiding dirtying the same one both between allocators and between allocator and freeer. This scheme shows drastically better behaviour, both on small tag spaces but on large ones as well. It has been tested extensively to show better performance for all the cases blk-mq cares about. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 09:36:49 -06:00
Christoph Hellwig	af76e555e5	blk-mq: initialize struct request fields individually This allows us to avoid a non-atomic memset over ->atomic_flags as well as killing lots of duplicate initializations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 08:43:49 -06:00
Jens Axboe	9fccfed8f0	blk-mq: update a hotplug comment for grammar Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 08:43:49 -06:00
Jens Axboe	506e931f92	blk-mq: add basic round-robin of what CPU to queue workqueue work on Right now we just pick the first CPU in the mask, but that can easily overload that one. Add some basic batching and round-robin all the entries in the mask instead. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-07 10:26:44 -06:00
Tejun Heo	36c38fb714	blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() During the recent conversion of cgroup to kernfs, cgroup_tree_mutex which nests above both the kernfs s_active protection and cgroup_mutex is added to synchronize cgroup file type operations as cgroup_mutex needed to be grabbed from some file operations and thus can't be put above s_active protection. While this arrangement mostly worked for cgroup, this triggered the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W ------------------------------------------------------- trinity-c173/9024 is trying to acquire lock: (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) but task is already holding lock: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (s_active#89){++++.+}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024) kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219) cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899) cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2)) rebind_subsystems (kernel/cgroup.c:1144) cgroup_setup_root (kernel/cgroup.c:1568) cgroup_mount (kernel/cgroup.c:1716) mount_fs (fs/super.c:1094) vfs_kern_mount (fs/namespace.c:899) do_mount (fs/namespace.c:2238 fs/namespace.c:2561) SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729) tracesys (arch/x86/kernel/entry_64.S:746) -> #1 (cgroup_tree_mutex){+.+.+.}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040) blkcg_policy_register (block/blk-cgroup.c:1106) throtl_init (block/blk-throttle.c:1694) do_one_initcall (init/main.c:789) kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003) kernel_init (init/main.c:935) ret_from_fork (arch/x86/kernel/entry_64.S:552) -> #0 (blkcg_pol_mutex){+.+.+.}: __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) tracesys (arch/x86/kernel/entry_64.S:746) other info that might help us debug this: Chain exists of: blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#89); lock(cgroup_tree_mutex); lock(s_active#89); lock(blkcg_pol_mutex); * DEADLOCK * 4 locks held by trinity-c173/9024: #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714) #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530) #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283) #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) stack backtrace: CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002 ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004 ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0 Call Trace: dump_stack (lib/dump_stack.c:52) print_circular_bug (kernel/locking/lockdep.c:1216) __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) This is a highly unlikely but valid circular dependency between "echo 1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going through further locking update which will remove this complication but for now let's use trylock on blkcg_pol_mutex and retry the file operation if the trylock fails. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com	2014-05-05 13:48:18 -04:00
Keith Busch	3291fa57cb	NVMe: Add tracepoints Adding tracepoints for bio_complete and block_split into nvme to help with gathering IO info using blktrace and blkparse. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-05 10:41:26 -04:00
Fabian Frederick	5cf8c22775	block/blk-throttle.c: fix return of 0/1 with return type bool Fix 4 coccinelle warnings. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-02 11:38:03 -06:00
Fabian Frederick	5214e33c8e	block/blk-iopoll.c: use iop instead of iopoll All blk_iopoll functions use iop for parent iopoll structure except blk_iopoll_complete.This also fixes one kernel-doc warning. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-02 11:37:41 -06:00
Jens Axboe	74814b1c55	blk-mq: remove extra requeue trace We already issue a blktrace requeue event in __blk_mq_requeue_request(), don't do it from the original caller as well. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-02 11:24:48 -06:00
Masanari Iida	176167ad9e	block: Fix format string mismatch in cfq-iosched.c Fix format string mismatch in cfq_var_show() Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 15:56:10 -06:00
Jens Axboe	c6d600c65e	blk-mq: refactor request insertion/merging Refactor the logic around adding a new bio to a software queue, so we nest the ctx->lock where we really need it (merge and insertion) and don't hold it when we don't (init and IO start accounting). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 13:43:56 -06:00
Jens Axboe	98bc1f272a	blk-mq remove debug BUG_ON() when draining software queues It's never been of any use, lets get rid of it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 13:43:08 -06:00
Jens Axboe	5810d903fa	blk-mq: fix waiting for reserved tags blk_mq_wait_for_tags() is only able to wait for "normal" tags, not reserved tags. Pass in which one we should attempt to get a tag for, so that waiting for reserved tags will work. Reserved tags are used for internal commands, which are usually serialized. Hence no waiting generally takes place, but we should ensure that it actually works if users need that functionality. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-29 20:49:48 -06:00
Christoph Hellwig	c4a634f432	block: fold __blk_add_timer into blk_add_timer Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-25 08:24:40 -06:00
Christoph Hellwig	3853520163	blk-mq: respect rq_affinity The blk-mq code is using it's own version of the I/O completion affinity tunables, which causes a few issues: - the rq_affinity sysfs file doesn't work for blk-mq devices, even if it still is present, thus breaking existing tuning setups. - the rq_affinity = 1 mode, which is the defauly for legacy request based drivers isn't implemented at all. - blk-mq drivers don't implement any completion affinity with the default flag settings. This patches removes the blk-mq ipi_redirect flag and sysfs file, as well as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that respects the queue-wide rq_affinity flags and also implements the rq_affinity = 1 mode. This means I/O completion affinity can now only be tuned block-queue wide instead of per context, which seems more sensible to me anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-25 08:24:07 -06:00
Jens Axboe	87ee7b1121	blk-mq: fix race with timeouts and requeue events If a requeue event races with a timeout, we can get into the situation where we attempt to complete a request from the timeout handler when it's not start anymore. This causes a crash. So have the timeout handler check that REQ_ATOM_STARTED is still set on the request - if not, we ignore the event. If this happens, the request has now been marked as complete. As a consequence, we need to ensure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(), as to maintain proper request state. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-24 08:51:47 -06:00
Jens Axboe	70ab0b2d51	Revert "blk-mq: initialize req->q in allocation" This reverts commit `6a3c8a3ac0`. We need selective clearing of the request to make the init-at-free time completely safe. Otherwise we end up stomping on rq->atomic_flags, which we don't want to do.	2014-04-24 08:50:38 -06:00
Ming Lei	981bd189f8	blk-mq: fix leak of set->tags set->tags should be freed in blk_mq_free_tag_set(). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-23 10:08:22 -06:00
Fabian Frederick	8876e140ec	block/blk-throttle.c: add static to blk_throtl_dispatch_work_fn blk_throtl_dispatch_work_fn is only used in blk-throttle.c Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 19:30:28 -06:00
Ming Lei	6a3c8a3ac0	blk-mq: initialize req->q in allocation The patch basically reverts the patch of(blk-mq: initialize request on allocation) in Jens's tree(already in -next), and only initialize req->q in allocation for two reasons: - presumed cache hotness on completion - blk_rq_tagged(rq) depends on reset of req->mq_ctx Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:39 -06:00
Ming Lei	4ca085009f	blk-mq: user (1 << order) to implement order_to_size() Cc: Jörg-Volker Peetz <jvpeetz@web.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:38 -06:00
Ming Lei	4847900532	blk-mq: fix allocation of set->tags type of set->tags is struct blk_mq_tags **. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:36 -06:00
Ming Lei	11471e0d04	blk-mq: free hctx->ctx_map when init failed Avoid memory leak in the failure path. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:34 -06:00
Peter Zijlstra	4e857c58ef	arch: Mass conversion of smp_mb__() Mostly scripted conversion of the smp_mb__ barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-04-18 14:20:48 +02:00
Jens Axboe	49fd524f95	bsg: update check for rq based driver for blk-mq bsg currently checks ->request_fn to check whether a queue can handle struct request. But with blk-mq, we don't have a request_fn yet are request based. Add a queue_is_rq_based() helper and use that in bsg, I'm guessing this is not the last place we need to update for this. Besides, it better explains what is being checked. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:46 -06:00
Jens Axboe	f793aa5378	block: relax when to modify the timeout timer Since we are now, by default, applying timer slack to expiry times, the logic for when to modify a timer in the block code is suboptimal. The block layer keeps a forward rolling timer per queue for all requests, and modifies this timer if a request has a shorter timeout than what the current expiry time is. However, this breaks down when our rounded timer values get applied slack. Then each new request ends up modifying the timer, since we're still a little in front of the timer + slack. Fix this by allowing a tolerance of HZ / 2, the timeout handling doesn't need to be very precise. This drastically cuts down the number of timer modifications we have to make. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	12120077b2	block: export blk_finish_request This allows to mirror the blk-mq code flow for more a more readable I/O completion handler in SCSI. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	f88a164b72	blk-mq: rename mq_flush_work struct request member We will use this work_struct to requeue scsi commands from the completion handler as well, so give it a more generic name. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	ed0791b2f8	blk-mq: add blk_mq_requeue_request This allows to requeue a request that has been accepted by ->queue_rq earlier. This is needed by the SCSI layer in various error conditions. The existing internal blk_mq_requeue_request is renamed to __blk_mq_requeue_request as it is a lower level building block for this funtionality. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	2f26855656	blk-mq: add blk_mq_start_hw_queues Add a helper to unconditionally kick contexts of a queue. This will be needed by the SCSI layer to provide fair queueing between multiple devices on a single host. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	70f4db639c	blk-mq: add blk_mq_delay_queue Add a blk-mq equivalent to blk_delay_queue so that the scsi layer can ask to be kicked again after a delay. Signed-off-by: Christoph Hellwig <hch@lst.de> Modified by me to kill the unnecessary preempt disable/enable in the delayed workqueue handler. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	1b4a325858	blk-mq: add async parameter to blk_mq_start_stopped_hw_queues Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	91b63639c7	blk-mq: bidi support Add two unlinkely branches to make sure the resid is initialized correctly for bidi request pairs, and the second request gets properly freed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	63151a449e	blk-mq: allow drivers to hook into I/O completion Split out the bottom half of blk_mq_end_io so that drivers can perform work when they know a request has been completed, but before it has been freed. This also obsoletes blk_mq_end_io_partial as drivers can now pass any value to blk_update_request directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Jens Axboe	6700a678c0	blk-mq: kill preempt disable/enable in blk_mq_work_fn() blk_mq_work_fn() is always invoked off the bounded workqueues, so it can happily preempt among the queues in that set without causing any issues for blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:24 -06:00
Jens Axboe	fd1270d5df	blk-mq: don't use preempt_count() to check for right CPU UP or CONFIG_PREEMPT_NONE will return 0, and what we really want to check is whether or not we are on the right CPU. So don't make PREEMPT part of this, just test the CPU in the mask directly. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:24 -06:00
Christoph Hellwig	24d2f90309	blk-mq: split out tag initialization, support shared tags Add a new blk_mq_tag_set structure that gets set up before we initialize the queue. A single blk_mq_tag_set structure can be shared by multiple queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Modular export of blk_mq_{alloc,free}_tagset added by me. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:18:02 -06:00
Christoph Hellwig	ed44832dea	blk-mq: initialize request on allocation If we want to share tag and request allocation between queues we cannot initialize the request at init/free time, but need to initialize it at allocation time as it might get used for different queues over its lifetime. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:03 -06:00
Christoph Hellwig	e9b267d91f	blk-mq: add ->init_request and ->exit_request methods The current blk_mq_init_commands/blk_mq_free_commands interface has a two problems: 1) Because only the constructor is passed to blk_mq_init_commands there is no easy way to clean up when a comman initialization failed. The current code simply leaks the allocations done in the constructor. 2) There is no good place to call blk_mq_free_commands: before blk_cleanup_queue there is no guarantee that all outstanding commands have completed, so we can't free them yet. After blk_cleanup_queue the queue has usually been freed. This can be worked around by grabbing an unconditional reference before calling blk_cleanup_queue and dropping it after blk_mq_free_commands is done, although that's not exatly pretty and driver writers are guaranteed to get it wrong sooner or later. Both issues are easily fixed by making the request constructor and destructor normal blk_mq_ops methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:03 -06:00
Christoph Hellwig	8727af4b9d	blk-mq: make ->flush_rq fully transparent to drivers Drivers shouldn't have to care about the block layer setting aside a request to implement the flush state machine. We already override the mq context and tag to make it more transparent, but so far haven't deal with the driver private data in the request. Make sure to override this as well, and while we're at it add a proper helper sitting in blk-mq.c that implements the full impersonation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Christoph Hellwig	9d74e25737	blk-mq: do not initialize req->special Drivers can reach their private data easily using the blk_mq_rq_to_pdu helper and don't need req->special. By not initializing it code can be simplified nicely, and we also shave off a few more instructions from the I/O path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Christoph Hellwig	742ee69b92	blk-mq: initialize resid_len Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Jens Axboe	b4f42e2831	block: remove struct request buffer member This was used in the olden days, back when onions were proper yellow. Basically it mapped to the current buffer to be transferred. With highmem being added more than a decade ago, most drivers map pages out of a bio, and rq->buffer isn't pointing at anything valid. Convert old style drivers to just use bio_data(). For the discard payload use case, just reference the page in the bio. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Jens Axboe	f89e0dd9d1	Linux 3.15-rc1 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJTSv89AAoJEHm+PkMAQRiGd7AIAKL45dFRhX96W53uzGlpXtv7 Ecs9CMY5mtsB/rqtV/NSQaUELlNb4Ilb4lITh7NaLWAZxDJ12GwVsIbFoaBj7Ypx tfBNNxffGGnTTn2C1PpPQmLytLBvqXIVHMaPpDvnHYJl6g9BihshLzyMrsrizqnA DPJ0xMdgGp6BLC4qm0ZH1mS2q+TyXLN2ZCjJ1lp6NNYwBnwOe/ABjnUa0Ztu7aii 837N8h6nEuKHTr6DgrYHEgYVc3DPHPHaly/UJ8v4U30myzRv83YkD5DsBSUjSLj8 CzxJEnECa1XljLJK7SjRHy5pSP9tcUlmjx3Fk7IpQixmT+A15cov6fQ967jNlDw= =Hxnc -----END PGP SIGNATURE----- Merge tag 'v3.15-rc1' into for-3.16/core We don't like this, but things have diverged with the blk-mq fixes in 3.15-rc1. So merge it in.	2014-04-15 14:02:24 -06:00
Linus Torvalds	5166701b36	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it does shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...	2014-04-12 14:49:50 -07:00
Duan Jiong	21f9fcd815	block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-11 08:09:47 -06:00
Jens Axboe	360f92c244	block: fix regression with block enabled tagging Martin reported that his test system would not boot with current git, it oopsed with this: BUG: unable to handle kernel paging request at ffff88046c6c9e80 IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246 Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013 Workqueue: events_unbound async_run_entry_fn task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000 RIP: 0010:[<ffffffff812971e0>] [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 RSP: 0018:ffff880273d03a58 EFLAGS: 00010092 RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6 RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410 R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0 R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e FS: 0000000000000000(0000) GS:ffff880277b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0 Stack: ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000 ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62 ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8 Call Trace: [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510 [<ffffffff81293167>] __blk_run_queue+0x37/0x50 [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130 [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0 [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0 [<ffffffff813bba35>] scsi_execute+0xe5/0x180 [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110 [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod] [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0 [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod] [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod] [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140 [<ffffffff8106a1c5>] process_one_work+0x175/0x410 [<ffffffff8106b373>] worker_thread+0x123/0x400 [<ffffffff8106b250>] ? manage_workers+0x160/0x160 [<ffffffff8107104e>] kthread+0xce/0xf0 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70 Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89 df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89 58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d RIP [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 RSP <ffff880273d03a58> CR2: ffff88046c6c9e80 Martin bisected and found this to be the problem patch; commit `6d113398dc` Author: Jan Kara <jack@suse.cz> Date: Mon Feb 24 16:39:54 2014 +0100 block: Stop abusing rq->csd.list in blk-softirq and the problem was immediately apparent. The patch states that it is safe to reuse queuelist at completion time, since it is no longer used. However, that is not true if a device is using block enabled tagging. If that is the case, then the queuelist is reused to keep track of busy tags. If a device also ended up using softirq completions, we'd reuse ->queuelist for the IPI handling while block tagging was still using it. Boom. Fix this by adding a new ipi_list list head, and share the memory used with the request hash table. The hash table is never used after the request is moved to the dispatch list, which happens long before any potential completion of the request. Add a new request bit for this, so we don't have cases that check rq->hash while it could potentially have been reused for the IPI completion. Reported-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 21:54:06 -06:00
Jens Axboe	cb2da43e3d	blk-mq: simplify blk_mq_hw_sysfs_cpus_show() Now that we have a cpu mask of CPUs that are mapped to a specific hardware queue, we can just iterate that to display the sysfs num-hw-queue/cpu_list file. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:53:21 -06:00
Jens Axboe	e4043dcf30	blk-mq: ensure that hardware queues are always run on the mapped CPUs Instead of providing soft mappings with no guarantees on hardware queues always being run on the right CPU, switch to a hard mapping guarantee that ensure that we always run the hardware queue on (one of, if more) the mapped CPU. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:18:23 -06:00
Jens Axboe	8ab14595b6	block: add kblockd_schedule_delayed_work_on() Same function as kblockd_schedule_delayed_work(), but allow the caller to pass in a CPU that the work should be executed on. This just directly extends and maps into the workqueue API, and will be used to make the blk-mq mappings more strict. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:17:03 -06:00
Jens Axboe	59c3d45e48	block: remove 'q' parameter from kblockd_schedule_*_work() The queue parameter is never used, just get rid of it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:17:00 -06:00
Jens Axboe	bccb5f7c8b	blk-mq: fix potential stall during CPU unplug with IO pending When a CPU is unplugged, we move the blk_mq_ctx request entries to the current queue. The current code forgets to remap the blk_mq_hw_ctx before marking the software context pending, which breaks if old-cpu and new-cpu don't map to the same hardware queue. Additionally, if we mark entries as pending in the new hardware queue, then make sure we schedule it for running. Otherwise request could be sitting there until someone else queues IO for that hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-07 08:17:18 -06:00
Linus Torvalds	32d01dc7be	Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot updates for cgroup: - The biggest one is cgroup's conversion to kernfs. cgroup took after the long abandoned vfs-entangled sysfs implementation and made it even more convoluted over time. cgroup's internal objects were fused with vfs objects which also brought in vfs locking and object lifetime rules. Naturally, there are places where vfs rules don't fit and nasty hacks, such as credential switching or lock dance interleaving inode mutex and cgroup_mutex with object serial number comparison thrown in to decide whether the operation is actually necessary, needed to be employed. After conversion to kernfs, internal object lifetime and locking rules are mostly isolated from vfs interactions allowing shedding of several nasty hacks and overall simplification. This will also allow implmentation of operations which may affect multiple cgroups which weren't possible before as it would have required nesting i_mutexes. - Various simplifications including dropping of module support, easier cgroup name/path handling, simplified cgroup file type handling and task_cg_lists optimization. - Prepatory changes for the planned unified hierarchy, which is still a patchset away from being actually operational. The dummy hierarchy is updated to serve as the default unified hierarchy. Controllers which aren't claimed by other hierarchies are associated with it, which BTW was what the dummy hierarchy was for anyway. - Various fixes from Li and others. This pull request includes some patches to add missing slab.h to various subsystems. This was triggered xattr.h include removal from cgroup.h. cgroup.h indirectly got included a lot of files which brought in xattr.h which brought in slab.h. There are several merge commits - one to pull in kernfs updates necessary for converting cgroup (already in upstream through driver-core), others for interfering changes in the fixes branch" * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits) cgroup: remove useless argument from cgroup_exit() cgroup: fix spurious lockdep warning in cgroup_exit() cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c cgroup: break kernfs active_ref protection in cgroup directory operations cgroup: fix cgroup_taskset walking order cgroup: implement CFTYPE_ONLY_ON_DFL cgroup: make cgrp_dfl_root mountable cgroup: drop const from @buffer of cftype->write_string() cgroup: rename cgroup_dummy_root and related names cgroup: move ->subsys_mask from cgroupfs_root to cgroup cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding cgroup: remove NULL checks from [pr_cont_]cgroup_{name\|path}() cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root cgroup: reorganize cgroup bootstrapping cgroup: relocate setting of CGRP_DEAD cpuset: use rcu_read_lock() to protect task_cs() cgroup_freezer: document freezer_fork() subtleties cgroup: update cgroup_transfer_tasks() to either succeed or fail cgroup: drop task_lock() protection around task->cgroups cgroup: update how a newly forked task gets associated with css_set ...	2014-04-03 13:05:42 -07:00
Linus Torvalds	cd6362befe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: "Here is my initial pull request for the networking subsystem during this merge window: 1) Support for ESN in AH (RFC 4302) from Fan Du. 2) Add full kernel doc for ethtool command structures, from Ben Hutchings. 3) Add BCM7xxx PHY driver, from Florian Fainelli. 4) Export computed TCP rate information in netlink socket dumps, from Eric Dumazet. 5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas Dichtel. 6) Convert many drivers to pci_enable_msix_range(), from Alexander Gordeev. 7) Record SKB timestamps more efficiently, from Eric Dumazet. 8) Switch to microsecond resolution for TCP round trip times, also from Eric Dumazet. 9) Clean up and fix 6lowpan fragmentation handling by making use of the existing inet_frag api for it's implementation. 10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss. 11) Auto size SKB lengths when composing netlink messages based upon past message sizes used, from Eric Dumazet. 12) qdisc dumps can take a long time, add a cond_resched(), From Eric Dumazet. 13) Sanitize netpoll core and drivers wrt. SKB handling semantics. Get rid of never-used-in-tree netpoll RX handling. From Eric W Biederman. 14) Support inter-address-family and namespace changing in VTI tunnel driver(s). From Steffen Klassert. 15) Add Altera TSE driver, from Vince Bridgers. 16) Optimizing csum_replace2() so that it doesn't adjust the checksum by checksumming the entire header, from Eric Dumazet. 17) Expand BPF internal implementation for faster interpreting, more direct translations into JIT'd code, and much cleaner uses of BPF filtering in non-socket ocntexts. From Daniel Borkmann and Alexei Starovoitov" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits) netpoll: Use skb_irq_freeable to make zap_completion_queue safe. net: Add a test to see if a skb is freeable in irq context qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port' net: ptp: move PTP classifier in its own file net: sxgbe: make "core_ops" static net: sxgbe: fix logical vs bitwise operation net: sxgbe: sxgbe_mdio_register() frees the bus Call efx_set_channels() before efx->type->dimension_resources() xen-netback: disable rogue vif in kthread context net/mlx4: Set proper build dependancy with vxlan be2net: fix build dependency on VxLAN mac802154: make csma/cca parameters per-wpan mac802154: allow only one WPAN to be up at any given time net: filter: minor: fix kdoc in __sk_run_filter netlink: don't compare the nul-termination in nla_strcmp can: c_can: Avoid led toggling for every packet. can: c_can: Simplify TX interrupt cleanup can: c_can: Store dlc private can: c_can: Reduce register access can: c_can: Make the code readable ...	2014-04-02 20:53:45 -07:00
Linus Torvalds	159d8133d0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual rocket science -- mostly documentation and comment updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: sparse: fix comment doc: fix double words isdn: capi: fix "CAPI_VERSION" comment doc: DocBook: Fix typos in xml and template file Bluetooth: add module name for btwilink driver core: unexport static function create_syslog_header mmc: core: typo fix in printk specifier ARM: spear: clean up editing mistake net-sysfs: fix comment typo 'CONFIG_SYFS' doc: Insert MODULE_ in module-signing macros Documentation: update URL to hfsplus Technote 1150 gpio: update path to documentation ixgbe: Fix format string in ixgbe_fcoe. Kconfig: Remove useless "default N" lines user_namespace.c: Remove duplicated word in comment CREDITS: fix formatting treewide: Fix typo in Documentation/DocBook mm: Fix warning on make htmldocs caused by slab.c ata: ata-samsung_cf: cleanup in header file idr: remove unused prototype of idr_free()	2014-04-02 16:23:38 -07:00
Al Viro	86d564c84c	constify blk_rq_map_user_iov() and friends sg_iovec array passed to it can be const Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-04-01 23:19:31 -04:00
Linus Torvalds	7a48837732	Merge branch 'for-3.15/core' of git://git.kernel.dk/linux-block Pull core block layer updates from Jens Axboe: "This is the pull request for the core block IO bits for the 3.15 kernel. It's a smaller round this time, it contains: - Various little blk-mq fixes and additions from Christoph and myself. - Cleanup of the IPI usage from the block layer, and associated helper code. From Frederic Weisbecker and Jan Kara. - Duplicate code cleanup in bio-integrity from Gu Zheng. This will give you a merge conflict, but that should be easy to resolve. - blk-mq notify spinlock fix for RT from Mike Galbraith. - A blktrace partial accounting bug fix from Roman Pen. - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li" * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits) blk-mq: add REQ_SYNC early rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock blk-mq: support partial I/O completions blk-mq: merge blk_mq_insert_request and blk_mq_run_request blk-mq: remove blk_mq_alloc_rq blk-mq: don't dump CPU -> hw queue map on driver load blk-mq: fix wrong usage of hctx->state vs hctx->flags blk-mq: allow blk_mq_init_commands() to return failure block: remove old blk_iopoll_enabled variable blktrace: fix accounting of partially completed requests smp: Rename __smp_call_function_single() to smp_call_function_single_async() smp: Remove wait argument from __smp_call_function_single() watchdog: Simplify a little the IPI call smp: Move __smp_call_function_single() below its safe version smp: Consolidate the various smp_call_function_single() declensions smp: Teach __smp_call_function_single() to check for offline cpus smp: Remove unused list_head from csd smp: Iterate functions through llist_for_each_entry_safe() block: Stop abusing rq->csd.list in blk-softirq block: Remove useless IPI struct initialization ...	2014-04-01 19:19:15 -07:00
David S. Miller	04f58c8854	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: Documentation/devicetree/bindings/net/micrel-ks8851.txt net/core/netpoll.c The net/core/netpoll.c conflict is a bug fix in 'net' happening to code which is completely removed in 'net-next'. In micrel-ks8851.txt we simply have overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-25 20:29:20 -04:00
Shaohua Li	27fbf4e87c	blk-mq: add REQ_SYNC early Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init is set correctly. Signed-off-by: Shaohua Li<shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:58 -06:00
Mike Galbraith	55c816e3df	rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock [ 365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674 [ 365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1 [ 365.164043] no locks held by migration/1/26. [ 365.164044] irq event stamp: 6648 [ 365.164056] hardirqs last enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30 [ 365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120 [ 365.164070] softirqs last enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920 [ 365.164072] softirqs last disabled at (0): [< (null)>] (null) [ 365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF N 3.12.12-rt19-0.gcb6c4a2-rt #3 [ 365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011 [ 365.164091] 0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0 [ 365.164099] ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f [ 365.164107] ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0 [ 365.164108] Call Trace: [ 365.164119] [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0 [ 365.164127] [<ffffffff81004872>] dump_trace+0x92/0x360 [ 365.164133] [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60 [ 365.164138] [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0 [ 365.164143] [<ffffffff81006160>] show_stack+0x20/0x50 [ 365.164153] [<ffffffff815367e6>] dump_stack+0x54/0x9a [ 365.164163] [<ffffffff8108919c>] __might_sleep+0xfc/0x140 [ 365.164173] [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70 [ 365.164182] [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70 [ 365.164191] [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70 [ 365.164201] [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10 [ 365.164207] [<ffffffff810567be>] cpu_notify+0x1e/0x40 [ 365.164217] [<ffffffff81525da2>] take_cpu_down+0x22/0x40 [ 365.164223] [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120 [ 365.164229] [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0 [ 365.164235] [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380 [ 365.164241] [<ffffffff8107cbf8>] kthread+0xc8/0xd0 [ 365.164250] [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0 [ 365.164429] smpboot: CPU 1 is now offline Signed-off-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:56 -06:00
Christoph Hellwig	7237c740b0	blk-mq: support partial I/O completions Add a new blk_mq_end_io_partial function to partially complete requests as needed by the SCSI layer. We do this by reusing blk_update_request to advance the bio instead of having a simplified version of it in the blk-mq code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:55 -06:00
Christoph Hellwig	eeabc850b7	blk-mq: merge blk_mq_insert_request and blk_mq_run_request It's almost identical to blk_mq_insert_request, so fold the two into one slightly more generic function by making the flush special case a bit smarted. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:37 -06:00
Christoph Hellwig	081241e592	blk-mq: remove blk_mq_alloc_rq There's only one caller, which is a straight wrapper and fits the naming scheme of the related functions a lot better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:41:45 -06:00
Dave Jones	708f04d2ab	block: free q->flush_rq in blk_init_allocated_queue error paths Commit `7982e90c3a` ("block: fix q->flush_rq NULL pointer crash on dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but neglected to free that allocation on the error paths that follow. Signed-off-by: Dave Jones <davej@fedoraproject.org> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-03-20 22:32:06 -07:00
Jens Axboe	676141e48a	blk-mq: don't dump CPU -> hw queue map on driver load Now that we are out of initial debug/bringup mode, remove the verbose dump of the mapping table. Provide the mapping table in sysfs, under the hardware queue directory, in the cpu_list file. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-20 13:31:44 -06:00
Jens Axboe	5d12f905cc	blk-mq: fix wrong usage of hctx->state vs hctx->flags BLK_MQ_F_* flags are for hctx->flags, and are non-atomic and set at registration time. BLK_MQ_S_* flags are dynamic and atomic, and are accessed through hctx->state. Some of the BLK_MQ_S_STOPPED uses were wrong. Additionally, the header file should not use a bit shift for the _S_ flags, as they are done through the set/test_bit functions. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-19 15:25:02 -06:00
Tejun Heo	4d3bb511b5	cgroup: drop const from @buffer of cftype->write_string() cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-03-19 10:23:54 -04:00
Eric W. Biederman	57a7744e09	net: Replace u64_stats_fetch_begin_bh to u64_stats_fetch_begin_irq Replace the bh safe variant with the hard irq safe variant. We need a hard irq safe variant to deal with netpoll transmitting packets from hard irq context, and we need it in most if not all of the places using the bh safe variant. Except on 32bit uni-processor the code is exactly the same so don't bother with a bh variant, just have a hard irq safe variant that everyone can use. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-14 22:41:36 -04:00
Jens Axboe	95363efde1	blk-mq: allow blk_mq_init_commands() to return failure If drivers do dynamic allocation in the hardware command init path, then we need to be able to handle and return failures. And if they do allocations or mappings in the init command path, then we need a cleanup function to free up that space at exit time. So add blk_mq_free_commands() as the cleanup function. This is required for the mtip32xx driver conversion to blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-14 10:43:15 -06:00
Jens Axboe	89f8b33ca1	block: remove old blk_iopoll_enabled variable This was a debugging measure to toggle enabled/disabled when testing. But for real production setups, it's not safe to toggle this setting without either reloading drivers of quiescing IO first. Neither of which the toggle enforces. Additionally, it makes drivers deal with the conditional state. Remove it completely. It's up to the driver whether iopoll is enabled or not. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-13 09:38:42 -06:00
Mike Snitzer	10beafc190	block: change flush sequence list addition back to front add Commit `18741986` inadvertently changed the rq flush insertion from a head to a tail insertion. Fix that back up. Signed-off-by: Mike Snitzer <msnitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-08 20:31:31 -07:00
Mike Snitzer	7982e90c3a	block: fix q->flush_rq NULL pointer crash on dm-mpath flush Commit `1874198` ("blk-mq: rework flush sequencing logic") switched ->flush_rq from being an embedded member of the request_queue structure to being dynamically allocated in blk_init_queue_node(). Request-based DM multipath doesn't use blk_init_queue_node(), instead it uses blk_alloc_queue_node() + blk_init_allocated_queue(). Because commit `1874198` placed the dynamic allocation of ->flush_rq in blk_init_queue_node() any flush issued to a dm-mpath device would crash with a NULL pointer, e.g.: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 PGD bb3c7067 PUD bb01d067 PMD 0 Oops: 0002 [#1] SMP ... CPU: 5 PID: 5028 Comm: dt Tainted: G W O 3.14.0-rc3.snitm+ #10 ... task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000 RIP: 0010:[<ffffffff8125037e>] [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 RSP: 0018:ffff880079565c98 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030 RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001 R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000 R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000 FS: 00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0 Stack: 0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47 ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000 ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8 Call Trace: [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0 [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210 [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320 [<ffffffff81250681>] ? blk_account_io_start+0x111/0x190 [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330 [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod] [<ffffffff812530c0>] generic_make_request+0xc0/0x100 [<ffffffff81253173>] submit_bio+0x73/0x140 [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80 [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0 [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60 [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20 [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20 [<ffffffff811b81f1>] do_fsync+0x41/0x80 [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80 [<ffffffff811b8260>] SyS_fsync+0x10/0x20 [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b Fix this by moving the ->flush_rq allocation from blk_init_queue_node() to blk_init_allocated_queue(). blk_init_queue_node() also calls blk_init_allocated_queue() so this change is functionality equivalent for all blk_init_queue_node() callers. Reported-by: Hannes Reinecke <hare@suse.de> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-08 17:20:01 -07:00
Shaohua Li	739c3eea71	blk-mq: add REQ_SYNC early Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init is set correctly. Signed-off-by: Shaohua Li<shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-07 08:15:28 -07:00
Roman Pen	af5040da01	blktrace: fix accounting of partially completed requests trace_block_rq_complete does not take into account that request can be partially completed, so we can get the following incorrect output of blkparser: C R 232 + 240 [0] C R 240 + 232 [0] C R 248 + 224 [0] C R 256 + 216 [0] but should be: C R 232 + 8 [0] C R 240 + 8 [0] C R 248 + 8 [0] C R 256 + 8 [0] Also, the whole output summary statistics of completed requests and final throughput will be incorrect. This patch takes into account real completion size of the request and fixes wrong completion accounting. Signed-off-by: Roman Pen <r.peniaev@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@redhat.com> CC: linux-kernel@vger.kernel.org Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-05 16:11:21 -07:00
Mike Galbraith	2a26ebef84	rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock [ 365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674 [ 365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1 [ 365.164043] no locks held by migration/1/26. [ 365.164044] irq event stamp: 6648 [ 365.164056] hardirqs last enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30 [ 365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120 [ 365.164070] softirqs last enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920 [ 365.164072] softirqs last disabled at (0): [< (null)>] (null) [ 365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF N 3.12.12-rt19-0.gcb6c4a2-rt #3 [ 365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011 [ 365.164091] 0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0 [ 365.164099] ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f [ 365.164107] ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0 [ 365.164108] Call Trace: [ 365.164119] [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0 [ 365.164127] [<ffffffff81004872>] dump_trace+0x92/0x360 [ 365.164133] [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60 [ 365.164138] [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0 [ 365.164143] [<ffffffff81006160>] show_stack+0x20/0x50 [ 365.164153] [<ffffffff815367e6>] dump_stack+0x54/0x9a [ 365.164163] [<ffffffff8108919c>] __might_sleep+0xfc/0x140 [ 365.164173] [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70 [ 365.164182] [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70 [ 365.164191] [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70 [ 365.164201] [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10 [ 365.164207] [<ffffffff810567be>] cpu_notify+0x1e/0x40 [ 365.164217] [<ffffffff81525da2>] take_cpu_down+0x22/0x40 [ 365.164223] [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120 [ 365.164229] [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0 [ 365.164235] [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380 [ 365.164241] [<ffffffff8107cbf8>] kthread+0xc8/0xd0 [ 365.164250] [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0 [ 365.164429] smpboot: CPU 1 is now offline Signed-off-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-03 09:34:10 -07:00
Frederic Weisbecker	c46fff2a3b	smp: Rename __smp_call_function_single() to smp_call_function_single_async() The name __smp_call_function_single() doesn't tell much about the properties of this function, especially when compared to smp_call_function_single(). The comments above the implementation are also misleading. The main point of this function is actually not to be able to embed the csd in an object. This is actually a requirement that result from the purpose of this function which is to raise an IPI asynchronously. As such it can be called with interrupts disabled. And this feature comes at the cost of the caller who then needs to serialize the IPIs on this csd. Lets rename the function and enhance the comments so that they reflect these properties. Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:47:15 -08:00
Frederic Weisbecker	fce8ad1568	smp: Remove wait argument from __smp_call_function_single() The main point of calling __smp_call_function_single() is to send an IPI in a pure asynchronous way. By embedding a csd in an object, a caller can send the IPI without waiting for a previous one to complete as is required by smp_call_function_single() for example. As such, sending this kind of IPI can be safe even when irqs are disabled. This flexibility comes at the expense of the caller who then needs to synchronize the csd lifecycle by himself and make sure that IPIs on a single csd are serialized. This is how __smp_call_function_single() works when wait = 0 and this usecase is relevant. Now there don't seem to be any usecase with wait = 1 that can't be covered by smp_call_function_single() instead, which is safer. Lets look at the two possible scenario: 1) The user calls __smp_call_function_single(wait = 1) on a csd embedded in an object. It looks like a nice and convenient pattern at the first sight because we can then retrieve the object from the IPI handler easily. But actually it is a waste of memory space in the object since the csd can be allocated from the stack by smp_call_function_single(wait = 1) and the object can be passed an the IPI argument. Besides that, embedding the csd in an object is more error prone because the caller must take care of the serialization of the IPIs for this csd. 2) The user calls __smp_call_function_single(wait = 1) on a csd that is allocated on the stack. It's ok but smp_call_function_single() can do it as well and it already takes care of the allocation on the stack. Again it's more simple and less error prone. Therefore, using the underscore prepend API version with wait = 1 is a bad pattern and a sign that the caller can do safer and more simple. There was a single user of that which has just been converted. So lets remove this option to discourage further users. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:47:09 -08:00
Jan Kara	6d113398dc	block: Stop abusing rq->csd.list in blk-softirq Abusing rq->csd.list for a list of requests to complete is rather ugly. We use rq->queuelist instead which is much cleaner. It is safe because queuelist is used by the block layer only for requests waiting to be submitted to a device. Thus it is unused when irq reports the request IO is finished. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:46:42 -08:00
Jan Kara	8b4922d317	block: Stop abusing csd.list for fifo_time Block layer currently abuses rq->csd.list.next for storing fifo_time. That is a terrible hack and completely unnecessary as well. Union achieves the same space saving in a cleaner way. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:46:32 -08:00
Christoph Hellwig	d6a25b3131	blk-mq: support partial I/O completions Add a new blk_mq_end_io_partial function to partially complete requests as needed by the SCSI layer. We do this by reusing blk_update_request to advance the bio instead of having a simplified version of it in the blk-mq code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-21 08:58:49 -08:00
Christoph Hellwig	feb71dae1f	blk-mq: merge blk_mq_insert_request and blk_mq_run_request It's almost identical to blk_mq_insert_request, so fold the two into one slightly more generic function by making the flush special case a bit smarted. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-21 08:58:48 -08:00
Christoph Hellwig	fd694131bb	blk-mq: remove blk_mq_alloc_rq There's only one caller, which is a straight wrapper and fits the naming scheme of the related functions a lot better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-21 08:58:47 -08:00
Jiri Kosina	d4263348f7	Merge branch 'master' into for-next	2014-02-20 14:54:28 +01:00
Masanari Iida	e227867f12	treewide: Fix typo in Documentation/DocBook This patch fix spelling typo in Documentation/DocBook. It is because .html and .xml files are generated by make htmldocs, I have to fix a typo within the source files. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-02-19 14:58:17 +01:00
Paul E. McKenney	ec6c676a08	block: Substitute rcu_access_pointer() for rcu_dereference_raw() (Trivial patch.) If the code is looking at the RCU-protected pointer itself, but not dereferencing it, the rcu_dereference() functions can be downgraded to rcu_access_pointer(). This commit makes this downgrade in blkg_destroy() and ioc_destroy_icq(), both of which simply compare the RCU-protected pointer against another pointer with no dereferencing. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-18 12:21:26 -08:00
Gideon Israel Dsouza	e3ebf0d457	block: Use macros from compiler.h instead of __attribute__((...)) To increase compiler portability there are several macros defined in <linux/compiler.h> for various gcc __attribute((..)) constructs. I've made sure gcc these specific were replaced with the right macro and an #include <linux/compiler.h> was placed where needed. Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-18 12:20:01 -08:00
Linus Torvalds	5e57dc8110	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block IO fixes from Jens Axboe: "Second round of updates and fixes for 3.14-rc2. Most of this stuff has been queued up for a while. The notable exception is the blk-mq changes, which are naturally a bit more in flux still. The pull request contains: - Two bug fixes for the new immutable vecs, causing crashes with raid or swap. From Kent. - Various blk-mq tweaks and fixes from Christoph. A fix for integrity bio's from Nic. - A few bcache fixes from Kent and Darrick Wong. - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas Swenson, and Roger Pau Monne. - Fix for a vec miscount with integrity vectors from Martin. - Minor annotations or fixes from Masanari Iida and Rashika Kheria. - Tweak to null_blk to do more normal FIFO processing of requests from Shlomo Pongratz. - Elevator switching bypass fix from Tejun. - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from me" * 'for-linus' of git://git.kernel.dk/linux-block: (31 commits) block: add cond_resched() to potentially long running ioctl discard loop xen-blkback: init persistent_purge_work work_struct blk-mq: pair blk_mq_start_request / blk_mq_requeue_request blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq block: Fix cloning of discard/write same bios block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show blk-mq: rework flush sequencing logic null_blk: use blk_complete_request and blk_mq_complete_request virtio_blk: use blk_mq_complete_request blk-mq: rework I/O completions fs: Add prototype declaration to appropriate header file include/linux/bio.h fs: Mark function as static in fs/bio-integrity.c block/null_blk: Fix completion processing from LIFO to FIFO block: Explicitly handle discard/write same segments block: Fix nr_vecs for inline integrity vectors blk-mq: Add bio_integrity setup to blk_mq_make_request blk-mq: initialize sg_reserved_size blk-mq: handle dma_drain_size blk-mq: divert __blk_put_request for MQ ops blk-mq: support at_head inserations for blk_execute_rq ...	2014-02-14 10:45:18 -08:00
Tejun Heo	924f0d9a20	cgroup: drop @skip_css from cgroup_taskset_for_each() If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching css. The intention of the interface is to make it easy to skip css's (cgroup_subsys_states) which already match the migration target; however, this is entirely unnecessary as migration taskset doesn't include tasks which are already in the target cgroup. Drop @skip_css from cgroup_taskset_for_each(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com>	2014-02-13 06:58:41 -05:00
Jens Axboe	c8123f8c9c	block: add cond_resched() to potentially long running ioctl discard loop When mkfs issues a full device discard and the device only supports discards of a smallish size, we can loop in blkdev_issue_discard() for a long time. If preempt isn't enabled, this can turn into a softlock situation and the kernel will start complaining. Add an explicit cond_resched() at the end of the loop to avoid that. Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-12 09:36:37 -07:00
Tejun Heo	e61734c55c	cgroup: remove cgroup->name cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	5f46990787	cgroup: update the meaning of cftype->max_write_len cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars. There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1. .max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with. This will also make transition to kernfs smoother. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Christoph Hellwig	49f5baa510	blk-mq: pair blk_mq_start_request / blk_mq_requeue_request Make sure we have a proper pairing between starting and requeueing requests. Move the dma drain and REQ_END setup into blk_mq_start_request, and make sure blk_mq_requeue_request properly undoes them, giving us a pair of function to prepare and unprepare a request without leaving side effects. Together this ensures we always clean up properly after BLK_MQ_RQ_QUEUE_BUSY returns from ->queue_rq. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-11 09:34:08 -07:00
Christoph Hellwig	1e93b8c274	blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq rq->errors never has been part of the communication protocol between drivers and the block stack and most drivers will not have initialized it. Return -EIO to upper layers when the driver returns BLK_MQ_RQ_QUEUE_ERROR unconditionally. If a driver want to return a different error it can easily do so by returning success after calling blk_mq_end_io itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-11 09:34:07 -07:00
Masanari Iida	11c9444407	block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show cppcheck detected following format string mismatch. [blk-mq-tag.c:201]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'int'. Change "cpu" from int to unsigned int, because the cpu never become minus value. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-10 10:39:53 -07:00
Christoph Hellwig	18741986a4	blk-mq: rework flush sequencing logic Witch to using a preallocated flush_rq for blk-mq similar to what's done with the old request path. This allows us to set up the request properly with a tag from the actually allowed range and ->rq_disk as needed by some drivers. To make life easier we also switch to dynamic allocation of ->flush_rq for the old path. This effectively reverts most of "blk-mq: fix for flush deadlock" and "blk-mq: Don't reserve a tag for flush request" Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-10 09:29:00 -07:00
Christoph Hellwig	30a91cb4ef	blk-mq: rework I/O completions Rework I/O completions to work more like the old code path. blk_mq_end_io now stays out of the business of deferring completions to others CPUs and calling blk_mark_rq_complete. The latter is very important to allow completing requests that have timed out and thus are already marked completed, the former allows using the IPI callout even for driver specific completions instead of having to reimplement them. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-10 09:27:31 -07:00
Tejun Heo	073219e995	cgroup: clean up cgroup_subsys names and initialization cgroup_subsys is a bit messier than it needs to be. * The name of a subsys can be different from its internal identifier defined in cgroup_subsys.h. Most subsystems use the matching name but three - cpu, memory and perf_event - use different ones. * cgroup_subsys_id enums are postfixed with _subsys_id and each cgroup_subsys is postfixed with _subsys. cgroup.h is widely included throughout various subsystems, it doesn't and shouldn't have claim on such generic names which don't have any qualifier indicating that they belong to cgroup. * cgroup_subsys->subsys_id should always equal the matching cgroup_subsys_id enum; however, we require each controller to initialize it and then BUG if they don't match, which is a bit silly. This patch cleans up cgroup_subsys names and initialization by doing the followings. * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each cgroup_subsys with _cgrp_subsys. * With the above, renaming subsys identifiers to match the userland visible names doesn't cause any naming conflicts. All non-matching identifiers are renamed to match the official names. cpu_cgroup -> cpu mem_cgroup -> memory perf -> perf_event * controllers no longer need to initialize ->subsys_id and ->name. They're generated in cgroup core and set automatically during boot. * Redundant cgroup_subsys declarations removed. * While updating BUG_ON()s in cgroup_init_early(), convert them to WARN()s. BUGging that early during boot is stupid - the kernel can't print anything, even through serial console and the trap handler doesn't even link stack frame properly for back-tracing. This patch doesn't introduce any behavior changes. v2: Rebased on top of `fe1217c4f3` ("net: net_cls: move cgroupfs classid handling into core"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Ingo Molnar <mingo@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>	2014-02-08 10:36:58 -05:00
Tejun Heo	3ed80a62bf	cgroup: drop module support With module supported dropped from net_prio, no controller is using cgroup module support. None of actual resource controllers can be built as a module and we aren't gonna add new controllers which don't control resources. This patch drops module support from cgroup. * cgroup_[un]load_subsys() and cgroup_subsys->module removed. * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(), cgroup_subsys.h now uses IS_ENABLED() directly. * enum cgroup_subsys_id now exactly matches the list of enabled controllers as ordered in cgroup_subsys.h. * cgroup_subsys[] is now a contiguously occupied array. Size specification is no longer necessary and dropped. * for_each_builtin_subsys() is removed and for_each_subsys() is updated to not require any locking. * module ref handling is removed from rebind_subsystems(). * Module related comments dropped. v2: Rebased on top of `fe1217c4f3` ("net: net_cls: move cgroupfs classid handling into core"). v3: Added {} around the if (need_forkexit_callback) block in cgroup_post_fork() for readability as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-08 10:36:58 -05:00
Kent Overstreet	5cb8850c9c	block: Explicitly handle discard/write same segments Immutable biovecs changed the way biovecs are interpreted - drivers no longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for using part of an existing segment without modifying it). This breaks with discards and write_same bios, since for those bi_size has nothing to do with segments in the biovec. So for now, we need a fairly gross hack - we fortunately know that there will never be more than one segment for the entire request, so we can special case discard/write_same. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 13:54:08 -07:00
Nicholas Bellinger	14ec77f352	blk-mq: Add bio_integrity setup to blk_mq_make_request This patch adds the missing bio_integrity_enabled() + bio_integrity_prep() setup into blk_mq_make_request() in order to use DIF protection with scsi-mq. Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 13:45:39 -07:00
Christoph Hellwig	1be036e946	blk-mq: initialize sg_reserved_size To behave the same way as the old request path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Christoph Hellwig	4f7f418c48	blk-mq: handle dma_drain_size Make blk-mq handle the dma_drain_size field the same way as the old request path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Christoph Hellwig	6f5ba581c0	blk-mq: divert __blk_put_request for MQ ops __blk_put_request needs to call into the blk-mq code just like blk_put_request. As we don't have the queue lock in this case both end up calling the same function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Christoph Hellwig	72a0a36e28	blk-mq: support at_head inserations for blk_execute_rq This is neede for proper SG_IO operation as well as various uses of blk_execute_rq from the SCSI midlayer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Linus Torvalds	4e13c5d021	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "The highlights this round include: - add support for SCSI Referrals (Hannes) - add support for T10 DIF into target core (nab + mkp) - add support for T10 DIF emulation in FILEIO + RAMDISK backends (Sagi + nab) - add support for T10 DIF -> bio_integrity passthrough in IBLOCK backend (nab) - prep changes to iser-target for >= v3.15 T10 DIF support (Sagi) - add support for qla2xxx N_Port ID Virtualization - NPIV (Saurav + Quinn) - allow percpu_ida_alloc() to receive task state bitmask (Kent) - fix >= v3.12 iscsi-target session reset hung task regression (nab) - fix >= v3.13 percpu_ref se_lun->lun_ref_active race (nab) - fix a long-standing network portal creation race (Andy)" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (51 commits) target: Fix percpu_ref_put race in transport_lun_remove_cmd target/iscsi: Fix network portal creation race target: Report bad sector in sense data for DIF errors iscsi-target: Convert gfp_t parameter to task state bitmask iscsi-target: Fix connection reset hang with percpu_ida_alloc percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask iscsi-target: Pre-allocate more tags to avoid ack starvation qla2xxx: Configure NPIV fc_vport via tcm_qla2xxx_npiv_make_lport qla2xxx: Enhancements to enable NPIV support for QLOGIC ISPs with TCM/LIO. qla2xxx: Fix scsi_host leak on qlt_lport_register callback failure IB/isert: pass scatterlist instead of cmd to fast_reg_mr routine IB/isert: Move fastreg descriptor creation to a function IB/isert: Avoid frwr notation, user fastreg IB/isert: seperate connection protection domains and dma MRs tcm_loop: Enable DIF/DIX modes in SCSI host LLD target/rd: Add DIF protection into rd_execute_rw target/rd: Add support for protection SGL setup + release target/rd: Refactor rd_build_device_space + rd_release_device_space target/file: Add DIF protection support to fd_execute_rw target/file: Add DIF protection init/format support ...	2014-01-31 15:31:23 -08:00
Tejun Heo	556ee818c0	block: __elv_next_request() shouldn't call into the elevator if bypassing request_queue bypassing is used to suppress higher-level function of a request_queue so that they can be switched, reconfigured and shut down. A request_queue does the followings while bypassing. * bypasses elevator and io_cq association and queues requests directly to the FIFO dispatch queue. * bypasses block cgroup request_list lookup and always uses the root request_list. Once confirmed to be bypassing, specific elevator and block cgroup policy implementations can assume that nothing is in flight for them and perform various operations which would be dangerous otherwise. Such confirmation is acheived by short-circuiting all new requests directly to the dispatch queue and waiting for all the requests which were issued before to finish. Unfortunately, while the request allocating and draining sides were properly handled, we forgot to actually plug the request dispatch path. Even after bypassing mode is confirmed, if the attached driver tries to fetch a request and the dispatch queue is empty, __elv_next_request() would invoke the current elevator's elevator_dispatch_fn() callback. As all in-flight requests were drained, the elevator wouldn't contain any request but once bypass is confirmed we don't even know whether the elevator is even there. It might be in the process of being switched and half torn down. Frank Mayhar reports that this actually happened while switching elevators, leading to an oops. Let's fix it by making __elv_next_request() avoid invoking the elevator_dispatch_fn() callback if the queue is bypassing. It already avoids invoking the callback if the queue is dying. As a dying queue is guaranteed to be bypassing, we can simply replace blk_queue_dying() check with blk_queue_bypass(). Reported-by: Frank Mayhar <fmayhar@google.com> References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com Cc: stable@vger.kernel.org Tested-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-30 12:57:25 -07:00
Shaohua Li	f0276924fa	blk-mq: Don't reserve a tag for flush request Reserving a tag (request) for flush to avoid dead lock is a overkill. A tag is valuable resource. We can track the number of flush requests and disallow having too many pending flush requests allocated. With this patch, blk_mq_alloc_request_pinned() could do a busy nop (but not a dead loop) if too many pending requests are allocated and new flush request is allocated. But this should not be a problem, too many pending flush requests are very rare case. I verified this can fix the deadlock caused by too many pending flush requests. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-30 12:57:25 -07:00
Linus Torvalds	53d8ab29f8	Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block Pull block IO driver changes from Jens Axboe: - bcache update from Kent Overstreet. - two bcache fixes from Nicholas Swenson. - cciss pci init error fix from Andrew. - underflow fix in the parallel IDE pg_write code from Dan Carpenter. I'm sure the 1 (or 0) users of that are now happy. - two PCI related fixes for sx8 from Jingoo Han. - floppy init fix for first block read from Jiri Kosina. - pktcdvd error return miss fix from Julia Lawall. - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from Michael Opdenacker. - comment typo fix for the loop driver from Olaf Hering. - potential oops fix for null_blk from Raghavendra K T. - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing an OOM problem and a problem with handling security locked conditions * 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits) mg_disk: Spelling s/finised/finished/ null_blk: Null pointer deference problem in alloc_page_buffers mtip32xx: Correctly handle security locked condition mtip32xx: Make SGL container per-command to eliminate high order dma allocation drivers/block/loop.c: fix comment typo in loop_config_discard drivers/block/cciss.c:cciss_init_one(): use proper errnos drivers/block/paride/pg.c: underflow bug in pg_write() drivers/block/sx8.c: remove unnecessary pci_set_drvdata() drivers/block/sx8.c: use module_pci_driver() floppy: bail out in open() if drive is not responding to block0 read bcache: Fix auxiliary search trees for key size > cacheline size bcache: Don't return -EINTR when insert finished bcache: Improve bucket_prio() calculation bcache: Add bch_bkey_equal_header() bcache: update bch_bkey_try_merge bcache: Move insert_fixup() to btree_keys_ops bcache: Convert sorting to btree_keys bcache: Convert debug code to btree_keys bcache: Convert btree_iter to struct btree_keys bcache: Refactor bset_tree sysfs stats ...	2014-01-30 11:40:10 -08:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
Andrew Morton	381d3ee33b	block/blk-mq-cpu.c: use hotcpu_notifier() Cleaner, reduces text size when cpu hotplug is disabled. Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-28 09:52:01 -07:00
Kent Overstreet	6f6b5d1ec5	percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask This patch changes percpu_ida_alloc() + callers to accept task state bitmask for prepare_to_wait() for code like target/iscsi that needs it for interruptible sleep, that is provided in a subsequent patch. It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep waiting for a new tag, or TASK_RUNNING when the caller cannot sleep, and is forced to return a negative value when no tags are available. v2 changes: - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes - Drop signal_pending_state() call v3 changes: - Only call prepare_to_wait() + finish_wait() when != TASK_RUNNING (PeterZ) Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: <stable@vger.kernel.org> #3.12+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>	2014-01-23 20:17:18 +00:00
Christian Engelmayer	17a05cca99	block: Fix memory leak in rw_copy_check_uvector() handling Fix a memory leak in the error handling path of function sg_io() that is used during the processing of scsi ioctl. Memory already allocated by rw_copy_check_uvector() needs to be freed correctly. Detected by Coverity: CID 1128953. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-21 20:36:17 -08:00
CaiZhiyong	bca266b388	block: remove unrelated header files and export symbol Fix up the following items: - remove unrelated header files. - export interface function. - modify function cmdline_parts_parse return value, this will make it more friendly for the caller. Signed-off-by: CaiZhiyong <caizhiyong@huawei.com> Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com> CC: Brian Norris <computersforpeace@gmail.com> Cc: "Wanglin (Albert)" <albert.wanglin@hisilicon.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Karel Zak <kzak@redhat.com> Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-21 20:18:26 -08:00
Linus Torvalds	f075e0f699	Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "The bulk of changes are cleanups and preparations for the upcoming kernfs conversion. - cgroup_event mechanism which is and will be used only by memcg is moved to memcg. - pidlist handling is updated so that it can be served by seq_file. Also, the list is not sorted if sane_behavior. cgroup documentation explicitly states that the file is not sorted but it has been for quite some time. - All cgroup file handling now happens on top of seq_file. This is to prepare for kernfs conversion. In addition, all operations are restructured so that they map 1-1 to kernfs operations. - Other cleanups and low-pri fixes" * 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits) cgroup: trivial style updates cgroup: remove stray references to css_id doc: cgroups: Fix typo in doc/cgroups cgroup: fix fail path in cgroup_load_subsys() cgroup: fix missing unlock on error in cgroup_load_subsys() cgroup: remove for_each_root_subsys() cgroup: implement for_each_css() cgroup: factor out cgroup_subsys_state creation into create_css() cgroup: combine css handling loops in cgroup_create() cgroup: reorder operations in cgroup_create() cgroup: make for_each_subsys() useable under cgroup_root_mutex cgroup: css iterations and css_from_dir() are safe under cgroup_mutex cgroup: unify pidlist and other file handling cgroup: replace cftype->read_seq_string() with cftype->seq_show() cgroup: attach cgroup_open_file to all cgroup files cgroup: generalize cgroup_pidlist_open_file cgroup: unify read path so that seq_file is always used cgroup: unify cgroup_write_X64() and cgroup_write_string() cgroup: remove cftype->read(), ->read_map() and ->write() hugetlb_cgroup: convert away from cftype->read() ...	2014-01-21 17:51:34 -08:00
Dave Hansen	6753471c0c	blk-mq: uses page->list incorrectly 'struct page' has two list_head fields: 'lru' and 'list'. Conveniently, they are unioned together. This means that code can use them interchangably, which gets horribly confusing. The blk-mq made the logical decision to try to use page->list. But, that field was actually introduced just for the slub code. ->lru is the right field to use outside of slab/slub. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-08 20:17:46 -07:00
Christoph Hellwig	3d6efbf62c	blk-mq: use __smp_call_function_single directly __smp_call_function_single already avoids multiple IPIs by internally queing up the items, and now also is available for non-SMP builds as a trivially correct stub, so there is no need to wrap it. If the additional lock roundtrip cause problems my patch to convert the generic IPI code to llists is waiting to get merged will fix it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-08 14:31:27 -07:00
Kent Overstreet	c78afc6261	bcache/md: Use raid stripe size Now that we've got code for raid5/6 stripe awareness, bcache just needs to know about the stripes and when writing partial stripes is expensive - we probably don't want to enable this optimization for raid1 or 10, even though they have stripes. So add a flag to queue_limits. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:09 -08:00
Ming Lei	0fec08b4ec	blk-mq: fix initializing request's start time blk_rq_init() is called in req's complete handler to initialize the request, so the members of start_time and start_time_ns might become inaccurate when it is allocated in future. The patch initializes the two members in blk_mq_rq_ctx_init() to fix the problem. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-03 10:00:08 -07:00
Ming Lei	3edcc0ce85	block: blk-mq: don't export blk_mq_free_queue() blk_mq_free_queue() is called from release handler of queue kobject, so it needn't be called from drivers. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-31 09:53:05 -07:00
Ming Lei	f04c1fe761	block: blk-mq: make blk_sync_queue support mq This patch moves synchronization on mq->delay_work from blk_mq_free_queue() to blk_sync_queue(), so that blk_sync_queue can work on mq. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-31 09:53:05 -07:00
Ming Lei	43a5e4e219	block: blk-mq: support draining mq queue blk_mq_drain_queue() is introduced so that we can drain mq queue inside blk_cleanup_queue(). Also don't accept new requests any more if queue is marked as dying. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-31 09:53:05 -07:00
Jens Axboe	b28bc9b38c	Linux 3.13-rc6 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU= =/4/R -----END PGP SIGNATURE----- Merge tag 'v3.13-rc6' into for-3.14/core Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c	2013-12-31 09:51:02 -07:00
Andrey Vagin	8515736604	block: fix memory leaks on unplugging block device All objects, which are allocated in blk_mq_register_disk, must be released in blk_mq_unregister_disk. I use a KVM virtual machine and virtio disk to reproduce this issue. kmemleak: 18 new suspected memory leaks (see /sys/kernel/debug/kmemleak) $ cat /sys/kernel/debug/kmemleak \| head -n 30 unreferenced object 0xffff8800b6636150 (size 8): comm "kworker/0:2", pid 65, jiffies 4294809903 (age 86.358s) hex dump (first 8 bytes): 76 69 72 74 69 6f 34 00 virtio4. backtrace: [<ffffffff8165d41e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8118cfc5>] __kmalloc_track_caller+0xf5/0x260 [<ffffffff81155b11>] kstrdup+0x31/0x60 [<ffffffff812242be>] sysfs_new_dirent+0x2e/0x140 [<ffffffff81224678>] create_dir+0x38/0xe0 [<ffffffff812249e3>] sysfs_create_dir_ns+0x73/0xc0 [<ffffffff8130dfa9>] kobject_add_internal+0xc9/0x340 [<ffffffff8130e535>] kobject_add+0x65/0xb0 [<ffffffff813f34f8>] device_add+0x128/0x660 [<ffffffff813f3a4a>] device_register+0x1a/0x20 [<ffffffff813ae6f8>] register_virtio_device+0x98/0xe0 [<ffffffff813b0cce>] virtio_pci_probe+0x12e/0x1c0 [<ffffffff81340675>] local_pci_probe+0x45/0xa0 [<ffffffff81341a51>] pci_device_probe+0x121/0x130 [<ffffffff813f67f7>] driver_probe_device+0x87/0x390 [<ffffffff813f6b3b>] __device_attach+0x3b/0x40 unreferenced object 0xffff8800b65aa1d8 (size 144): Fixes: `320ae51fee` (blk-mq: new multi-queue block IO queueing mechanism) Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-06 09:18:02 -07:00
Linus Torvalds	5ee540613d	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current series. It contains: - A fix for a use-after-free of a request in blk-mq. From Ming Lei - A fix for a blk-mq bug that could attempt to dereference a NULL rq if allocation failed - Two xen-blkfront small fixes - Cleanup of submit_bio_wait() type uses in the kernel, unifying that. From Kent - A fix for 32-bit blkg_rwstat reading. I apologize for this one looking mangled in the shortlog, it's entirely my fault for missing an empty line between the description and body of the text" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix use-after-free of request blk-mq: fix dereference of rq->mq_ctx if allocation fails block: xen-blkfront: Fix possible NULL ptr dereference xen-blkfront: Silence pfn maybe-uninitialized warning block: submit_bio_wait() conversions Update of blkg_stat and blkg_rwstat may happen in bh context	2013-12-05 15:33:27 -08:00
Ming Lei	0d11e6aca3	blk-mq: fix use-after-free of request If accounting is on, we will do the IO completion accounting after we have freed the request. Fix that by moving it sooner instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-05 10:50:39 -07:00
Tejun Heo	2da8ca822d	cgroup: replace cftype->read_seq_string() with cftype->seq_show() In preparation of conversion to kernfs, cgroup file handling is updated so that it can be easily mapped to kernfs. This patch replaces cftype->read_seq_string() with cftype->seq_show() which is not limited to single_open() operation and will map directcly to kernfs seq_file interface. The conversions are mechanical. As ->seq_show() doesn't have @css and @cft, the functions which make use of them are converted to use seq_css() and seq_cft() respectively. In several occassions, e.f. if it has seq_string in its name, the function name is updated to fit the new method better. This patch does not introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Neil Horman <nhorman@tuxdriver.com>	2013-12-05 12:28:04 -05:00
Kent Overstreet	2b8221e181	block: Really silence spurious compiler warnings The uninitialized_var() macro appears to not work on structs... Get rid of it, and manually initialize instead. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-03 14:29:09 -07:00
Jeff Moyer	959a35f13e	blk-mq: fix dereference of rq->mq_ctx if allocation fails If __GFP_WAIT isn't set and we fail allocating, when we go to drop the reference on the ctx, we will attempt to dereference the NULL rq. Fix that. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-03 14:24:28 -07:00
Kent Overstreet	3f273d301b	block: Silence spurious compiler warnings Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-26 17:50:37 -07:00
Kent Overstreet	c170bbb45f	block: submit_bio_wait() conversions It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-24 16:33:41 -07:00
Kent Overstreet	f619d25460	block: Kill bio_iovec_idx(), __bio_iovec() bio_iovec_idx() and __bio_iovec() don't have any valid uses anymore - previous users have been converted to bio_iovec_iter() or other methods. __BVEC_END() has to go too - the bvec array can't be used directly for the last biovec because we might only be using the first portion of it, we have to iterate over the bvec array with bio_for_each_segment() which checks against the current value of bi_iter.bi_size. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>	2013-11-23 22:33:53 -08:00
Kent Overstreet	d57a5f7c66	bio-integrity: Convert to bvec_iter The bio integrity is also stored in a bvec array, so if we use the bvec iter code we just added, the integrity code won't need to implement its own iteration stuff (bio_integrity_mark_head(), bio_integrity_mark_tail()) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com>	2013-11-23 22:33:50 -08:00
Kent Overstreet	7988613b0e	block: Convert bio_for_each_segment() to bvec_iter More prep work for immutable biovecs - with immutable bvecs drivers won't be able to use the biovec directly, they'll need to use helpers that take into account bio->bi_iter.bi_bvec_done. This updates callers for the new usage without changing the implementation yet. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: Jim Paris <jim@jtan.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: support@lsi.com Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Cc: linux-m68k@lists.linux-m68k.org Cc: linuxppc-dev@lists.ozlabs.org Cc: drbd-user@lists.linbit.com Cc: nbd-general@lists.sourceforge.net Cc: cbe-oss-dev@lists.ozlabs.org Cc: xen-devel@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: DL-MPTFusionLinux@lsi.com Cc: linux-scsi@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com Cc: linux-mm@kvack.org Acked-by: Geoff Levand <geoff@infradead.org>	2013-11-23 22:33:49 -08:00
Kent Overstreet	4f024f3797	block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6	2013-11-23 22:33:47 -08:00
Kent Overstreet	33879d4512	block: submit_bio_wait() conversions It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de>	2013-11-23 22:33:38 -08:00
Antti P Miettinen	49204c116a	block/partitions/efi.c: fix bound check Use ARRAY_SIZE instead of sizeof to get proper max for label length. Since this is just a read out of bounds it's not that bad, but the problem becomes user-visible eg if one tries to use DEBUG_PAGEALLOC and DEBUG_RODATA, at least with some enhancements from Hiroshi. Of course the destination array can contain garbage when we read beyond the end of source array so that would be another user-visible problem. Signed-off-by: Antti P Miettinen <amiettinen@nvidia.com> Reviewed-by: Hiroshi Doyu <hdoyu@nvidia.com> Tested-by: Hiroshi Doyu <hdoyu@nvidia.com> Cc: Will Drewry <wad@chromium.org> Cc: Matt Fleming <matt.fleming@intel.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-11-21 16:42:27 -08:00
Hong Zhiguo	2c575026fa	Update of blkg_stat and blkg_rwstat may happen in bh context. While u64_stats_fetch_retry is only preempt_disable on 32bit UP system. This is not enough to avoid preemption by bh and may read strange 64 bit value. Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-20 15:33:04 -07:00
Jens Axboe	01b983c9fc	blk-mq: add blktrace insert event trace We need it to make 'btt' from blktrace happy, otherwise we are missing one state transition. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-19 19:00:45 -07:00
Jens Axboe	94eddfbeaa	blk-mq: ensure that we set REQ_IO_STAT so diskstats work If disk stats are enabled on the queue, a request needs to be marked with REQ_IO_STAT for accounting to be active on that request. This fixes an issue with virtio-blk not showing up in /proc/diskstats after the conversion to blk-mq. Add QUEUE_FLAG_MQ_DEFAULT, setting stats and same cpu-group completion on by default. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-19 09:25:07 -07:00
Linus Torvalds	f412f2c60b	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull second round of block driver updates from Jens Axboe: "As mentioned in the original pull request, the bcache bits were pulled because of their dependency on the immutable bio vecs. Kent re-did this part and resubmitted it, so here's the 2nd round of (mostly) driver updates for 3.13. It contains: - The bcache work from Kent. - Conversion of virtio-blk to blk-mq. This removes the bio and request path, and substitutes with the blk-mq path instead. The end result almost 200 deleted lines. Patch is acked by Asias and Christoph, who both did a bunch of testing. - A removal of bootmem.h include from Grygorii Strashko, part of a larger series of his killing the dependency on that header file. - Removal of __cpuinit from blk-mq from Paul Gortmaker" * 'for-linus' of git://git.kernel.dk/linux-block: (56 commits) virtio_blk: blk-mq support blk-mq: remove newly added instances of __cpuinit bcache: defensively handle format strings bcache: Bypass torture test bcache: Delete some slower inline asm bcache: Use ida for bcache block dev minor bcache: Fix sysfs splat on shutdown with flash only devs bcache: Better full stripe scanning bcache: Have btree_split() insert into parent directly bcache: Move spinlock into struct time_stats bcache: Kill sequential_merge option bcache: Kill bch_next_recurse_key() bcache: Avoid deadlocking in garbage collection bcache: Incremental gc bcache: Add make_btree_freeing_key() bcache: Add btree_node_write_sync() bcache: PRECEDING_KEY() bcache: bch_(btree\|extent)_ptr_invalid() bcache: Don't bother with bucket refcount for btree node allocations bcache: Debug code improvements ...	2013-11-15 16:33:41 -08:00
Christoph Hellwig	0a06ff068f	kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS We've switched over every architecture that supports SMP to it, so remove the new useless config variable. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-11-15 09:32:22 +09:00
Paul Gortmaker	f618ef7c47	blk-mq: remove newly added instances of __cpuinit The new blk-mq code added new instances of __cpuinit usage. We removed this a couple versions ago; we now want to remove the compat no-op stubs. Introducing new users is not what we want to see at this point in time, as it will break once the stubs are gone. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-14 08:26:02 -07:00
Linus Torvalds	5e30025a31	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking changes from Ingo Molnar: "The biggest changes: - add lockdep support for seqcount/seqlocks structures, this unearthed both bugs and required extra annotation. - move the various kernel locking primitives to the new kernel/locking/ directory" * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) block: Use u64_stats_init() to initialize seqcounts locking/lockdep: Mark __lockdep_count_forward_deps() as static lockdep/proc: Fix lock-time avg computation locking/doc: Update references to kernel/mutex.c ipv6: Fix possible ipv6 seqlock deadlock cpuset: Fix potential deadlock w/ set_mems_allowed seqcount: Add lockdep functionality to seqcount/seqlock structures net: Explicitly initialize u64_stats_sync structures for lockdep locking: Move the percpu-rwsem code to kernel/locking/ locking: Move the lglocks code to kernel/locking/ locking: Move the rwsem code to kernel/locking/ locking: Move the rtmutex code to kernel/locking/ locking: Move the semaphore core to kernel/locking/ locking: Move the spinlock code to kernel/locking/ locking: Move the lockdep code to kernel/locking/ locking: Move the mutex code to kernel/locking/ hung_task debugging: Add tracepoint to report the hang x86/locking/kconfig: Update paravirt spinlock Kconfig description lockstat: Report avg wait and hold times lockdep, x86/alternatives: Drop ancient lockdep fixup message ...	2013-11-14 16:30:30 +09:00
Linus Torvalds	0910c0bdf7	Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and has code to handle both, since things like merging only works in the rq model and hence is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even shows an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request is coming to convert virtio-blk to this model will be will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer. - A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ...	2013-11-14 12:08:14 +09:00
Linus Torvalds	8ceafbfa91	Merge branch 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm Pull DMA mask updates from Russell King: "This series cleans up the handling of DMA masks in a lot of drivers, fixing some bugs as we go. Some of the more serious errors include: - drivers which only set their coherent DMA mask if the attempt to set the streaming mask fails. - drivers which test for a NULL dma mask pointer, and then set the dma mask pointer to a location in their module .data section - which will cause problems if the module is reloaded. To counter these, I have introduced two helper functions: - dma_set_mask_and_coherent() takes care of setting both the streaming and coherent masks at the same time, with the correct error handling as specified by the API. - dma_coerce_mask_and_coherent() which resolves the problem of drivers forcefully setting DMA masks. This is more a marker for future work to further clean these locations up - the code which creates the devices really should be initialising these, but to fix that in one go along with this change could potentially be very disruptive. The last thing this series does is prise away some of Linux's addition to "DMA addresses are physical addresses and RAM always starts at zero". We have ARM LPAE systems where all system memory is above 4GB physical, hence having DMA masks interpreted by (eg) the block layers as describing physical addresses in the range 0..DMAMASK fails on these platforms. Santosh Shilimkar addresses this in this series; the patches were copied to the appropriate people multiple times but were ignored. Fixing this also gets rid of some ARM weirdness in the setup of the maxpfn variables, and brings ARM into line with every other Linux architecture as far as those go" 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm: (52 commits) ARM: 7805/1: mm: change maxpfn to include the physical offset of memory ARM: 7797/1: mmc: Use dma_max_pfn(dev) helper for bounce_limit calculations ARM: 7796/1: scsi: Use dma_max_pfn(dev) helper for bounce_limit calculations ARM: 7795/1: mm: dma-mapping: Add dma_max_pfn(dev) helper function ARM: 7794/1: block: Rename parameter dma_mask to max_addr for blk_queue_bounce_limit() ARM: DMA-API: better handing of DMA masks for coherent allocations ARM: 7857/1: dma: imx-sdma: setup dma mask DMA-API: firmware/google/gsmi.c: avoid direct access to DMA masks DMA-API: dcdbas: update DMA mask handing DMA-API: dma: edma.c: no need to explicitly initialize DMA masks DMA-API: usb: musb: use platform_device_register_full() to avoid directly messing with dma masks DMA-API: crypto: remove last references to 'static struct device dev' DMA-API: crypto: fix ixp4xx crypto platform device support DMA-API: others: use dma_set_coherent_mask() DMA-API: staging: use dma_set_coherent_mask() DMA-API: usb: use new dma_coerce_mask_and_coherent() DMA-API: usb: use dma_set_coherent_mask() DMA-API: parport: parport_pc.c: use dma_coerce_mask_and_coherent() DMA-API: net: octeon: use dma_coerce_mask_and_coherent() DMA-API: net: nxp/lpc_eth: use dma_coerce_mask_and_coherent() ...	2013-11-14 07:55:21 +09:00
Peter Zijlstra	90d3839b90	block: Use u64_stats_init() to initialize seqcounts Now that seqcounts are lockdep enabled objects, we need to explicitly initialize runtime allocated seqcounts so that lockdep can track them. Without this patch, Fengguang was seeing: [ 4.127282] INFO: trying to register non-static key. [ 4.128027] the code is fine but needs lockdep annotation. [ 4.128027] turning off the locking correctness validator. [ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2 [ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ ... ] [ 4.128027] Call Trace: [ 4.128027] [<7908e744>] ? console_unlock+0x353/0x380 [ 4.128027] [<79dc7cf2>] dump_stack+0x48/0x60 [ 4.128027] [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb [ 4.128027] [<7908a1c5>] lock_acquire+0x71/0x9a [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153 [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<794079aa>] blk_throtl_bio+0x1c3/0x485 ... Use u64_stats_init() for all affected data structures, which initializes the seqcount. Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-11-13 13:54:08 +01:00
Grygorii Strashko	d17ab45927	block: cleanup removing dependency on bootmem headers Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 19:43:48 -07:00
Jens Axboe	e37459b8e2	Merge branch 'blk-mq/core' into for-3.13/core Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-timeout.c	2013-11-08 09:08:12 -07:00
Duan Jiong	c7d1ba417c	block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:05:31 -07:00
Duan Jiong	8616ebb16b	block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:05:30 -07:00
Geert Uytterhoeven	97597dc08f	block: Do not call sector_div() with a 64-bit divisor do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions of 64-bit number by 32-bit numbers. Passing 64-bit divisor types caused issues in the past on 32-bit platforms, cfr. commit `ea077b1b96` ("m68k: Truncate base in do_div()"). As queue_limits.max_discard_sectors and .discard_granularity are unsigned int, max_discard_sectors and granularity should be unsigned int. As bdev_discard_alignment() returns int, alignment should be int. Now 2 calls to sector_div() can be replaced by 32-bit arithmetic: - The 64-bit modulo operation can become a 32-bit modulo operation, - The 64-bit division and multiplication can be replaced by a 32-bit modulo operation and a subtraction. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:04:46 -07:00
Kent Overstreet	e0ce0eacb3	block: Use rw_copy_check_uvector() No need for silly open coding - and struct sg_iovec has exactly the same layout as struct iovec... Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:02:14 -07:00
Alireza Haghdoost	23779fbc99	block: Enable sysfs nomerge control for I/O requests in the plug list This patch enables the sysfs to control I/O request merge functionality in the plug list. While this control has been implemented for the request queue, it was dismissed in the plug list. Therefore, block layer merges requests together (or attempt to merge) even if the merge capability was disable using sysfs nomerge parameter value 2. This limitation is directly affects functionality of io_submit() system call. The system call enables user to submit a bunch of IO requests from user space using struct iocb **ios input argument. However, the unconditioned merging functionality in the plug list potentially merges these requests together down the road. Therefore, there is no way to distinguish between an application sending bunch of sequential IOs and an application sending one big IO. Ultimately, all requests generated by the former app merge within the plug list together and looks similar to the second app. While the merging functionality is a desirable feature to improve the performance of IO subsystem for some applications, it is not useful for other application like ours at all. Signed-off-by: Alireza Haghdoost <alireza@cs.umn.edu> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Coding style modified. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:00:22 -07:00
Mike Snitzer	d82ae52e68	block: properly stack underlying max_segment_size to DM device Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE (65536) even if the underlying device(s) have a larger value -- this is due to blk_stack_limits() using min_not_zero() when stacking the max_segment_size limit. 1073741824 before patch: 65536 after patch: 1073741824 Reported-by: Lukasz Flis <l.flis@cyfronet.pl> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.3+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:00:17 -07:00
Tomoki Sekiyama	7c8a3679e3	elevator: acquire q->sysfs_lock in elevator_change() Add locking of q->sysfs_lock into elevator_change() (an exported function) to ensure it is held to protect q->elevator from elevator_init(), even if elevator_change() is called from non-sysfs paths. sysfs path (elv_iosched_store) uses __elevator_change(), non-locking version, as the lock is already taken by elv_iosched_store(). Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:00:13 -07:00
Tomoki Sekiyama	eb1c160b22	elevator: Fix a race in elevator switching and md device initialization The soft lockup below happens at the boot time of the system using dm multipath and the udev rules to switch scheduler. [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483] [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50 ... [ 356.127001] Call Trace: [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70 [ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230 [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60 [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0 [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50 [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0 [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50 [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0 [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140 [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0 [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0 [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b This is caused by a race between md device initialization by multipathd and shell script to switch the scheduler using sysfs. - multipathd: SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init q->elevator = elevator_alloc(q, e); // not yet initialized - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler': elevator_switch (in the call trace above) struct elevator_queue old = q->elevator; q->elevator = elevator_alloc(q, new_e); elevator_exit(old); // lockup! () - multipathd: (cont.) err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely while timer->base == NULL. In this case, as timer will never initialized, it results in lockup. This patch introduces acquisition of q->sysfs_lock around elevator_init() into blk_init_allocated_queue(), to provide mutual exclusion between initialization of the q->scheduler and switching of the scheduler. This should fix this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=902012 Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:00:08 -07:00
Christoph Lameter	170d800af8	block: Replace __get_cpu_var uses __get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. At the end of the patch set all uses of __get_cpu_var have been removed so the macro is removed too. The patch set includes passes over all arches as well. Once these operations are used throughout then specialized macros can be defined in non -x86 arches as well in order to optimize per cpu access by f.e. using a global register that may be set to the per cpu base. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int x = &__get_cpu_var(y); Converts to int x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to this_cpu_inc(y) Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 08:59:58 -07:00
Mikulas Patocka	fff4996b7d	blk-core: Fix memory corruption if blkcg_init_queue fails If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy to clean up structures allocated by the backing dev. ------------[ cut here ]------------ WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0() ODEBUG: free active (active state 0) object type: percpu_counter hint: (null) Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix CPU: 0 PID: 2739 Comm: lvchange Tainted: G W 3.10.15-devel #14 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20 ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8 ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67 Call Trace: [<ffffffff813c8fd4>] dump_stack+0x19/0x1b [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0 [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50 [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250 [<ffffffff81229a15>] debug_print_object+0x85/0xa0 [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250 [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0 [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0 [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10 [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod] [<ffffffff81097662>] ? get_lock_stats+0x22/0x70 [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod] [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520 [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0 [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f ---[ end trace 4b5ff0d55673d986 ]--- ------------[ cut here ]------------ This fix should be backported to stable kernels starting with 2.6.37. Note that in the kernels prior to 3.5 the affected code is different, but the bug is still there - bdi_init is called and bdi_destroy isn't. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org # 2.6.37+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 08:59:17 -07:00
Jeff Moyer	4912aa6c11	block: fix race between request completion and timeout handling crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461 RIP: 0010:[<ffffffff8124e424>] [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP: 0018:ffff881057eefd60 EFLAGS: 00010012 RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8 RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780 RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338 R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000 FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540) Stack: 0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000 <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393 <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90 Call Trace: [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150 [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850 [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20 [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160 [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660 [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660 [<ffffffff810908c6>] kthread+0x96/0xa0 [<ffffffff8100c14a>] child_rip+0xa/0x20 [<ffffffff81090830>] ? kthread+0x0/0xa0 [<ffffffff8100c140>] ? child_rip+0x0/0x20 Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 RIP [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP <ffff881057eefd60> The RIP is this line: BUG_ON(blk_queued_rq(rq)); After digging through the code, I think there may be a race between the request completion and the timer handler running. A timer is started for each request put on the device's queue (see blk_start_request->blk_add_timer). If the request does not complete before the timer expires, the timer handler (blk_rq_timed_out_timer) will mark the request complete atomically: static inline int blk_mark_rq_complete(struct request rq) { return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags); } and then call blk_rq_timed_out. The latter function will call scsi_times_out, which will return one of BLK_EH_HANDLED, BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is returned, blk_clear_rq_complete is called, and blk_add_timer is again called to simply wait longer for the request to complete. Now, if the request happens to complete while this is going on, what happens? Given that we know the completion handler will bail if it finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion handler running after that bit is cleared. So, from the above paragraph, after the call to blk_clear_rq_complete. If the completion sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom there (I haven't seen this in the cores). Next, if we get the completion before the call to list_add_tail, then the timer will eventually fire for an old req, which may either be freed or reallocated (there is evidence that this might be the case). Finally, if the completion comes in after* the addition to the timeout list, I think it's harmless. The request will be removed from the timeout list, req_atom_complete will be set, and all will be well. This will only actually explain the coredumps IF the request structure was freed, reallocated and queued before the error handler thread had a chance to process it. That is possible, but it may make sense to keep digging for another race. I think that if this is what was happening, we would see other instances of this problem showing up as null pointer or garbage pointer dereferences, for example when the request structure was not re-used. It looks like we actually do run into that situation in other reports. This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); from blk_add_timer to the only caller that could trip over it (blk_start_request). It then inverts the calls to blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address the race. I've boot tested this patch, but nothing more. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 08:59:04 -07:00
Santosh Shilimkar	9f7e45d83e	ARM: 7794/1: block: Rename parameter dma_mask to max_addr for blk_queue_bounce_limit() The blk_queue_bounce_limit() API parameter 'dma_mask' is actually the maximum address the device can handle rather than a dma_mask. Rename it accordingly to avoid it being interpreted as dma_mask. No functional change. The idea is to fix the bad assumptions about dma_mask wherever it could be miss-interpreted. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2013-10-31 14:49:22 +00:00
Jens Axboe	e7e2450001	blk-mq: don't disallow request merges for req->special being set For blk-mq, if a driver has requested per-request payload data to carry command structures, they are stuffed into req->special. For an old style request based driver, req->special is used for the same purpose but indicates that a per-driver request structure has been prepared for the request already. So for the old style driver, we do not merge such requests. As most/all blk-mq drivers will use the payload feature, and since we have no problem merging on these, make this check dependent on whether it's a blk-mq enabled driver or not. Reported-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-10-29 12:11:47 -06:00
Shaohua Li	92f399c72a	blk-mq: mq plug list breakage We switched to plug mq_list for mq, but some code are still using old list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-10-29 12:01:03 -06:00
Christoph Hellwig	3228f48be2	blk-mq: fix for flush deadlock The flush state machine takes in a struct request, which then is submitted multiple times to the underling driver. The old block code requeses the same request for each of those, so it does not have an issue with tapping into the request pool. The new one on the other hand allocates a new request for each of the actualy steps of the flush sequence. If have already allocated all of the tags for IO, we will fail allocating the flush request. Set aside a reserved request just for flushes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-10-28 13:33:58 -06:00
Christoph Hellwig	280d45f6c3	blk-mq: add blk_mq_stop_hw_queues Add a helper to iterate over all hw queues and stop them. This is useful for driver that implement PM suspend functionality. Signed-off-by: Christoph Hellwig <hch@lst.de> Modified to just call blk_mq_stop_hw_queue() by Jens. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-10-25 14:45:58 +01:00
Jens Axboe	320ae51fee	blk-mq: new multi-queue block IO queueing mechanism Linux currently has two models for block devices: - The classic request_fn based approach, where drivers use struct request units for IO. The block layer provides various helper functionalities to let drivers share code, things like tag management, timeout handling, queueing, etc. - The "stacked" approach, where a driver squeezes in between the block layer and IO submitter. Since this bypasses the IO stack, driver generally have to manage everything themselves. With drivers being written for new high IOPS devices, the classic request_fn based driver doesn't work well enough. The design dates back to when both SMP and high IOPS was rare. It has problems with scaling to bigger machines, and runs into scaling issues even on smaller machines when you have IOPS in the hundreds of thousands per device. The stacked approach is then most often selected as the model for the driver. But this means that everybody has to re-invent everything, and along with that we get all the problems again that the shared approach solved. This commit introduces blk-mq, block multi queue support. The design is centered around per-cpu queues for queueing IO, which then funnel down into x number of hardware submission queues. We might have a 1:1 mapping between the two, or it might be an N:M mapping. That all depends on what the hardware supports. blk-mq provides various helper functions, which include: - Scalable support for request tagging. Most devices need to be able to uniquely identify a request both in the driver and to the hardware. The tagging uses per-cpu caches for freed tags, to enable cache hot reuse. - Timeout handling without tracking request on a per-device basis. Basically the driver should be able to get a notification, if a request happens to fail. - Optional support for non 1:1 mappings between issue and submission queues. blk-mq can redirect IO completions to the desired location. - Support for per-request payloads. Drivers almost always need to associate a request structure with some driver private command structure. Drivers can tell blk-mq this at init time, and then any request handed to the driver will have the required size of memory associated with it. - Support for merging of IO, and plugging. The stacked model gets neither of these. Even for high IOPS devices, merging sequential IO reduces per-command overhead and thus increases bandwidth. For now, this is provided as a potential 3rd queueing model, with the hope being that, as it matures, it can replace both the classic and stacked model. That would get us back to having just 1 real model for block devices, leaving the stacked approach to dm/md devices (as it was originally intended). Contributions in this patch from the following people: Shaohua Li <shli@fusionio.com> Alexander Gordeev <agordeev@redhat.com> Christoph Hellwig <hch@infradead.org> Mike Christie <michaelc@cs.wisc.edu> Matias Bjorling <m@bjorling.me> Jeff Moyer <jmoyer@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-10-25 11:56:00 +01:00
Christoph Hellwig	71fe07d040	block: remove request ref_count This reference count has been around since before git history, but the only place where it's used is in blk_execute_rq, and ther it is entirely useless as it is incremented before submitting the request and decremented in the end_io handler before waking up the submitter thread. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-10-25 11:55:59 +01:00
Jens Axboe	5953316dbf	block: make rq->cmd_flags be 64-bit We have officially run out of flags in a 32-bit space. Extend it to 64-bit even on 32-bit archs. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-10-25 11:55:59 +01:00
Doug Anderson	87fc0ad2ad	block/partitions/efi.c: treat size mismatch as a warning, not an error In commit `27a7c64217` ("partitions/efi: account for pmbr size in lba") we started treating bad sizes in lba field of the partition that has the 0xEE (GPT protective) as errors. However, we may run into these "bad sizes" in the real world if someone uses dd to copy an image from a smaller disk to a bigger disk. Since this case used to work (even without using force_gpt), keep it working and treat the size mismatch as a warning instead of an error. Reported-by: Josh Triplett <josh@joshtriplett.org> Reported-by: Sean Paul <seanpaul@chromium.org> Signed-off-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Tested-by: Artem Bityutskiy <dedekind1@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-10-16 21:35:53 -07:00
Paul Gortmaker	080506ad0a	block: change config option name for cmdline partition parsing Recently commit `bab55417b1` ("block: support embedded device command line partition") introduced CONFIG_CMDLINE_PARSER. However, that name is too generic and sounds like it enables/disables generic kernel boot arg processing, when it really is block specific. Before this option becomes a part of a full/final release, add the BLK_ prefix to it so that it is clear in absence of any other context that it is block specific. In addition, fix up the following less critical items: - help text was not really at all helpful. - index file for Documentation was not updated - add the new arg to Documentation/kernel-parameters.txt - clarify wording in source comments Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Cai Zhiyong <caizhiyong@huawei.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-30 14:31:02 -07:00
Linus Torvalds	68cf8d0c72	Merge branch 'for-3.12/core' of git://git.kernel.dk/linux-block Pull block IO fixes from Jens Axboe: "After merge window, no new stuff this time only a collection of neatly confined and simple fixes" * 'for-3.12/core' of git://git.kernel.dk/linux-block: cfq: explicitly use 64bit divide operation for 64bit arguments block: Add nr_bios to block_rq_remap tracepoint If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up. bio-integrity: Fix use of bs->bio_integrity_pool after free blkcg: relocate root_blkg setting and clearing block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) block: trace all devices plug operation	2013-09-22 15:00:11 -07:00
Anatol Pomozov	f3cff25f05	cfq: explicitly use 64bit divide operation for 64bit arguments 'samples' is 64bit operant, but do_div() second parameter is 32. do_div silently truncates high 32 bits and calculated result is invalid. In case if low 32bit of 'samples' are zeros then do_div() produces kernel crash. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-22 12:43:47 -06:00
Mike Christie	7652113c2f	If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up. This patch has blk_execute_rq_nowait use __blk_end_request_all to free bios and also call rq->end_io. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-18 08:33:55 -06:00
Davidlohr Bueso	6b02fa59a7	partitions/efi: loosen check fot pmbr size in lba Matt found that commit `27a7c64217` ("partitions/efi: account for pmbr size in lba") caused his GPT formatted eMMC device not to boot. The reason is that this commit enforced Linux to always check the lesser of the whole disk or 2Tib for the pMBR size in LBA. While most disk partitioning tools out there create a pMBR with these characteristics, Microsoft does not, as it always sets the entry to the maximum 32-bit limitation - even though a drive may be smaller than that[1]. Loosen this check and only verify that the size is either the whole disk or 0xFFFFFFFF. No tool in its right mind would set it to any value other than these. [1] http://thestarman.pcministry.com/asm/mbr/GPT.htm#GPTPT Reported-and-tested-by: Matt Porter <matt.porter@linaro.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-15 07:10:16 -04:00
Jan Kara	5e4c0d9741	lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:36 -07:00
Andrew Morton	b4bc4a18a2	block/partitions/efi.c: consistently use pr_foo() Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Karel Zak <kzak@redhat.com> Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:19 -07:00
Davidlohr Bueso	70f637e90e	partitions/efi: some style cleanups Trivial coding style cleanups - still plenty left. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:19 -07:00
Davidlohr Bueso	08009b30a7	partitions/efi: delete annoying emacs style comments I love emacs, but these settings for coding style are annoying when trying to open the efi.h file. More important, we already have checkpatch for that. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:18 -07:00
Davidlohr Bueso	aa054bc937	partitions/efi: compare first and last usable LBAs When verifying GPT header integrity, make sure that first usable LBA is smaller than last usable LBA. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:18 -07:00
Davidlohr Bueso	27a7c64217	partitions/efi: account for pmbr size in lba The partition that has the 0xEE (GPT protective), must have the size in lba field set to the lesser of the size of the disk minus one or 0xFFFFFFFF for larger disks. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:17 -07:00
Davidlohr Bueso	b05ebbbbeb	partitions/efi: detect hybrid MBRs One of the biggest problems with GPT is compatibility with older, non-GPT systems. The problem is addressed by creating hybrid mbrs, an extension, or variant, of the traditional protective mbr. This contains, apart from the 0xEE partition, up three additional primary partitions that point to the same space marked by up to three GPT partitions. The result is that legacy OSs can see the three required MBR partitions and at the same time ignore the GPT-aware partitions that protect the GPT structures. While hybrid MBRs are hacks, workarounds and simply not part of the GPT standard, they do exist and we have no way around them. For instance, by default, OSX creates a hybrid scheme when using multi-OS booting. In order for Linux to properly discover protective MBRs, it must be made aware of devices that have hybrid MBRs. No functionality is changed by this patch, just a debug message informing the user of the MBR scheme that is being used. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:16 -07:00
Davidlohr Bueso	3e69ac3440	partitions/efi: do not require gpt partition to begin at sector 1 When detecting a valid protective MBR, the Linux kernel isn't picky about the partition (1-4) the 0xEE is at, but, unlike other operating systems, it does require it to begin at the second sector (sector 1). This check, apart from it not being enforced by UEFI, and causing Linux to potentially fail to detect any valid partitions on the disk, can present problems when dealing with hybrid MBRs[1]. For compatibility reasons, if the first partition is hybridized, the 0xEE partition must be small enough to ensure that it only protects the GPT data structures - as opposed to the the whole disk in a protective MBR. This problem is very well described by Rod Smith[1]: where MBR-only partitioning programs (such as older versions of fdisk) can see some of the disk space as unallocated, thus loosing the purpose of the 0xEE partition's protection of GPT data structures. By dropping this check, this patch enables Linux to be more flexible when probing for GPT disklabels. [1] http://www.rodsbooks.com/gdisk/hybrid.html#reactions Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:16 -07:00
Davidlohr Bueso	33afd7a7df	partitions/efi: check pmbr record's starting lba Per the UEFI Specs 2.4, June 2013, the starting lba of the partition that has the EFI GPT (0xEE) must be set to 0x00000001 - this is obviously the LBA of the GPT Partition Header. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:15 -07:00
Davidlohr Bueso	c2ebdc2439	partitions/efi: use lba-aware partition records The kernel's GPT implementation currently uses the generic 'struct partition' type for dealing with legacy MBR partition records. While this is is useful for disklabels that we designed for CHS addressing, such as msdos, it doesn't adapt well to newer standards that use LBA instead, such as GUID partition tables. Furthermore, these generic partition structures do not have all the required fields to properly follow the UEFI specs. While a CHS address can be translated to LBA, it's much simpler and cleaner to just replace the partition type. This patch adds a new 'gpt_record' type that is fully compliant with EFI and will allow, in the next patches, to add more checks to properly verify a protective MBR, which is paramount to probing a device that makes use of GPT. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:59:15 -07:00
Mathieu Desnoyers	3ddc5b46a8	kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user() I found the following pattern that leads in to interesting findings: grep -r "ret.\|=.__put_user" * grep -r "ret.\|=.__get_user" * grep -r "ret.\|=.__copy" * The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat, since those appear in compat code, we could probably expect the kernel addresses not to be reachable in the lower 32-bit range, so I think they might not be exploitable. For the "__get_user" cases, I don't think those are exploitable: the worse that can happen is that the kernel will copy kernel memory into in-kernel buffers, and will fail immediately afterward. The alpha csum_partial_copy_from_user() seems to be missing the access_ok() check entirely. The fix is inspired from x86. This could lead to information leak on alpha. I also noticed that many architectures map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I wonder if the latter is performing the access checks on every architectures. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:58:18 -07:00
Cai Zhiyong	bab55417b1	block: support embedded device command line partition Read block device partition table from command line. The partition used for fixed block device (eMMC) embedded device. It is no MBR, save storage space. Bootloader can be easily accessed by absolute address of data on the block device. Users can easily change the partition. This code reference MTD partition, source "drivers/mtd/cmdlinepart.c" About the partition verbose reference "Documentation/block/cmdline-partition.txt" [akpm@linux-foundation.org: fix printk text] [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()] Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com> Cc: Karel Zak <kzak@redhat.com> Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com> Cc: Marius Groeger <mag@sysgo.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:56:57 -07:00
Jingoo Han	ed751e683c	block/blk-sysfs.c: replace strict_strtoul() with kstrtoul() The usage of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:56:56 -07:00
Tejun Heo	577cee1e8d	blkcg: relocate root_blkg setting and clearing Hello, Jens. The original thread can be read from http://thread.gmane.org/gmane.linux.kernel.cgroups/8937 While it leads to oops, given that it only triggers under specific configurations which aren't common. I don't think it's necessary to backport it through -stable and merging it during the coming merge window should be enough. Thanks! ----- 8< ----- Currently, q->root_blkg and q->root_rl.blkg are set from blkcg_activate_policy() and cleared from blkg_destroy_all(). This doesn't necessarily coincide with the lifetime of the root blkcg_gq leading to the following oops when blkcg is enabled but no policy is activated because __blk_queue_next_rl() malfunctions expecting the root_blkg pointers to be set. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90 PGD 60f7a9067 PUD 60f4c9067 PMD 0 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC gsmi: Log Shutdown Reason 0x03 Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19 Hardware name: Intel XXX task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000 RIP: 0010:[<ffffffff810c58cb>] [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90 RSP: 0000:ffff88060d171818 EFLAGS: 00010096 RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60 RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003 FS: 00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0 Stack: 0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60 0000000000000082 0000000000000003 0000000000000000 0000000000000000 ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8 Call Trace: [<ffffffff810c7848>] __wake_up+0x48/0x70 [<ffffffff8131da53>] __blk_drain_queue+0x123/0x190 [<ffffffff8131dbb5>] blk_cleanup_queue+0xf5/0x210 [<ffffffff8141877a>] __scsi_remove_device+0x5a/0xd0 [<ffffffff81418824>] scsi_remove_device+0x34/0x50 [<ffffffff814189cb>] scsi_remove_target+0x16b/0x220 [<ffffffff814210f1>] __iscsi_unbind_session+0xd1/0x1b0 [<ffffffff814212b2>] iscsi_remove_session+0xe2/0x1c0 [<ffffffff814213a6>] iscsi_destroy_session+0x16/0x60 [<ffffffff81423a59>] iscsi_session_teardown+0xd9/0x100 [<ffffffff8142b75a>] iscsi_sw_tcp_session_destroy+0x5a/0xb0 [<ffffffff81420948>] iscsi_if_rx+0x10e8/0x1560 [<ffffffff81573335>] netlink_unicast+0x145/0x200 [<ffffffff815736f3>] netlink_sendmsg+0x303/0x410 [<ffffffff81528196>] sock_sendmsg+0xa6/0xd0 [<ffffffff815294bc>] ___sys_sendmsg+0x38c/0x3a0 [<ffffffff811ea840>] ? fget_light+0x40/0x160 [<ffffffff811ea899>] ? fget_light+0x99/0x160 [<ffffffff811ea840>] ? fget_light+0x40/0x160 [<ffffffff8152bc79>] __sys_sendmsg+0x49/0x90 [<ffffffff8152bcd2>] SyS_sendmsg+0x12/0x20 [<ffffffff815fb642>] system_call_fastpath+0x16/0x1b Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 <4c> 8b 2a 48 8d 42 e8 49 Fix it by moving r->root_blkg and q->root_rl.blkg setting to blkg_create() and clearing to blkg_destroy() so that they area initialized when a root blkg is created and cleared when destroyed. Reported-and-tested-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-11 13:23:09 -06:00
Joe Perches	c1b511eb21	block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) Use the helper function instead of __GFP_ZERO. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-11 13:22:03 -06:00
Jianpeng Ma	7aef2e780b	block: trace all devices plug operation In func blk_queue_bio, if list of plug is empty,it will call blk_trace_plug. If process deal with a single device,it't ok.But if process deal with multi devices,it only trace the first device. Using request_count to judge, it can soleve this problem. In addition, i modify the comment. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-11 13:21:07 -06:00
Linus Torvalds	32dad03d16	Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...	2013-09-03 18:25:03 -07:00
Hannes Reinecke	7e782af576	[SCSI] Return ENODATA on medium error When a medium error is detected the SCSI stack should return ENODATA to the upper layers. [jejb: fix whitespace error] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: James Bottomley <JBottomley@Parallels.com>	2013-08-23 12:54:53 -04:00
Hannes Reinecke	a9d6ceb838	[SCSI] return ENOSPC on thin provisioning failure When the thin provisioning hard threshold is reached we should return ENOSPC to inform upper layers about this fact. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: James Bottomley <JBottomley@Parallels.com>	2013-08-23 12:43:54 -04:00
Tejun Heo	bd8815a6d8	cgroup: make css_for_each_descendant() and friends include the origin css in the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>	2013-08-08 20:11:27 -04:00
Tejun Heo	d99c8727e7	cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroup cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. cgroup_taskset which is used by the subsystem attach methods is the last cgroup subsystem API which isn't using css as the handle. Update cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp. The conversions are pretty mechanical. One exception is cpuset::cgroup_cs(), which lost its last user and got removed. This patch shouldn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org>	2013-08-08 20:11:27 -04:00
Tejun Heo	492eb21b98	cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk>	2013-08-08 20:11:25 -04:00
Tejun Heo	182446d087	cgroup: pass around cgroup_subsys_state instead of cgroup in file methods cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>	2013-08-08 20:11:24 -04:00
Tejun Heo	2bb566cb68	cgroup: add subsys backlink pointer to cftype cgroup is transitioning to using css (cgroup_subsys_state) instead of cgroup as the primary subsystem handle. The cgroupfs file interface will be converted to use css's which requires finding out the subsystem from cftype so that the matching css can be determined from the cgroup. This patch adds cftype->ss which points to the subsystem the file belongs to. The field is initialized while a cftype is being registered. This makes it unnecessary to explicitly specify the subsystem for other cftype handling functions. @ss argument dropped from various cftype handling functions. This patch shouldn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk>	2013-08-08 20:11:23 -04:00
Tejun Heo	eb95419b02	cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup * in subsystem implementations for the following reasons. * With unified hierarchy, subsystems will be dynamically bound and unbound from cgroups and thus css's (cgroup_subsys_state) may be created and destroyed dynamically over the lifetime of a cgroup, which is different from the current state where all css's are allocated and destroyed together with the associated cgroup. This in turn means that cgroup_css() should be synchronized and may return NULL, making it more cumbersome to use. * Differing levels of per-subsystem granularity in the unified hierarchy means that the task and descendant iterators should behave differently depending on the specific subsystem the iteration is being performed for. * In majority of the cases, subsystems only care about its part in the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods often obtain the matching css pointer from the cgroup and don't bother with the cgroup pointer itself. Passing around css fits much better. This patch converts all cgroup_subsys methods to take @css instead of @cgroup. The conversions are mostly straight-forward. A few noteworthy changes are * ->css_alloc() now takes css of the parent cgroup rather than the pointer to the new cgroup as the css for the new cgroup doesn't exist yet. Knowing the parent css is enough for all the existing subsystems. * In kernel/cgroup.c::offline_css(), unnecessary open coded css dereference is replaced with local variable access. This patch shouldn't cause any behavior differences. v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced with local variable @css as suggested by Li Zefan. Rebased on top of new for-3.12 which includes for-3.11-fixes so that ->css_free() invocation added by `da0a12caff` ("cgroup: fix a leak when percpu_ref_init() fails") is converted too. Suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>	2013-08-08 20:11:23 -04:00
Tejun Heo	6387698699	cgroup: add css_parent() Currently, controllers have to explicitly follow the cgroup hierarchy to find the parent of a given css. cgroup is moving towards using cgroup_subsys_state as the main controller interface construct, so let's provide a way to climb the hierarchy using just csses. This patch implements css_parent() which, given a css, returns its parent. The function is guarnateed to valid non-NULL parent css as long as the target css is not at the top of the hierarchy. freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices are converted to use css_parent() instead of accessing cgroup->parent directly. * __parent_ca() is dropped from cpuacct and its usage is replaced with parent_ca(). The only difference between the two was NULL test on cgroup->parent which is now embedded in css_parent() making the distinction moot. Note that eventually a css->parent field will be added to css and the NULL check in css_parent() will go away. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-08-08 20:11:23 -04:00
Tejun Heo	a7c6d554aa	cgroup: add/update accessors which obtain subsys specific data from css css (cgroup_subsys_state) is usually embedded in a subsys specific data structure. Subsystems either use container_of() directly to cast from css to such data structure or has an accessor function wrapping such cast. As cgroup as whole is moving towards using css as the main interface handle, add and update such accessors to ease dealing with css's. All accessors explicitly handle NULL input and return NULL in those cases. While this looks like an extra branch in the code, as all controllers specific data structures have css as the first field, the casting doesn't involve any offsetting and the compiler can trivially optimize out the branch. * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such accessor. Added. * memory, hugetlb and devices already had one but didn't explicitly handle NULL input. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-08-08 20:11:23 -04:00
Tejun Heo	8af01f56a0	cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-08-08 20:11:22 -04:00
Paul Gortmaker	0b776b0628	block: delete __cpuinit usage from all block files The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit `5e427ec2d0` ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the drivers/block uses of the __cpuinit macros from all C files. [1] https://lkml.org/lkml/2013/5/20/589 Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2013-07-14 19:36:59 -04:00
Linus Torvalds	36805aaea5	Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-block Pull core block IO updates from Jens Axboe: "Here are the core IO block bits for 3.11. It contains: - A tweak to the reserved tag logic from Jan, for weirdo devices with just 3 free tags. But for those it improves things substantially for random writes. - Periodic writeback fix from Jan. Marked for stable as well. - Fix for a race condition in IO scheduler switching from Jianpeng. - The hierarchical blk-cgroup support from Tejun. This is the grunt of the series. - blk-throttle fix from Vivek. Just a note that I'm in the middle of a relocation, whole family is flying out tomorrow. Hence I will be awal the remainder of this week, but back at work again on Monday the 15th. CC'ing Tejun, since any potential "surprises" will most likely be from the blk-cgroup work. But it's been brewing for a while and sitting in my tree and linux-next for a long time, so should be solid." * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits) elevator: Fix a race in elevator switching block: Reserve only one queue tag for sync IO if only 3 tags are available writeback: Fix periodic writeback after fs mount blk-throttle: implement proper hierarchy support blk-throttle: implement throtl_grp->has_rules[] blk-throttle: Account for child group's start time in parent while bio climbs up blk-throttle: add throtl_qnode for dispatch fairness blk-throttle: make throtl_pending_timer_fn() ready for hierarchy blk-throttle: make tg_dispatch_one_bio() ready for hierarchy blk-throttle: make blk_throtl_bio() ready for hierarchy blk-throttle: make blk_throtl_drain() ready for hierarchy blk-throttle: dispatch from throtl_pending_timer_fn() blk-throttle: implement dispatch looping blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() blk-throttle: add throtl_service_queue->parent_sq blk-throttle: generalize update_disptime optimization in blk_throtl_bio() blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] blk-throttle: move bio_lists[] and friends to throtl_service_queue ...	2013-07-11 13:03:24 -07:00
Philippe De Muyter	f8f066033b	partitions/msdos: enumerate also AIX LVM partitions Graft AIX partitions enumeration into partitions/msdos.c There is already a AIX disks detection logic in msdos.c. When an AIX disk has been found, and if configured to, call the aix partitions recognizer. This avoids removal of AIX disks protection from msdos.c, avoids code duplication, and ensures that AIX partitions enumeration is called before plain msdos partitions enumeration. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Karel Zak <kzak@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-07-09 10:33:28 -07:00
Philippe De Muyter	6ceea22bbb	partitions: add aix lvm partition support files Add partitions/aix.h and partitions/aix.c. AIX LVM permits to make "logical volumes" which are made of multiple slices of multiple disks. The new code allows only access to the "logical volumes" which are made of one slice on the probed disk, a slice being a contiguous disk area. The code also detects "logical volumes" made of multiple slices on the probed disk, but can not describe them to the partition layer, because the partition layer generic code does not support that. When such non-contiguous "logical volumes" are detected, a diagnostic message is printed. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Karel Zak <kzak@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-07-09 10:33:28 -07:00
Philippe De Muyter	1d04f3c6ab	partitions/msdos.c: end-of-line whitespace and semicolon cleanup Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Karel Zak <kzak@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-07-09 10:33:28 -07:00
Linus Torvalds	7f0ef0267e	Merge branch 'akpm' (updates from Andrew Morton) Merge first patch-bomb from Andrew Morton: - various misc bits - I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been distracted. There has been quite a bit of activity. - About half the MM queue - Some backlight bits - Various lib/ updates - checkpatch updates - zillions more little rtc patches - ptrace - signals - exec - procfs - rapidio - nbd - aoe - pps - memstick - tools/testing/selftests updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits) tools/testing/selftests: don't assume the x bit is set on scripts selftests: add .gitignore for kcmp selftests: fix clean target in kcmp Makefile selftests: add .gitignore for vm selftests: add hugetlbfstest self-test: fix make clean selftests: exit 1 on failure kernel/resource.c: remove the unneeded assignment in function __find_resource aio: fix wrong comment in aio_complete() drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode drivers/memstick/host/r592.c: convert to module_pci_driver drivers/memstick/host/jmb38x_ms: convert to module_pci_driver pps-gpio: add device-tree binding and support drivers/pps/clients/pps-gpio.c: convert to module_platform_driver drivers/pps/clients/pps-gpio.c: convert to devm_* helpers drivers/parport/share.c: use kzalloc Documentation/accounting/getdelays.c: avoid strncpy in accounting tool aoe: update internal version number to v83 aoe: update copyright date aoe: perform I/O completions in parallel ...	2013-07-03 17:12:13 -07:00
Kees Cook	ffc8b30866	block: do not pass disk names as format strings Disk names may contain arbitrary strings, so they must not be interpreted as format strings. It seems that only md allows arbitrary strings to be used for disk names, but this could allow for a local memory corruption from uid 0 into ring 0. CVE-2013-2851 Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-07-03 16:07:25 -07:00
Cong Wang	8b0d77f131	block/compat_ioctl.c: do not leak info to user-space There is a hole in struct hd_geometry, so we have to zero the struct on stack before copying it to user-space. Signed-off-by: Cong Wang <amwang@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-07-03 16:07:25 -07:00
Linus Torvalds	c1101cbc7d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "This is the bulk of the s390 patches for the 3.11 merge window. Notable enhancements are: the block timeout patches for dasd from Hannes, and more work on the PCI support front. In addition some cleanup and the usual bug fixing." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits) s390/dasd: Fail all requests when DASD_FLAG_ABORTIO is set s390/dasd: Add 'timeout' attribute block: check for timeout function in blk_rq_timed_out() block/dasd: detailed I/O errors s390/dasd: Reduce amount of messages for specific errors s390/dasd: Implement block timeout handling s390/dasd: process all requests in the device tasklet s390/dasd: make number of retries configurable s390/dasd: Clarify comment s390/hwsampler: Updated misleading member names in hws_data_entry s390/appldata_net_sum: do not use static data s390/appldata_mem: do not use static data s390/vmwatchdog: do not use static data s390/airq: simplify adapter interrupt code s390/pci: remove per device debug attribute s390/dma: remove gratuitous brackets s390/facility: decompose test_facility() s390/sclp: remove duplicated include from sclp_ctl.c s390/irq: store interrupt information in pt_regs s390/drivers: Cocci spatch "ptr_ret.spatch" ...	2013-07-03 11:08:24 -07:00
Jianpeng Ma	d50235b7bc	elevator: Fix a race in elevator switching There's a race between elevator switching and normal io operation. Because the allocation of struct elevator_queue and struct elevator_data don't in a atomic operation.So there are have chance to use NULL ->elevator_data. For example: Thread A: Thread B blk_queu_bio elevator_switch spin_lock_irq(q->queue_block) elevator_alloc elv_merge elevator_init_fn Because call elevator_alloc, it can't hold queue_lock and the ->elevator_data is NULL.So at the same time, threadA call elv_merge and nedd some info of elevator_data.So the crash happened. Move the elevator_alloc into func elevator_init_fn, it make the operations in a atomic operation. Using the follow method can easy reproduce this bug 1:dd if=/dev/sdb of=/dev/null 2:while true;do echo noop > scheduler;echo deadline > scheduler;done The test method also use this method. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-07-03 13:25:24 +02:00
Linus Torvalds	f317ff9eed	Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue changes from Tejun Heo: "Surprisingly, Lai and I didn't break too many things implementing custom pools and stuff last time around and there aren't any follow-up changes necessary at this point. The only change in this pull request is Viresh's patches to make some per-cpu workqueues to behave as unbound workqueues dependent on a boot param whose default can be configured via a config option. This leads to higher processing overhead / lower bandwidth as more work items are bounced across CPUs; however, it can lead to noticeable powersave in certain configurations - ~10% w/ idlish constant workload on a big.LITTLE configuration according to Viresh. This is because per-cpu workqueues interfere with how the scheduler perceives whether or not each CPU is idle by forcing pinned tasks on them, which makes the scheduler's power-aware scheduling decisions less effective. Its effectiveness is likely less pronounced on homogenous configurations and this type of optimization can probably be made automatic; however, the changes are pretty minimal and the affected workqueues are clearly marked, so it's an easy gain for some configurations for the time being with pretty unintrusive changes." * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: fbcon: queue work on power efficient wq block: queue work on power efficient wq PHYLIB: queue work on system_power_efficient_wq workqueue: Add system wide power_efficient workqueues workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues	2013-07-02 19:53:30 -07:00
Hannes Reinecke	80bd7181b0	block: check for timeout function in blk_rq_timed_out() rq_timed_out_fn might have been unset while the request was in flight, so we need to check for it in blk_rq_timed_out(). Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2013-07-01 17:31:23 +02:00
Hannes Reinecke	d1ffc1f866	block/dasd: detailed I/O errors The DASD driver is using FASTFAIL as an equivalent to the transport errors in SCSI. And the 'steal lock' function maps roughly to a reservation error. So we should be returning the appropriate error codes when completing a request. Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2013-07-01 17:31:22 +02:00
Jan Kara	a6b3f7614c	block: Reserve only one queue tag for sync IO if only 3 tags are available In case a device has three tags available we still reserve two of them for sync IO. That leaves only a single tag for async IO such as writeback from flusher thread which results in poor performance. Allow async IO to consume two tags in case queue has three tag availabe to get a decent async write performance. This patch improves streaming write performance on a machine with such disk from ~21 MB/s to ~52 MB/s. Also postmark throughput in presence of streaming writer improves from 8 to 12 transactions per second so sync IO doesn't seem to be harmed in presence of heavy async writer. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-06-28 21:32:27 +02:00
Aaron Lu	c60855cdb9	blkpm: avoid sleep when holding queue lock In blk_post_runtime_resume, an autosuspend request will be initiated for the device. Since we are holding the queue lock, we can't sleep and thus we should use the async version to initiate an autosuspend, i.e. pm_request_suspend instead of pm_runtime_suspend, which might sleep. Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-05-17 10:00:43 +02:00
Tejun Heo	9138125bea	blk-throttle: implement proper hierarchy support With the recent updates, blk-throttle is finally ready for proper hierarchy support. Dispatching now honors service_queue->parent_sq and propagates correctly. The only thing missing is setting ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup hierarchy. This patch updates throtl_pd_init() such that service_queues form the same hierarchy as the cgroup hierarchy if sane_behavior is enabled. As this concludes proper hierarchy support for blkcg, the shameful .broken_hierarchy tag is removed from blkio_subsys. v2: Updated blkio-controller.txt as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Li Zefan <lizefan@huawei.com>	2013-05-14 13:52:38 -07:00
Tejun Heo	693e751e70	blk-throttle: implement throtl_grp->has_rules[] blk_throtl_bio() has a quick exit path for throtl_grps without limits configured. It looks at the bps and iops limits and if both are not configured, the bio is issued immediately. While this is correct in the current flat hierarchy as each throtl_grp behaves completely independently, it would become wrong in proper hierarchy mode. A group without any limits could still be limited by one of its ancestors and bio's queued for such group should not bypass blk-throtl. As having a quick bypass mechanism is beneficial, this patch reimplements the mechanism such that it's correct even with proper hierarchy. throtl_grp->has_rules[] is added. These booleans are updated for the whole subtree whenever a config is updated so that has_rules[] of the whole subtree stays synchronized. They're also updated when a new throtl_grp comes online so that it can't escape the limits of its ancestors. As no throtl_grp has another throtl_grp as parent now, this patch doesn't yet make any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:38 -07:00
Vivek Goyal	32ee5bc478	blk-throttle: Account for child group's start time in parent while bio climbs up With the planned proper hierarchy support, a bio will climb up the tree before actually being dispatched. This makes sure bio is also subjected to parent's throttling limits, if any. It might happen that parent is idle and when bio is transferred to parent, a new slice starts fresh. But that is incorrect as parents wait time should have started when bio was queued in child group and causes IOs to be throttled more than configured as they climb the hierarchy. Given the fact that we have not written hierarchical algorithm in a way where child's and parents time slices are synchronized, we transfer the child's start time to parent if parent was idling. If parent was busy doing dispatch of other bios all this while, this is not an issue. Child's slice start time is passed to parent. Parent looks at its last expired slice start time. If child's start time is after parents old start time, that means parent had been idle and after parent went idle, child had an IO queued. So use child's start time as parent start time. If parent's start time is after child's start time, that means, when IO got queued in child group, parent was not idle. But later it dispatched some IO, its slice got trimmed and then it went idle. After a while child's request got shifted in parent group. In this case use parent's old start time as new start time as that's the duration of slice we did not use. This logic is far from perfect as if there are multiple childs then first child transferring the bio decides the start time while a bio might have queued up even earlier in other child, which is yet to be transferred up to parent. In that case we will lose time and bandwidth in parent. This patch is just an approximation to make situation somewhat better. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-14 13:52:38 -07:00

... 8 9 10 11 12 ...

2883 Commits