linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-11-26 14:30:55 +07:00

Author	SHA1	Message	Date
Keith Busch	e6e96d73a2	NVMe: Initialize device list head before starting Driver recovery requires the device's list node to have been initialized. Fixes: https://lkml.org/lkml/2015/3/22/262 Reported-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-03-23 09:35:12 -06:00
David Gibson	f571872671	powerpc: Move Power Macintosh drivers to generic byteswappers ppc has special instruction forms to efficiently load and store values in non-native endianness. These can be accessed via the arch-specific {ld,st}_le{16,32}() inlines in arch/powerpc/include/asm/swab.h. However, gcc is perfectly capable of generating the byte-reversing load/store instructions when using the normal, generic cpu_to_le() and le_to_cpu() functions eaning the arch-specific functions don't have much point. Worse the "le" in the names of the arch specific functions is now misleading, because they always generate byte-reversing forms, but some ppc machines can now run a little-endian kernel. To start getting rid of the arch-specific forms, this patch removes them from all the old Power Macintosh drivers, replacing them with the generic byteswappers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2015-03-23 14:29:40 +11:00
Valentin Rothberg	d8bf368d06	genirq: Remove the deprecated 'IRQF_DISABLED' request_irq() flag entirely The IRQF_DISABLED flag is a NOOP and has been scheduled for removal since Linux v2.6.36 by commit `6932bf37be` ("genirq: Remove IRQF_DISABLED from core code"). According to commit `e58aa3d2d0` ("genirq: Run irq handlers with interrupts disabled"), running IRQ handlers with interrupts enabled can cause stack overflows when the interrupt line of the issuing device is still active. This patch ends the grace period for IRQF_DISABLED (i.e., SA_INTERRUPT in older versions of Linux) and removes the definition and all remaining usages of this flag. There's still a few non-functional references left in the kernel source: - The bigger hunk in Documentation/scsi/ncr53c8xx.txt is removed entirely as IRQF_DISABLED is gone now; the usage in older kernel versions (including the old SA_INTERRUPT flag) should be discouraged. The trouble of using IRQF_SHARED is a general problem and not specific to any driver. - I left the reference in Documentation/PCI/MSI-HOWTO.txt untouched since it has already been removed in linux-next. - All remaining references are changelogs that I suggest to keep. Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com> Cc: Afzal Mohammed <afzal@ti.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Ewan Milne <emilne@redhat.com> Cc: Eyal Perry <eyalpe@mellanox.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Hongliang Tao <taohl@lemote.com> Cc: Huacai Chen <chenhc@lemote.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Keerthy <j-keerthy@ti.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nishanth Menon <nm@ti.com> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: Peter Ujfalusi <peter.ujfalusi@ti.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Lambert <lambert.quentin@gmail.com> Cc: Rajendra Nayak <rnayak@ti.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Sricharan R <r.sricharan@ti.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Zhou Wang <wangzhou1@hisilicon.com> Cc: iss_storagedev@hp.com Cc: linux-mips@linux-mips.org Cc: linux-mtd@lists.infradead.org Link: http://lkml.kernel.org/r/1425565425-12604-1-git-send-email-valentinrothberg@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2015-03-05 20:53:06 +01:00
Jens Axboe	b8be79b714	NBD fixes based on v4.0-rc1 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJU+As6AAoJEEpcgKtcEGQQr5UQAJeNbwo3t45u+ftmL+rzznqr ospLlVirYqKcB3IZk8WUZ+xWtP2EyzJ6qiTPdYRC2lvgQMDiXTpqYkbi4lNCUr0V CBMVSWuOwlZOwlD0Cxr5L6m2A0X+M8aGI1OUISYZuogLaHxjgPRQiEs5vccQKcoF 1ZHBg0Lguk+alfTLxEtddDfzETx72I+CRaJyxa/g9SNk91C1R8U/+YT6g9ryrF9H 8bJ5K4p5LYJDecIZMTNTFCndNa1fFZmynDv4cvla0MOfrOwA+lSaIFdCtbSPEbUu WHjXzwDlsIKGjJY1Pt7YfND1B+khFSVS9Cgh11hQ0nOer6y3WRIxWg0Jyo7m+PUz W4J7Pd1sHr/IgPLLOOI0o21n+Glkqp/wh5O/mDEP0ZEWMpm4pdd2R+UX76Fv62QT YCKcYl/0oxbSPjLBKWZLw2hhmd/NRvU4GM7r/mHtQvOxrHEQf5HW9p4ZlQvU7GV3 kSYZwbMZo/FsiP1KCC/VSv3l+BQJMcZRlBRg0pzSJTvZmGEaBSvydLd7lsV05OsJ txmmDw7Xd14rT8Uav83i9olXuZGX8tK+QDOdQxY7SMXmEiMWre1B4pvJu10lXrmo tmZ9zPEQyWdcsuf/og8coSk3QKQlqHHmlcdtz9WZKqUHbvE0l4g8GWS9OWBdyGR5 RXpfcbwvd396f9JXxxob =ttVz -----END PGP SIGNATURE----- Merge tag 'nbd_fixes_20150305' of git://git.pengutronix.de/git/mpa/linux-nbd into for-linus NBD fixes based on v4.0-rc1	2015-03-05 07:46:13 -07:00
Sudip Mukherjee	ff6b8090e2	nbd: fix possible memory leak we have already allocated memory for nbd_dev, but we were not releasing that memory and just returning the error value. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Acked-by: Paul Clements <Paul.Clements@SteelEye.com> Cc: <stable@vger.kernel.org> Signed-off-by: Markus Pargmann <mpa@pengutronix.de>	2015-03-05 08:51:03 +01:00
Linus Torvalds	a015d33c98	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "Two smaller fixes for this cycle: - A fixup from Keith so that NVMe compiles without BLK_INTEGRITY, basically just moving the code around appropriately. - A fixup for shm, fixing an oops in shmem_mapping() for mapping with no inode. From Sasha" [ The shmem fix doesn't look block-layer-related, but fixes a bug that happened due to the backing_dev_info removal.. - Linus ] * 'for-linus' of git://git.kernel.dk/linux-block: mm: shmem: check for mapping owner before dereferencing NVMe: Fix for BLK_DEV_INTEGRITY not set	2015-02-28 10:21:57 -08:00
Joonsoo Kim	2ea55a2cae	zram: use proper type to update max_used_pages max_used_pages is defined as atomic_long_t so we need to use unsigned long to keep temporary value for it rather than int which is smaller than unsigned long in a 64 bit system. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-28 09:57:51 -08:00
Keith Busch	52b68d7ef8	NVMe: Fix for BLK_DEV_INTEGRITY not set Need to define and use appropriate functions for when BLK_DEV_INTEGRITY is not set. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-23 09:17:54 -08:00
Jens Axboe	decf6d79de	Merge branch 'for-3.20' of git://git.infradead.org/users/kbusch/linux-nvme into for-linus Merge 3.20 NVMe changes from Keith.	2015-02-20 22:12:02 -08:00
Keith Busch	0c0f9b95c8	NVMe: Fix potential corruption on sync commands This makes all sync commands uninterruptible and schedules without timeout so the controller either has to post a completion or the timeout recovery fails the command. This fixes potential memory or data corruption from a command timing out too early or woken by a signal. Previously any DMA buffers mapped for that command would have been released even though we don't know what the controller is planning to do with those addresses. Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:38 -07:00
Keith Busch	4832851840	NVMe: Remove unused variables We don't track queues in a llist, subscribe to hot-cpu notifications, or internally retry commands. Delete the unused artifacts. Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:38 -07:00
Keith Busch	9ac16938ab	NVMe: Fix scsi mode select llbaa setting It should be a logical bitwise AND, not conditional. Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:37 -07:00
Keith Busch	07836e659c	NVMe: Fix potential corruption during shutdown The driver has to end unreturned commands at some point even if the controller has not provided a completion. The driver tried to be safe by deleting IO queues prior to ending all unreturned commands. That should cause the controller to internally abort inflight commands, but IO queue deletion request does not have to be successful, so all bets are off. We still have to make progress, so to be extra safe, this patch doesn't clear a queue to release the dma mapping for a command until after the pci device has been disabled. This patch removes the special handling during device initialization so controller recovery can be done all the time. This is possible since initialization is not inlined with pci probe anymore. Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com> Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:37 -07:00
Keith Busch	2e1d844819	NVMe: Asynchronous controller probe This performs the longest parts of nvme device probe in scheduled work. This speeds up probe significantly when multiple devices are in use. Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:36 -07:00
Keith Busch	b3fffdefab	NVMe: Register management handle under nvme class This creates a new class type for nvme devices to register their management character devices with. This is so we do not rely on miscdev to provide enough minors for as many nvme devices some people plan to use. The previous limit was approximately 60 NVMe controllers, depending on the platform and kernel. Now the limit is 1M, which ought to be enough for anybody. Since we have a new device class, it makes sense to attach the block devices under this as well, so part of this patch moves the management handle initialization prior to the namespaces discovery. Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:36 -07:00
Keith Busch	4f1982b4e2	NVMe: Update SCSI Inquiry VPD 83h translation The original translation created collisions on Inquiry VPD 83 for many existing devices. Newer specifications provide other ways to translate based on the device's version can be used to create unique identifiers. Version 1.1 provides an EUI64 field that uniquely identifies each namespace, and 1.2 added the longer NGUID field for the same reason. Both follow the IEEE EUI format and readily translate to the SCSI device identification EUI designator type 2h. For devices implementing either, the translation will use this type, defaulting to the EUI64 8-byte type if implemented then NGUID's 16 byte version if not. If neither are provided, the 1.0 translation is used, and is updated to use the SCSI String format to guarantee a unique identifier. Knowing when to use the new fields depends on the nvme controller's revision. The NVME_VS macro was not decoding this correctly, so that is fixed in this patch and moved to a more appropriate place. Since the Identify Namespace structure required an update for the NGUID field, this patch adds the remaining new 1.2 fields to the structure. Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:35 -07:00
Keith Busch	e1e5e5641e	NVMe: Metadata format support Adds support for NVMe metadata formats and exposes block devices for all namespaces regardless of their format. Namespace formats that are unusable will have disk capacity set to 0, but a handle to the block device is created to simplify device management. A namespace is not usable when the format requires host interleave block and metadata in single buffer, has no provisioned storage, or has better data but failed to register with blk integrity. The namespace has to be scanned in two phases to support separate metadata formats. The first establishes the sector size and capacity prior to invoking add_disk. If metadata is required, the capacity will be temporarilly set to 0 until it can be revalidated and registered with the integrity extenstions after add_disk completes. The driver relies on the integrity extensions to provide the metadata buffer. NVMe requires this be a single physically contiguous region, so only one integrity segment is allowed per command. If the metadata is used for T10 PI, the driver provides mappings to save and restore the reftag physical block translation. The driver provides no-op functions for generate and verify if metadata is not used for protection information. This way the setup is always provided by the block layer. If a request does not supply a required metadata buffer, the command is failed with bad address. This could only happen if a user manually disables verify/generate on such a disk. The only exception to where this is okay is if the controller is capable of stripping/generating the metadata, which is possible on some types of formats. The metadata scatter gather list now occupies the spot in the nvme_iod that used to be used to link retryable IOD's, but we don't do that anymore, so the field was unused. Signed-off-by: Keith Busch <keith.busch@intel.com>	2015-02-19 16:15:35 -07:00
Linus Torvalds	4533f6e27a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "On the RBD side, there is a conversion to blk-mq from Christoph, several long-standing bug fixes from Ilya, and some cleanup from Rickard Strandqvist. On the CephFS side there is a long list of fixes from Zheng, including improved session handling, a few IO path fixes, some dcache management correctness fixes, and several blocking while !TASK_RUNNING fixes. The core code gets a few cleanups and Chaitanya has added support for TCP_NODELAY (which has been used on the server side for ages but we somehow missed on the kernel client). There is also an update to MAINTAINERS to fix up some email addresses and reflect that Ilya and Zheng are doing most of the maintenance for RBD and CephFS these days. Do not be surprised to see a pull request come from one of them in the future if I am unavailable for some reason" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) MAINTAINERS: update Ceph and RBD maintainers libceph: kfree() in put_osd() shouldn't depend on authorizer libceph: fix double __remove_osd() problem rbd: convert to blk-mq ceph: return error for traceless reply race ceph: fix dentry leaks ceph: re-send requests when MDS enters reconnecting stage ceph: show nocephx_require_signatures and notcp_nodelay options libceph: tcp_nodelay support rbd: do not treat standalone as flatten ceph: fix atomic_open snapdir ceph: properly mark empty directory as complete client: include kernel version in client metadata ceph: provide seperate {inode,file}_operations for snapdir ceph: fix request time stamp encoding ceph: fix reading inline data when i_size > PAGE_SIZE ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) rbd: fix error paths in rbd_dev_refresh() ...	2015-02-19 14:14:42 -08:00
Christoph Hellwig	7ad18afad0	rbd: convert to blk-mq This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical. This was tested by Alexandre DERUMIER in November, and found to give him 120000 iops, although the only comparism available was an old 3.10 kernel which gave 80000iops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <elder@linaro.org> [idryomov@gmail.com: context, blk_mq_init_queue() EH] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-02-19 14:27:42 +03:00
Ilya Dryomov	cf32bd9c86	rbd: do not treat standalone as flatten If the clone is resized down to 0, it becomes standalone. If such resize is carried over while an image is mapped we would detect this and call rbd_dev_parent_put() which means "let go of all parent state, including the spec(s) of parent images(s)". This leads to a mismatch between "rbd info" and sysfs parent fields, so a fix is in order. # rbd create --image-format 2 --size 1 foo # rbd snap create foo@snap # rbd snap protect foo@snap # rbd clone foo@snap bar # DEV=$(rbd map bar) # rbd resize --allow-shrink --size 0 bar # rbd resize --size 1 bar # rbd info bar \| grep parent parent: rbd/foo@snap Before: # cat /sys/bus/rbd/devices/0/parent (no parent image) After: # cat /sys/bus/rbd/devices/0/parent pool_id 0 pool_name rbd image_id 10056b8b4567 image_name foo snap_id 2 snap_name snap overlap 0 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-02-19 13:31:39 +03:00
Ilya Dryomov	73e39e4dba	rbd: fix error paths in rbd_dev_refresh() header_rwsem should be released on errors. Also remove useless rbd_dev->mapping.size != rbd_dev->header.image_size test. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:38 +03:00
Rickard Strandqvist	3a25cf43e0	rbd: nuke copy_token() It's been largely superseded by dup_token() and unused for over 2 years, identified by cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> [idryomov@redhat.com: changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:38 +03:00
Linus Torvalds	53861af9a1	OK, this has the big virtio 1.0 implementation, as specified by OASIS. On top of tht is the major rework of lguest, to use PCI and virtio 1.0, to double-check the implementation. Then comes the inevitable fixes and cleanups from that work. Thanks, Rusty. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJU5B9cAAoJENkgDmzRrbjxPacP/jajliXX353JJ/g/hkZ6oDN5 o7FhELBKiUMr7enVZYwj2BBYk5OM36nB9pQkiqHMSbjJGoS5IK70enxb4YRxSHBn YCLblZMNqutGS0kclZ9DDysztjAhxH7CvLM6pMZ7eHP0f3+FM/QhbxHfbG9DTBUH 2U/nybvd3M/+YBe7ptwQdrH8aOCAD6RTIsXellfm99dNMK6K/5lqnWQ98WSXmNXq vyvdaAQsqqUkmxtajjcBumaCH4/SehOJJjUqojCMsR3aBkgOBWDZJURMek+KA5Dt X996fBsTAlvTtCUKRrmLTb2ScDH7fu+jwbWRqMYDk8zpEr3XqiLTTPV4/TiHGmi7 Wiw3g1wIY1YbETlZyongB5MIoVyUfmDAd+bT8nBsj3KIITD84gOUQFDMl6d63c0I z6A9Pu/UzpJGsXZT3WoFLi6TO67QyhOseqZnhS4wBgLabjxffNM7yov9RVKUVH/n JHunnpUk2iTtSgscBarOBz5867dstuurnaUIspZthVBo6y6N0z+GrU+agJ8Y4DXx mvwzeYLhQH2208PjxPFiah/kA/gHNm1m678TbpS+CUsgmpQiJ4gTwtazDSi4TwZY Hs9T9GulkzpZIzEyKL3qG2TsfyDhW5Avn+GvKInAT9+Fkig4BnP3DUONBxcwGZ78 eI3FDUWsE36NqE5ECWmz =ivCe -----END PGP SIGNATURE----- Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull virtio updates from Rusty Russell: "OK, this has the big virtio 1.0 implementation, as specified by OASIS. On top of tht is the major rework of lguest, to use PCI and virtio 1.0, to double-check the implementation. Then comes the inevitable fixes and cleanups from that work" * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits) virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice. virtio_net: unconditionally define struct virtio_net_hdr_v1. tools/lguest: don't use legacy definitions for net device in example launcher. virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined. tools/lguest: use common error macros in the example launcher. tools/lguest: give virtqueues names for better error messages tools/lguest: more documentation and checking of virtio 1.0 compliance. lguest: don't look in console features to find emerg_wr. tools/lguest: don't start devices until DRIVER_OK status set. tools/lguest: handle indirect partway through chain. tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI) tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI) tools/lguest: rename virtio_pci_cfg_cap field to match spec. tools/lguest: fix features_accepted logic in example launcher. tools/lguest: handle device reset correctly in example launcher. virtual: Documentation: simplify and generalize paravirt_ops.txt lguest: remove NOTIFY call and eventfd facility. lguest: remove NOTIFY facility from demonstration launcher. lguest: use the PCI console device's emerg_wr for early boot messages. lguest: always put console in PCI slot #1. ...	2015-02-18 09:24:01 -08:00
Matthew Wilcox	a7a97fc9ff	brd: rename XIP to DAX Since this is relating to FS_XIP, not KERNEL_XIP, it should be called DAX instead of XIP. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-16 17:56:04 -08:00
Linus Torvalds	818099574b	Merge branch 'akpm' (patches from Andrew) Merge third set of updates from Andrew Morton: - the rest of MM [ This includes getting rid of the numa hinting bits, in favor of just generic protnone logic. Yay. - Linus ] - core kernel - procfs - some of lib/ (lots of lib/ material this time) * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (104 commits) lib/lcm.c: replace include lib/percpu_ida.c: remove redundant includes lib/strncpy_from_user.c: replace module.h include lib/stmp_device.c: replace module.h include lib/sort.c: move include inside #if 0 lib/show_mem.c: remove redundant include lib/radix-tree.c: change to simpler include lib/plist.c: remove redundant include lib/nlattr.c: remove redundant include lib/kobject_uevent.c: remove redundant include lib/llist.c: remove redundant include lib/md5.c: simplify include lib/list_sort.c: rearrange includes lib/genalloc.c: remove redundant include lib/idr.c: remove redundant include lib/halfmd4.c: simplify includes lib/dynamic_queue_limits.c: simplify includes lib/sort.c: use simpler includes lib/interval_tree.c: simplify includes hexdump: make it return number of bytes placed in buffer ...	2015-02-12 18:54:28 -08:00
Rasmus Villemoes	02f1f2170d	kernel.h: remove ancient __FUNCTION__ hack __FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so this only helps people who only test-compile using 3.3 (compiler-gcc3.h barks at anything older than that). Besides, there are almost no occurrences of __FUNCTION__ left in the tree. [akpm@linux-foundation.org: convert remaining __FUNCTION__ references] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:13 -08:00
Ganesh Mahendran	3eba0c6a56	mm/zpool: add name argument to create zpool Currently the underlay of zpool: zsmalloc/zbud, do not know who creates them. There is not a method to let zsmalloc/zbud find which caller they belong to. Now we want to add statistics collection in zsmalloc. We need to name the debugfs dir for each pool created. The way suggested by Minchan Kim is to use a name passed by caller(such as zram) to create the zsmalloc pool. /sys/kernel/debug/zsmalloc/zram0 This patch adds an argument `name' to zs_create_pool() and other related functions. Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:12 -08:00
Sergey Senozhatsky	ee98016010	zram: remove request_queue from struct zram `struct zram' contains both `struct gendisk' and `struct request_queue'. the latter can be deleted, because zram->disk carries ->queue pointer, and ->queue carries zram pointer: create_device() zram->queue->queuedata = zram zram->disk->queue = zram->queue zram->disk->private_data = zram so zram->queue is not needed, we can access all necessary data anyway. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:12 -08:00
Minchan Kim	08eee69fcf	zram: remove init_lock in zram_make_request Admin could reset zram during I/O operation going on so we have used zram->init_lock as read-side lock in I/O path to prevent sudden zram meta freeing. However, the init_lock is really troublesome. We can't do call zram_meta_alloc under init_lock due to lockdep splat because zram_rw_page is one of the function under reclaim path and hold it as read_lock while other places in process context hold it as write_lock. So, we have used allocation out of the lock to avoid lockdep warn but it's not good for readability and fainally, I met another lockdep splat between init_lock and cpu_hotplug from kmem_cache_destroy during working zsmalloc compaction. :( Yes, the ideal is to remove horrible init_lock of zram in rw path. This patch removes it in rw path and instead, add atomic refcount for meta lifetime management and completion to free meta in process context. It's important to free meta in process context because some of resource destruction needs mutex lock, which could be held if we releases the resource in reclaim context so it's deadlock, again. As a bonus, we could remove init_done check in rw path because zram_meta_get will do a role for it, instead. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Ganesh Mahendran <opensource.ganesh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:12 -08:00
Minchan Kim	2b269ce6fc	zram: check bd_openers instead of bd_holders bd_holders is increased only when user open the device file as FMODE_EXCL so if something opens zram0 as !FMODE_EXCL and request I/O while another user reset zram0, we can see following warning. zram0: detected capacity change from 0 to 64424509440 Buffer I/O error on dev zram0, logical block 180823, lost async page write Buffer I/O error on dev zram0, logical block 180824, lost async page write Buffer I/O error on dev zram0, logical block 180825, lost async page write Buffer I/O error on dev zram0, logical block 180826, lost async page write Buffer I/O error on dev zram0, logical block 180827, lost async page write Buffer I/O error on dev zram0, logical block 180828, lost async page write Buffer I/O error on dev zram0, logical block 180829, lost async page write Buffer I/O error on dev zram0, logical block 180830, lost async page write Buffer I/O error on dev zram0, logical block 180831, lost async page write Buffer I/O error on dev zram0, logical block 180832, lost async page write ------------[ cut here ]------------ WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210() Modules linked in: CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x45/0x57 warn_slowpath_common+0x8a/0xc0 warn_slowpath_null+0x1a/0x20 __blkdev_put+0x1d7/0x210 blkdev_put+0x50/0x130 blkdev_close+0x25/0x30 __fput+0xdf/0x1e0 ____fput+0xe/0x10 task_work_run+0xa7/0xe0 do_notify_resume+0x49/0x60 int_signal+0x12/0x17 ---[ end trace 274fbbc5664827d2 ]--- The warning comes from bdev_write_node in blkdev_put path. static void bdev_write_inode(struct inode *inode) { spin_lock(&inode->i_lock); while (inode->i_state & I_DIRTY) { spin_unlock(&inode->i_lock); WARN_ON_ONCE(write_inode_now(inode, true)); <========= here. spin_lock(&inode->i_lock); } spin_unlock(&inode->i_lock); } The reason is dd process encounters I/O fails due to sudden block device disappear so in filemap_check_errors in __writeback_single_inode returns -EIO. If we check bd_openers instead of bd_holders, we could address the problem. When I see the brd, it already have used it rather than bd_holders so although I'm not a expert of block layer, it seems to be better. I can make following warning with below simple script. In addition, I added msleep(2000) below set_capacity(zram->disk, 0) after applying your patch to make window huge(Kudos to Ganesh!) script: echo $((60<<30)) > /sys/block/zram0/disksize setsid dd if=/dev/zero of=/dev/zram0 & sleep 1 setsid echo 1 > /sys/block/zram0/reset Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Ganesh Mahendran <opensource.ganesh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:11 -08:00
Sergey Senozhatsky	a096cafc31	zram: rework reset and destroy path We need to return set_capacity(disk, 0) from reset_store() back to zram_reset_device(), a catch by Ganesh Mahendran. Potentially, we can race set_capacity() calls from init and reset paths. The problem is that zram_reset_device() is also getting called from zram_exit(), which performs operations in misleading reversed order -- we first create_device() and then init it, while zram_exit() perform destroy_device() first and then does zram_reset_device(). This is done to remove sysfs group before we reset device, so we can continue with device reset/destruction not being raced by sysfs attr write (f.e. disksize). Apart from that, destroy_device() releases zram->disk (but we still have ->disk pointer), so we cannot acces zram->disk in later zram_reset_device() call, which may cause additional errors in the future. So, this patch rework and cleanup destroy path. 1) remove several unneeded goto labels in zram_init() 2) factor out zram_init() error path and zram_exit() into destroy_devices() function, which takes the number of devices to destroy as its argument. 3) remove sysfs group in destroy_devices() first, so we can reorder operations -- reset device (as expected) goes before disk destroy and queue cleanup. So we can always access ->disk in zram_reset_device(). 4) and, finally, return set_capacity() back under ->init_lock. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:11 -08:00
Sergey Senozhatsky	ba6b17d68c	zram: fix umount-reset_store-mount race condition Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to avoid ->bd_holders race condition: CPU0 CPU1 umount /* zram->init_done is true */ reset_store() bdev->bd_holders == 0 mount ... zram_make_request() zram_reset_device() However, his solution required some considerable amount of code movement, which we can avoid. Apart from using bdev->bd_mutex in reset_store(), this patch also simplifies zram_reset_device(). zram_reset_device() has a bool parameter reset_capacity which tells it whether disk capacity and itself disk should be reset. There are two zram_reset_device() callers: -- zram_exit() passes reset_capacity=false -- reset_store() passes reset_capacity=true So we can move reset_capacity-sensitive work out of zram_reset_device() and perform it unconditionally in reset_store(). This also lets us drop reset_capacity parameter from zram_reset_device() and pass zram pointer only. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:11 -08:00
Ganesh Mahendran	1fec117281	zram: free meta table in zram_meta_free zram_meta_alloc() and zram_meta_free() are a pair. In zram_meta_alloc(), meta table is allocated. So it it better to free it in zram_meta_free(). Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:11 -08:00
Sergey Senozhatsky	b817995832	zram: clean up zram_meta_alloc() A trivial cleanup of zram_meta_alloc() error handling. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-12 18:54:11 -08:00
Linus Torvalds	8494bcf5b7	Merge branch 'for-3.20/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: "This contains: - The 4k/partition fixes for brd from Boaz/Matthew. - A few xen front/back block fixes from David Vrabel and Roger Pau Monne. - Floppy changes from Takashi, cleaning the device file creation. - Switching libata to use the new blk-mq tagging policy, removing code (and a suboptimal implementation) from libata. This will throw you a merge conflict, since a bug in the original libata tagging code was fixed since this code was branched. Trivial. From Shaohua. - Conversion of loop to blk-mq, from Ming Lei. - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra. He claims it improves on unreadable code, which will cost him a beer. - Maintainer update or NDB, now handled by Markus Pargmann. - NVMe: - Optimization from me that avoids a kmalloc/kfree per IO for smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU overhead. - Removal of (now) dead RCU code, a relic from before NVMe was converted to blk-mq" * 'for-3.20/drivers' of git://git.kernel.dk/linux-block: xen-blkback: default to X86_32 ABI on x86 xen-blkfront: fix accounting of reqs when migrating xen-blkback,xen-blkfront: add myself as maintainer block: Simplify bsg complete all floppy: Avoid manual call of device_create_file() NVMe: avoid kmalloc/kfree for smaller IO MAINTAINERS: Update NBD maintainer libata: make sata_sil24 use fifo tag allocator libata: move sas ata tag allocation to libata-scsi.c libata: use blk taging NVMe: within nvme_free_queues(), delete RCU sychro/deferred free null_blk: suppress invalid partition info brd: Request from fdisk 4k alignment brd: Fix all partitions BUGs axonram: Fix bug in direct_access loop: add blk-mq.h include block: loop: don't handle REQ_FUA explicitly block: loop: introduce lo_discard() and lo_req_flush() block: loop: say goodby to bio block: loop: improve performance via blk-mq	2015-02-12 14:30:53 -08:00
Linus Torvalds	3e12cefbe1	Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "This contains: - A series from Christoph that cleans up and refactors various parts of the REQ_BLOCK_PC handling. Contributions in that series from Dongsu Park and Kent Overstreet as well. - CFQ: - A bug fix for cfq for realtime IO scheduling from Jeff Moyer. - A stable patch fixing a potential crash in CFQ in OOM situations. From Konstantin Khlebnikov. - blk-mq: - Add support for tag allocation policies, from Shaohua. This is a prep patch enabling libata (and other SCSI parts) to use the blk-mq tagging, instead of rolling their own. - Various little tweaks from Keith and Mike, in preparation for DM blk-mq support. - Minor little fixes or tweaks from me. - A double free error fix from Tony Battersby. - The partition 4k issue fixes from Matthew and Boaz. - Add support for zero+unprovision for blkdev_issue_zeroout() from Martin" * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits) block: remove unused function blk_bio_map_sg block: handle the null_mapped flag correctly in blk_rq_map_user_iov blk-mq: fix double-free in error path block: prevent request-to-request merging with gaps if not allowed blk-mq: make blk_mq_run_queues() static dm: fix multipath regression due to initializing wrong request cfq-iosched: handle failure of cfq group allocation block: Quiesce zeroout wrapper block: rewrite and split __bio_copy_iov() block: merge __bio_map_user_iov into bio_map_user_iov block: merge __bio_map_kern into bio_map_kern block: pass iov_iter to the BLOCK_PC mapping functions block: add a helper to free bio bounce buffer pages block: use blk_rq_map_user_iov to implement blk_rq_map_user block: simplify bio_map_kern block: mark blk-mq devices as stackable block: keep established cmd_flags when cloning into a blk-mq request block: add blk-mq support to blk_insert_cloned_request() block: require blk_rq_prep_clone() be given an initialized clone request blk-mq: add tag allocation policy ...	2015-02-12 14:13:23 -08:00
Linus Torvalds	bdccc4edeb	xen: features and fixes for 3.20-rc0 - Reworked handling for foreign (grant mapped) pages to simplify the code, enable a number of additional use cases and fix a number of long-standing bugs. - Prefer the TSC over the Xen PV clock when dom0 (and the TSC is stable). - Assorted other cleanup and minor bug fixes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQEcBAABAgAGBQJU2JC+AAoJEFxbo/MsZsTRIvAH/1lgQ0EQlxaZtEFWY8cJBzxY dXaTMfyGQOddGYDCW0r42hhXJHeX7DWXSERSD3aW9DZOn/eYdneHq9gWRD4uPrGn hEFQ26J4jZWR5riGXaja0LqI2gJKLZ6BhHIQciLEbY+jw4ynkNBLNRPFehuwrCsZ WdBwJkyvXC3RErekncRl/aNhxdi4p1P6qeiaW/mo3UcSO/CFSKybOLwT65iePazg XuY9UiTn2+qcRkm/tjx8K9heHK8SBEGNWuoTcWYF1to8mwwUfKIAc4NO2UBDXJI+ rp7Z2lVFdII15JsQ08ATh3t7xDrMWLzCX/y4jCzmF3DBXLbSWdHCQMgI7TWt5pE= =PyJK -----END PGP SIGNATURE----- Merge tag 'stable/for-linus-3.20-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen features and fixes from David Vrabel: - Reworked handling for foreign (grant mapped) pages to simplify the code, enable a number of additional use cases and fix a number of long-standing bugs. - Prefer the TSC over the Xen PV clock when dom0 (and the TSC is stable). - Assorted other cleanup and minor bug fixes. * tag 'stable/for-linus-3.20-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (25 commits) xen/manage: Fix USB interaction issues when resuming xenbus: Add proper handling of XS_ERROR from Xenbus for transactions. xen/gntdev: provide find_special_page VMA operation xen/gntdev: mark userspace PTEs as special on x86 PV guests xen-blkback: safely unmap grants in case they are still in use xen/gntdev: safely unmap grants in case they are still in use xen/gntdev: convert priv->lock to a mutex xen/grant-table: add a mechanism to safely unmap pages that are in use xen-netback: use foreign page information from the pages themselves xen: mark grant mapped pages as foreign xen/grant-table: add helpers for allocating pages x86/xen: require ballooned pages for grant maps xen: remove scratch frames for ballooned pages and m2p override xen/grant-table: pre-populate kernel unmap ops for xen_gnttab_unmap_refs() mm: add 'foreign' alias for the 'pinned' page flag mm: provide a find_special_page vma operation x86/xen: cleanup arch/x86/xen/mmu.c x86/xen: add some __init annotations in arch/x86/xen/mmu.c x86/xen: add some __init and static annotations in arch/x86/xen/setup.c x86/xen: use correct types for addresses in arch/x86/xen/setup.c ...	2015-02-10 13:56:56 -08:00
David Vrabel	b042a3ca94	xen-blkback: default to X86_32 ABI on x86 Prior to the existance of 64-bit backends using the X86_64 ABI, frontends used the X86_32 ABI. These old frontends do not specify the ABI and when used with a 64-bit backend do not work. On x86, default to the X86_32 ABI if one is not specified. Backends on ARM continue to default to their NATIVE ABI. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>	2015-02-10 16:04:46 +00:00
Roger Pau Monne	3bb8c98e56	xen-blkfront: fix accounting of reqs when migrating Current migration code uses blk_put_request in order to finish a request before requeuing it. This function doesn't update the statistics of the queue, which completely screws accounting. Use blk_end_request_all instead which properly updates the statistics of the queue. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-and-Tested-by: Ouyang Zhaowei (Charles) <ouyangzhaowei@huawei.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: xen-devel@lists.xenproject.org	2015-02-10 16:04:45 +00:00
Takashi Iwai	b7f120b211	floppy: Avoid manual call of device_create_file() Use the static attribute groups assigned to the device instead of calling device_create_file() after the device registration. Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2015-02-03 13:00:36 +01:00
Jens Axboe	ac3dd5bd12	NVMe: avoid kmalloc/kfree for smaller IO Currently we allocate an nvme_iod for each IO, which holds the sg list, prps, and other IO related info. Set a threshold of 2 pages and/or 8KB of data, below which we can just embed this in the per-command pdu in blk-mq. For any IO at or below NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and kfree. For higher IOPS, this saves up to 1% of CPU time. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Keith Busch <keith.busch@intel.com>	2015-01-29 09:25:34 -08:00
Jennifer Herbert	c43cf3ea83	xen-blkback: safely unmap grants in case they are still in use Use gnttab_unmap_refs_async() to wait until the mapped pages are no longer in use before unmapping them. This allows blkback to use network storage which may retain refs to pages in queued skbs after the block I/O has completed. Signed-off-by: Jennifer Herbert <jennifer.herbert@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jens Axboe <axboe@kernel.de> Signed-off-by: David Vrabel <david.vrabel@citrix.com>	2015-01-28 14:03:16 +00:00
David Vrabel	ff4b156f16	xen/grant-table: add helpers for allocating pages Add gnttab_alloc_pages() and gnttab_free_pages() to allocate/free pages suitable to for granted maps. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>	2015-01-28 14:03:12 +00:00
Ilya Dryomov	e69b8d414f	rbd: drop parent_ref in rbd_dev_unprobe() unconditionally This effectively reverts the last hunk of `392a9dad7e` ("rbd: detect when clone image is flattened"). The problem with parent_overlap != 0 condition is that it's possible and completely valid to have an image with parent_overlap == 0 whose parent state needs to be cleaned up on unmap. The next commit, which drops the "clone image now standalone" logic, opens up another window of opportunity to hit this, but even without it # cat parent-ref.sh #!/bin/bash rbd create --image-format 2 --size 1 foo rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd resize --allow-shrink --size 0 bar rbd resize --size 1 bar DEV=$(rbd map bar) rbd unmap $DEV leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client hanging around. My thinking behind calling rbd_dev_parent_put() unconditionally is that there shouldn't be any requests in flight at that point in time as we are deep into unmap sequence. Hence, even if rbd_dev_unparent() caused by flatten is delayed by in-flight requests, it will have finished by the time we reach rbd_dev_unprobe() caused by unmap, thus turning unconditional rbd_dev_parent_put() into a no-op. Fixes: http://tracker.ceph.com/issues/10352 Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-01-28 16:12:02 +03:00
Ilya Dryomov	ae43e9d05e	rbd: fix rbd_dev_parent_get() when parent_overlap == 0 The comment for rbd_dev_parent_get() said * We must get the reference before checking for the overlap to * coordinate properly with zeroing the parent overlap in * rbd_dev_v2_parent_info() when an image gets flattened. We * drop it again if there is no overlap. but the "drop it again if there is no overlap" part was missing from the implementation. This lead to absurd parent_ref values for images with parent_overlap == 0, as parent_ref was incremented for each img_request and virtually never decremented. Fix this by leveraging the fact that refresh path calls rbd_dev_v2_parent_info() under header_rwsem and use it for read in rbd_dev_parent_get(), instead of messing around with atomics. Get rid of barriers in rbd_dev_v2_parent_info() while at it - I don't see what they'd pair with now and I suspect we are in a pretty miserable situation as far as proper locking goes regardless. Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-01-28 16:11:51 +03:00
Jens Axboe	a4a1cc16a7	Merge branch 'for-3.20/core' into for-3.20/drivers We need the tagging changes for the libata conversion.	2015-01-23 14:18:49 -07:00
Shaohua Li	ee1b6f7aff	block: support different tag allocation policy The libata tag allocation is using a round-robin policy. Next patch will make libata use block generic tag allocation, so let's add a policy to tag allocation. Currently two policies: FIFO (default) and round-robin. Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-23 14:15:46 -07:00
kaoudis	121c7ad4ef	NVMe: within nvme_free_queues(), delete RCU sychro/deferred free Converting from to blk-queue got rid of the driver's RCU locking-on-queue, so removing unnecessary RCU locking-on-queue artefacts. Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Kelly Nicole Kaoudis <kaoudis@colorado.edu> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-21 21:51:57 -07:00
Martin K. Petersen	d93ba7a5a9	block: Add discard flag to blkdev_issue_zeroout() function blkdev_issue_discard() will zero a given block range. This is done by way of explicit writing, thus provisioning or allocating the blocks on disk. There are use cases where the desired behavior is to zero the blocks but unprovision them if possible. The blocks must deterministically contain zeroes when they are subsequently read back. This patch adds a flag to blkdev_issue_zeroout() that provides this variant. If the discard flag is set and a block device guarantees discard_zeroes_data we will use REQ_DISCARD to clear the block range. If the device does not support discard_zeroes_data or if the discard request fails we will fall back to first REQ_WRITE_SAME and then a regular REQ_WRITE. Also update the callers of blkdev_issue_zero() to reflect the new flag and make sb_issue_zeroout() prefer the discard approach. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-21 10:41:46 -07:00
Michael S. Tsirkin	bb6ec57600	virtio_blk: coding style fixes Most of our code has struct foo { } Fix two instances where blk is inconsistent. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-01-21 16:28:57 +10:30
Michael S. Tsirkin	a4379fd841	virtio/blk: verify device has config space Some devices might not implement config space access (e.g. remoteproc used not to - before 3.9). virtio/blk needs config space access so make it fail gracefully if not there. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-01-21 16:28:45 +10:30
Jens Axboe	227290b469	null_blk: suppress invalid partition info null_blk is partitionable, but it doesn't store any of the info. When it is loaded, you would normally see: [1226739.343608] nullb0: unknown partition table [1226739.343746] nullb1: unknown partition table which can confuse some people. Add the appropriate gendisk flag to suppress this info. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-16 16:02:24 -07:00
Jens Axboe	6222d1721d	NVMe: cq_vector should be signed This was inadvertently dropped from an earlier commit, otherwise the check against cq_vector == -1 to prevent double free doesn't make any sense. Fixes: `2b25d98179` Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-15 15:19:10 -07:00
Boaz Harrosh	c8fa31730f	brd: Request from fdisk 4k alignment Because of the direct_access() API which returns a PFN. partitions better start on 4K boundary, else offset ZERO of a partition will not be aligned and blk_direct_access() will fail the call. By setting blk_queue_physical_block_size(PAGE_SIZE) we can communicate this to fdisk and friends. The call to blk_queue_physical_block_size() is harmless and will not affect the Kernel behavior in any way. It is only for communication to user-mode. before this patch running fdisk on a default size brd of 4M the first sector offered is 34 (BAD), but after this patch it will be 40, ie 8 sectors aligned. Also when entering some random partition sizes the next partition-start sector is offered 8 sectors aligned after this patch. (Please note that with fdisk the user can still enter bad values, only the offered default values will be correct) Note that with bdev-size > 4M fdisk will try to align on a 1M boundary (above first-sector will be 2048), in any case. CC: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-13 21:59:15 -07:00
Boaz Harrosh	937af5ecd0	brd: Fix all partitions BUGs This patch fixes up brd's partitions scheme, now enjoying all worlds. The MAIN fix here is that currently, if one fdisks some partitions, a BAD bug will make all partitions point to the same start-end sector ie: 0 - brd_size And an mkfs of any partition would trash the partition table and the other partition. Another fix is that "mount -U uuid" will not work if show_part was not specified, because of the GENHD_FL_SUPPRESS_PARTITION_INFO flag. We now always load without it and remove the show_part parameter. [We remove Dmitry's new module-param part_show it is now always show] So NOW the logic goes like this: * max_part - Just says how many minors to reserve between ramX devices. In any way, there can be as many partition as requested. If minors between devices ends, then dynamic 259-major ids will be allocated on the fly. The default is now max_part=1, which means all partitions devt(s) will be from the dynamic (259) major-range. (If persistent partition minors is needed use max_part=X) For example with /dev/sdX max_part is hard coded 16. * Creation of new devices on the fly still/always work: mknod /path/devnod b 1 X fdisk -l /path/devnod Will create a new device if [X / max_part] was not already created before. (Just as before) partitions on the dynamically created device will work as well Same logic applies with minors as with the pre-created ones. TODO: dynamic grow of device size. So each device can have it's own size. CC: Dmitry Monakhov <dmonakhov@openvz.org> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-13 21:59:03 -07:00
Jens Axboe	d4119ee0e1	Merge branch 'for-3.20/core' into for-3.20/drivers	2015-01-13 21:58:45 -07:00
Matthew Wilcox	dd22f551ac	block: Change direct_access calling convention In order to support accesses to larger chunks of memory, pass in a 'size' parameter (counted in bytes), and return the amount available at that address. Add a new helper function, bdev_direct_access(), to handle common functionality including partition handling, checking the length requested is positive, checking for the sector being page-aligned, and checking the length of the request does not pass the end of the partition. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-13 21:58:11 -07:00
Keith Busch	7a509a6b07	NVMe: Fix locking on abort handling The queues and device need to be locked when messing with them. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 09:02:23 -07:00
Keith Busch	c9d3bf8810	NVMe: Start and stop h/w queues on reset This freezes and stops all the queues on device shutdown and restarts them on resume. This fixes hotplug and reset issues when the controller is actively being used. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 09:02:20 -07:00
Keith Busch	cef6a94827	NVMe: Command abort handling fixes Aborts all requeued commands prior to killing the request_queue. For commands that time out on a dying request queue, set the "Do Not Retry" bit on the command status so the command cannot be requeued. Finanally, if the driver is requested to abort a command it did not start, do nothing. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 09:02:18 -07:00
Keith Busch	0fb59cbc5f	NVMe: Admin queue removal handling This protects admin queue access on shutdown. When the controller is disabled, the queue is frozen to prevent new entry, and unfrozen on resume, and fixes cq_vector signedness to not suspend a queue twice. Since unfreezing the queue makes it available for commands, it requires the queue be initialized, so this moves this part after that. Special handling is done when the device is unresponsive during shutdown. This can be optimized to not require subsequent commands to timeout, but saving that fix for later. This patch also removes the kill signals in this path that were left-over artifacts from the blk-mq conversion and no longer necessary. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 09:02:08 -07:00
Keith Busch	ea191d2f36	NVMe: Reference count admin queue usage Since there is no gendisk associated with the admin queue, the driver needs to hold a reference to it until all open references to the controller are closed. This also combines queue cleanup with freeing the tag set since these should not be separate. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 09:00:32 -07:00
Keith Busch	c917dfe528	NVMe: Start all requests Once the nvme callback is set for a request, the driver can start it and make it available for timeout handling. For timed out commands on a device that is not initialized, this fixes potential deadlocks that can occur on startup and shutdown when a device is unresponsive since they can now be cancelled. Asynchronous requests do not have any expected timeout, so these are using the new "REQ_NO_TIMEOUT" request flags. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 09:00:29 -07:00
Jens Axboe	78e367a360	loop: add blk-mq.h include Looks like we pull it in through other ways on x86, but we fail on sparc: In file included from drivers/block/cryptoloop.c:30:0: drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type struct blk_mq_tag_set tag_set; Add the include to loop.h, kill it from loop.c. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:20:25 -07:00
Ming Lei	af65aa8ea7	block: loop: don't handle REQ_FUA explicitly block core handles REQ_FUA by its flush state machine, so won't do it in loop explicitly. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:07:49 -07:00
Ming Lei	cf655d9534	block: loop: introduce lo_discard() and lo_req_flush() No behaviour change, just move the handling for REQ_DISCARD and REQ_FLUSH in these two functions. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:07:49 -07:00
Ming Lei	3011201346	block: loop: say goodby to bio Switch to block request completely. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:07:49 -07:00
Ming Lei	b5dd2f6047	block: loop: improve performance via blk-mq The conversion is a bit straightforward, and use work queue to dispatch requests of loop block, and one big change is that requests is submitted to backend file/device concurrently with work queue, so throughput may get improved much. Given write requests over same file are often run exclusively, so don't handle them concurrently for avoiding extra context switch cost, possible lock contention and work schedule cost. Also with blk-mq, there is opportunity to get loop I/O merged before submitting to backend file/device. In the following test: - base: v3.19-rc2-2041231 - loop over file in ext4 file system on SSD disk - bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1 - throughput: IOPS ------------------------------------------------------ \| \| base \| base with loop-mq \| delta \| ------------------------------------------------------ \| randread \| 1740 \| 25318 \| +1355%\| ------------------------------------------------------ \| read \| 42196 \| 51771 \| +22.6%\| ----------------------------------------------------- \| randwrite \| 35709 \| 34624 \| -3% \| ----------------------------------------------------- \| write \| 39137 \| 40326 \| +3% \| ----------------------------------------------------- So loop-mq can improve throughput for both read and randread, meantime, performance of write and randwrite isn't hurted basically. Another benefit is that loop driver code gets simplified much after blk-mq conversion, and the patch can be thought as cleanup too. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:07:49 -07:00
Ming Lei	35b489d32f	block: fix checking return value of blk_mq_init_queue Check IS_ERR_OR_NULL(return value) instead of just return value. Signed-off-by: Ming Lei <ming.lei@canonical.com> Reduced to IS_ERR() by me, we never return NULL. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 10:32:02 -07:00
Keith Busch	2b25d98179	NVMe: Fix double free irq Sets the vector to an invalid value after it's freed so we don't free it twice. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-22 12:59:04 -07:00
Linus Torvalds	57666509b7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "The big item here is support for inline data for CephFS and for message signatures from Zheng. There are also several bug fixes, including interrupted flock request handling, 0-length xattrs, mksnap, cached readdir results, and a message version compat field. Finally there are several cleanups from Ilya, Dan, and Markus. Note that there is another series coming soon that fixes some bugs in the RBD 'lingering' requests, but it isn't quite ready yet" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) ceph: fix setting empty extended attribute ceph: fix mksnap crash ceph: do_sync is never initialized libceph: fixup includes in pagelist.h ceph: support inline data feature ceph: flush inline version ceph: convert inline data to normal data before data write ceph: sync read inline data ceph: fetch inline data when getting Fcr cap refs ceph: use getattr request to fetch inline data ceph: add inline data to pagecache ceph: parse inline data in MClientReply and MClientCaps libceph: specify position of extent operation libceph: add CREATE osd operation support libceph: add SETXATTR/CMPXATTR osd operations support rbd: don't treat CEPH_OSD_OP_DELETE as extent op ceph: remove unused stringification macros libceph: require cephx message signature by default ceph: introduce global empty snap context ceph: message versioning fixes ...	2014-12-17 16:03:12 -08:00
Ilya Dryomov	7e868b6eff	rbd: don't treat CEPH_OSD_OP_DELETE as extent op CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This sneaked in with discard patches - it's one of the three osd ops (the other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard is implemented with. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-12-17 20:09:51 +03:00
SF Markus Elfring	e96a650a81	ceph, rbd: delete unnecessary checks before two function calls The functions ceph_put_snap_context() and iput() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> [idryomov@redhat.com: squashed rbd.c hunk, changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:50 +03:00
Linus Torvalds	e6b5be2be4	Driver core patches for 3.19-rc1 Here's the set of driver core patches for 3.19-rc1. They are dominated by the removal of the .owner field in platform drivers. They touch a lot of files, but they are "simple" changes, just removing a line in a structure. Other than that, a few minor driver core and debugfs changes. There are some ath9k patches coming in through this tree that have been acked by the wireless maintainers as they relied on the debugfs changes. Everything has been in linux-next for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEABECAAYFAlSOD20ACgkQMUfUDdst+ylLPACg2QrW1oHhdTMT9WI8jihlHVRM 53kAoLeteByQ3iVwWurwwseRPiWa8+MI =OVRS -----END PGP SIGNATURE----- Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core update from Greg KH: "Here's the set of driver core patches for 3.19-rc1. They are dominated by the removal of the .owner field in platform drivers. They touch a lot of files, but they are "simple" changes, just removing a line in a structure. Other than that, a few minor driver core and debugfs changes. There are some ath9k patches coming in through this tree that have been acked by the wireless maintainers as they relied on the debugfs changes. Everything has been in linux-next for a while" * tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits) Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries" fs: debugfs: add forward declaration for struct device type firmware class: Deletion of an unnecessary check before the function call "vunmap" firmware loader: fix hung task warning dump devcoredump: provide a one-way disable function device: Add dev_<level>_once variants ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries ath: use seq_file api for ath9k debugfs files debugfs: add helper function to create device related seq_file drivers/base: cacheinfo: remove noisy error boot message Revert "core: platform: add warning if driver has no owner" drivers: base: support cpu cache information interface to userspace via sysfs drivers: base: add cpu_device_create to support per-cpu devices topology: replace custom attribute macros with standard DEVICE_ATTR* cpumask: factor out show_cpumap into separate helper function driver core: Fix unbalanced device reference in drivers_probe driver core: fix race with userland in device_add() sysfs/kernfs: make read requests on pre-alloc files use the buffer. sysfs/kernfs: allow attributes to request write buffer be pre-allocated. fs: sysfs: return EGBIG on write if offset is larger than file size ...	2014-12-14 16:10:09 -08:00
Linus Torvalds	9ea18f8cab	Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-block Pull block layer driver updates from Jens Axboe: - NVMe updates: - The blk-mq conversion from Matias (and others) - A stack of NVMe bug fixes from the nvme tree, mostly from Keith. - Various bug fixes from me, fixing issues in both the blk-mq conversion and generic bugs. - Abort and CPU online fix from Sam. - Hot add/remove fix from Indraneel. - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp) - With the generic IO stat accounting from 3.19/core, converting md, bcache, and rsxx to use those. From Gu Zheng. - Boundary check for queue/irq mode for null_blk from Matias. Fixes cases where invalid values could be given, causing the device to hang. - The xen blkfront pull request, with two bug fixes from Vitaly. * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits) NVMe: fix race condition in nvme_submit_sync_cmd() NVMe: fix retry/error logic in nvme_queue_rq() NVMe: Fix FS mount issue (hot-remove followed by hot-add) NVMe: fix error return checking from blk_mq_alloc_request() NVMe: fix freeing of wrong request in abort path xen/blkfront: remove redundant flush_op xen/blkfront: improve protection against issuing unsupported REQ_FUA NVMe: Fix command setup on IO retry null_blk: boundary check queue_mode and irqmode block/rsxx: use generic io stats accounting functions to simplify io stat accounting md: use generic io stats accounting functions to simplify io stat accounting drbd: use generic io stats accounting functions to simplify io stat accounting md/bcache: use generic io stats accounting functions to simplify io stat accounting NVMe: Update module version major number NVMe: fail pci initialization if the device doesn't have any BARs NVMe: add ->exit_hctx() hook NVMe: make setup work for devices that don't do INTx NVMe: enable IO stats by default NVMe: nvme_submit_async_admin_req() must use atomic rq allocation NVMe: replace blk_put_request() with blk_mq_free_request() ...	2014-12-13 14:22:26 -08:00
Linus Torvalds	caf292ae5b	Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block Pull block driver core update from Jens Axboe: "This is the pull request for the core block IO changes for 3.19. Not a huge round this time, mostly lots of little good fixes: - Fix a bug in sysfs blktrace interface causing a NULL pointer dereference, when enabled/disabled through that API. From Arianna Avanzini. - Various updates/fixes/improvements for blk-mq: - A set of updates from Bart, mostly fixing buts in the tag handling. - Cleanup/code consolidation from Christoph. - Extend queue_rq API to be able to handle batching issues of IO requests. NVMe will utilize this shortly. From me. - A few tag and request handling updates from me. - Cleanup of the preempt handling for running queues from Paolo. - Prevent running of unmapped hardware queues from Ming Lei. - Move the kdump memory limiting check to be in the correct location, from Shaohua. - Initialize all software queues at init time from Takashi. This prevents a kobject warning when CPUs are brought online that weren't online when a queue was registered. - Single writeback fix for I_DIRTY clearing from Tejun. Queued with the core IO changes, since it's just a single fix. - Version X of the __bio_add_page() segment addition retry from Maurizio. Hope the Xth time is the charm. - Documentation fixup for IO scheduler merging from Jan. - Introduce (and use) generic IO stat accounting helpers for non-rq drivers, from Gu Zheng. - Kill off artificial limiting of max sectors in a request from Christoph" * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits) bio: modify __bio_add_page() to accept pages that don't start a new segment blk-mq: Fix uninitialized kobject at CPU hotplugging blktrace: don't let the sysfs interface remove trace from running list blk-mq: Use all available hardware queues blk-mq: Micro-optimize bt_get() blk-mq: Fix a race between bt_clear_tag() and bt_get() blk-mq: Avoid that __bt_get_word() wraps multiple times blk-mq: Fix a use-after-free blk-mq: prevent unmapped hw queue from being scheduled blk-mq: re-check for available tags after running the hardware queue blk-mq: fix hang in bt_get() blk-mq: move the kdump check to blk_mq_alloc_tag_set blk-mq: cleanup tag free handling blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map blk: introduce generic io stat accounting help function blk-mq: handle the single queue case in blk_mq_hctx_next_cpu genhd: check for int overflow in disk_expand_part_tbl() blk-mq: add blk_mq_free_hctx_request() blk-mq: export blk_mq_free_request() blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable ...	2014-12-13 14:14:23 -08:00
Linus Torvalds	78a45c6f06	Merge branch 'akpm' (second patch-bomb from Andrew) Merge second patchbomb from Andrew Morton: - the rest of MM - misc fs fixes - add execveat() syscall - new ratelimit feature for fault-injection - decompressor updates - ipc/ updates - fallocate feature creep - fsnotify cleanups - a few other misc things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (99 commits) cgroups: Documentation: fix trivial typos and wrong paragraph numberings parisc: percpu: update comments referring to __get_cpu_var percpu: update local_ops.txt to reflect this_cpu operations percpu: remove __get_cpu_var and __raw_get_cpu_var macros fsnotify: remove destroy_list from fsnotify_mark fsnotify: unify inode and mount marks handling fallocate: create FAN_MODIFY and IN_MODIFY events mm/cma: make kmemleak ignore CMA regions slub: fix cpuset check in get_any_partial slab: fix cpuset check in fallback_alloc shmdt: use i_size_read() instead of ->i_size ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments ipc/msg: increase MSGMNI, remove scaling ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM ipc/sem.c: change memory barrier in sem_lock() to smp_rmb() lib/decompress.c: consistency of compress formats for kernel image decompress_bunzip2: off by one in get_next_block() usr/Kconfig: make initrd compression algorithm selection not expert fault-inject: add ratelimit option ratelimit: add initialization macro ...	2014-12-13 13:00:36 -08:00
Ganesh Mahendran	083914eab9	zram: use DEVICE_ATTR_[RW\|RO\|WO] to define zram sys device attribute In current zram, we use DEVICE_ATTR() to define sys device attributes. SO, we need to set (S_IRUGO \| S_IWUSR) permission and other arguments manually. Linux already provids the macro DEVICE_ATTR_[RW\|RO\|WO] to define sys device attribute. It is simple and readable. This patch uses kernel defined macro DEVICE_ATTR_[RW\|RO\|WO] to define zram device attribute. Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-12-13 12:42:50 -08:00
Mahendran Ganesh	d49b1c254c	mm/zram: correct ZRAM_ZERO flag bit position In struct zram_table_entry, the element value contains obj size and obj zram flags. Bit 0 to bit (ZRAM_FLAG_SHIFT - 1) represent obj size, and bit ZRAM_FLAG_SHIFT to the highest bit of unsigned long represent obj zram_flags. So the first zram flag(ZRAM_ZERO) should be from ZRAM_FLAG_SHIFT instead of (ZRAM_FLAG_SHIFT + 1). This patch fixes this cosmetic issue. Also fix a typo, "page in now accessed" -> "page is now accessed" Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-12-13 12:42:50 -08:00
karam.lee	8c7f01025f	zram: implement rw_page operation of zram This patch implements rw_page operation for zram block device. I implemented the feature in zram and tested it. Test bed was the G2, LG electronic mobile device, whtich has msm8974 processor and 2GB memory. With a memory allocation test program consuming memory, the system generates swap. Operating time of swap_write_page() was measured. -------------------------------------------------- \| \| operating time \| improvement \| \| \| (20 runs average) \| \| -------------------------------------------------- \|with patch \| 1061.15 us \| +2.4% \| -------------------------------------------------- \|without patch\| 1087.35 us \| \| -------------------------------------------------- Each test(with paged_io,with BIO) result set shows normal distribution and has equal variance. I mean the two values are valid result to compare. I can say operation with paged I/O(without BIO) is faster 2.4% with confidence level 95%. [minchan@kernel.org: make rw_page opeartion return 0] [minchan@kernel.org: rely on the bi_end_io for zram_rw_page fails] [sergey.senozhatsky@gmail.com: code cleanup] [minchan@kernel.org: add comment] Signed-off-by: karam.lee <karam.lee@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: <seungho1.park@lge.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-12-13 12:42:50 -08:00
karam.lee	54850e73e8	zram: change parameter from vaild_io_request() This patch changes parameter of valid_io_request for common usage. The purpose of valid_io_request() is to determine if bio request is valid or not. This patch use I/O start address and size instead of a BIO parameter for common usage. Signed-off-by: karam.lee <karam.lee@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: <seungho1.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-12-13 12:42:50 -08:00
karam.lee	b627cff3d3	zram: remove bio parameter from zram_bvec_rw() Recently rw_page block device operation has been added. This patchset implements rw_page operation for zram block device and does some clean-up. This patch (of 3): Remove an unnecessary parameter(bio) from zram_bvec_rw() and zram_bvec_read(). zram_bvec_read() doesn't use a bio parameter, so remove it. zram_bvec_rw() calls a read/write operation not using bio, so a rw parameter replaces a bio parameter. Signed-off-by: karam.lee <karam.lee@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: <seungho1.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-12-13 12:42:49 -08:00
Jens Axboe	849c6e7746	NVMe: fix race condition in nvme_submit_sync_cmd() If we have a race between the schedule timing out and the command completing, we could have the task issuing the command exit nvme_submit_sync_cmd() while the irq is running sync_completion(). If that happens, we could be corrupting memory, since the stack that held 'cmdinfo' is no longer valid. Fix this by always calling nvme_abort_cmd_info(). Once that call completes, we know that we have either run sync_completion() if the completion came in, or that we will never run it since we now have special_completion() as the command callback handler. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-12 08:53:40 -07:00
Dwight Engen	76e74bbe0a	sunvdc: reconnect ldc after vds service domain restarts This change enables the sunvdc driver to reconnect and recover if a vds service domain is disconnected or bounced. By default, it will wait indefinitely for the service domain to become available again, but will honor a non-zero vdc-timout md property if one is set. If a timeout is reached, any in-progress I/O's are completed with -EIO. Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Reviewed-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-11 18:52:45 -08:00
Dwight Engen	fe47c3c262	vio: create routines for inc,dec vio dring indexes Both sunvdc and sunvnet implemented distinct functionality for incrementing and decrementing dring indexes. Create common functions for use by both from the sunvnet versions, which were chosen since they will still work correctly in case a non power of two ring size is used. Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-11 18:52:45 -08:00
Dwight Engen	31f4888f51	sunvdc: fix module unload/reload Free resources allocated during port/disk probing so that the module may be successfully reloaded after unloading. Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-11 18:51:57 -08:00
Jens Axboe	fe54303ee2	NVMe: fix retry/error logic in nvme_queue_rq() The logic around retrying and erroring IO in nvme_queue_rq() is broken in a few ways: - If we fail allocating dma memory for a discard, we return retry. We have the 'iod' stored in ->special, but we free the 'iod'. - For a normal request, if we fail dma mapping of setting up prps, we have the same iod situation. Additionally, we haven't set the callback for the request yet, so we also potentially leak IOMMU resources. Get rid of the ->special 'iod' store. The retry is uncommon enough that it's not worth optimizing for or holding on to resources to attempt to speed it up. Additionally, it's usually best practice to free any request related resources when doing retries. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-11 13:58:39 -07:00
Linus Torvalds	6b9e2cea42	virtio: virtio 1.0 support, misc patches This adds a lot of infrastructure for virtio 1.0 support. Notable missing pieces: virtio pci, virtio balloon (needs spec extension), vhost scsi. Plus, there are some minor fixes in a couple of places. Cc: David Miller <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJUh1CVAAoJECgfDbjSjVRpWZcH/2+EGPyng7Lca820UHA0cU1U u4D8CAAwOGaVdnUUo8ox1eon3LNB2UgRtgsl3rBDR3YTgFfNPrfuYdnHO0dYIDc1 lS26NuPrVrTX0lA+OBPe2nlKrsrOkn8aw1kxG9Y0gKtNg/+HAGNW5e2eE7R/LrA5 94XbWZ8g9Yf4GPG1iFmih9vQvvN0E68zcUlojfCnllySgaIEYr8nTiGQBWpRgJat fCqFAp1HMDZzGJQO+m1/Vw0OftTRVybyfai59e6uUTa8x1djvzPb/1MvREqQjegM ylSuofIVyj7JPu++FbAjd9mikkb53GSc8ql3YmWNZLdr69rnkzP0GdzQvrdheAo= =RtrR -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: "virtio: virtio 1.0 support, misc patches This adds a lot of infrastructure for virtio 1.0 support. Notable missing pieces: virtio pci, virtio balloon (needs spec extension), vhost scsi. Plus, there are some minor fixes in a couple of places. Note: some net drivers are affected by these patches. David said he's fine with merging these patches through my tree. Rusty's on vacation, he acked using my tree for these, too" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (70 commits) virtio_ccw: finalize_features error handling virtio_ccw: future-proof finalize_features virtio_pci: rename virtio_pci -> virtio_pci_common virtio_pci: update file descriptions and copyright virtio_pci: split out legacy device support virtio_pci: setup config vector indirectly virtio_pci: setup vqs indirectly virtio_pci: delete vqs indirectly virtio_pci: use priv for vq notification virtio_pci: free up vq->priv virtio_pci: fix coding style for structs virtio_pci: add isr field virtio: drop legacy_only driver flag virtio_balloon: drop legacy_only driver flag virtio_ccw: rev 1 devices set VIRTIO_F_VERSION_1 virtio: allow finalize_features to fail virtio_ccw: legacy: don't negotiate rev 1/features virtio: add API to detect legacy devices virtio_console: fix sparse warnings vhost: remove unnecessary forward declarations in vhost.h ...	2014-12-11 12:20:31 -08:00
Indraneel M	285dffc910	NVMe: Fix FS mount issue (hot-remove followed by hot-add) After Hot-remove of a device with a mounted partition, when the device is hot-added again, the new node reappears as nvme0n1. Mounting this new node fails with the error: mount: mount /dev/nvme0n1p1 on /mnt failed: File exists. The old nodes's FS entries still exist and the kernel can't re-create procfs and sysfs entries for the new node with the same name. The patch fixes this issue. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Indraneel M <indraneel.m@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-11 08:24:18 -07:00
Linus Torvalds	cbfe0de303	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS changes from Al Viro: "First pile out of several (there _definitely_ will be more). Stuff in this one: - unification of d_splice_alias()/d_materialize_unique() - iov_iter rewrite - killing a bunch of ->f_path.dentry users (and f_dentry macro). Getting that completed will make life much simpler for unionmount/overlayfs, since then we'll be able to limit the places sensitive to file _dentry_ to reasonably few. Which allows to have file_inode(file) pointing to inode in a covered layer, with dentry pointing to (negative) dentry in union one. Still not complete, but much closer now. - crapectomy in lustre (dead code removal, mostly) - "let's make seq_printf return nothing" preparations - assorted cleanups and fixes There _definitely_ will be more piles" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) copy_from_iter_nocache() new helper: iov_iter_kvec() csum_and_copy_..._iter() iov_iter.c: handle ITER_KVEC directly iov_iter.c: convert copy_to_iter() to iterate_and_advance iov_iter.c: convert copy_from_iter() to iterate_and_advance iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() iov_iter.c: convert iov_iter_zero() to iterate_and_advance iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds iov_iter.c: convert iov_iter_npages() to iterate_all_kinds iov_iter.c: iterate_and_advance iov_iter.c: macros for iterating over iov_iter kill f_dentry macro dcache: fix kmemcheck warning in switch_names new helper: audit_file() nfsd_vfs_write(): use file_inode() ncpfs: use file_inode() kill f_dentry uses lockd: get rid of ->f_path.dentry->d_sb ...	2014-12-10 16:10:49 -08:00
Jens Axboe	d8ead9b763	Merge branch 'for-jens-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.19/drivers Konrad writes: These are two fixes for Xen blkfront. They harden how it deals with broken backends.	2014-12-10 13:58:17 -07:00
Jens Axboe	97fe383222	NVMe: fix error return checking from blk_mq_alloc_request() We return an error pointer or the request, not NULL. Half the call paths got it right, the others didn't. Fix those up. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-10 13:02:44 -07:00
Sam Bradshaw	c87fd5407e	NVMe: fix freeing of wrong request in abort path We allocate 'abort_req', but free 'req' in case of an error submitting the IO. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-10 13:00:31 -07:00
Vitaly Kuznetsov	fdf9b96503	xen/blkfront: remove redundant flush_op flush_op is unambiguously defined by feature_flush: REQ_FUA \| REQ_FLUSH -> BLKIF_OP_WRITE_BARRIER REQ_FLUSH -> BLKIF_OP_FLUSH_DISKCACHE 0 -> 0 and thus can be removed. This is just a cleanup. The patch was suggested by Boris Ostrovsky. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-12-10 12:20:18 -05:00
Vitaly Kuznetsov	ad42d391ae	xen/blkfront: improve protection against issuing unsupported REQ_FUA Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced in `d11e61583` and was factored out into blkif_request_flush_valid() in `0f1ca65ee`. However: 1) This check in incomplete. In case we negotiated to feature_flush = REQ_FLUSH and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA request will still pass the check. 2) blkif_request_flush_valid() is misnamed. It is bool but returns true when the request is invalid. 3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that -EOPNOTSUPP is more appropriate here. Fix all of the above issues. This patch is based on the original patch by Laszlo Ersek and a comment by Jeff Moyer. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-12-10 12:20:07 -05:00
Michael S. Tsirkin	51cdc3815f	virtio: drop VIRTIO_F_VERSION_1 from drivers Core activates this bit automatically now, drop it from drivers that set it explicitly. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2014-12-09 12:06:32 +02:00
Michael S. Tsirkin	38f37b578f	virtio_blk: fix race at module removal If a device appears while module is being removed, driver will get a callback after we've given up on the major number. In theory this means this major number can get reused by something else, resulting in a conflict. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>	2014-12-09 12:05:27 +02:00
Michael S. Tsirkin	393c525b5b	virtio_blk: make serial attribute static It's never declared so no need to make it extern. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>	2014-12-09 12:05:27 +02:00
Michael S. Tsirkin	19c1c5a64c	virtio_blk: v1.0 support Based on patch by Cornelia Huck. Note: for consistency, and to avoid sparse errors, convert all fields, even those no longer in use for virtio v1.0. Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2014-12-09 12:05:26 +02:00
Al Viro	ba00410b81	Merge branch 'iov_iter' into for-next	2014-12-08 20:39:29 -05:00
James Bottomley	dc843ef00e	Merge remote-tracking branch 'scsi-queue/core-for-3.19' into for-linus	2014-12-08 07:40:20 -08:00
Keith Busch	9af8785a38	NVMe: Fix command setup on IO retry On retry, the req->special is pointing to an already setup IOD, but we still need to setup the command context and callback, otherwise you'll see false twice completed errors and leak requests. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-03 19:10:45 -07:00
Matias Bjorling	709c8667ad	null_blk: boundary check queue_mode and irqmode When either queue_mode or irq_mode parameter is set outside its boundaries, the driver will not complete requests. This stalls driver initialization when partitions are probed. Fix by setting out of bound values to the parameters default. Signed-off-by: Matias Bjørling <m@bjorling.me> Updated by me to have the parse+check in just one function. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-26 14:45:48 -07:00
Hannes Reinecke	eb846d9f14	scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 SPC-3 defines SERVICE ACTION IN(12) and SERVICE ACTION IN(16). So rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 to be consistent with SPC and to allow for better distinction. Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Robert Elliott <elliott@hp.com> Reviewed-by: Robert Elliott <elliott@hp.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2014-11-24 20:01:40 +01:00
Gu Zheng	fa573f7279	block/rsxx: use generic io stats accounting functions to simplify io stat accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:05:18 -07:00
Gu Zheng	244808543e	drbd: use generic io stats accounting functions to simplify io stat accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:05:14 -07:00
Keith Busch	c78b47136f	NVMe: Update module version major number It's already near impossible to tell what bits someone is running based on a 'modinfo nvme', and I don't want to try guessing if someone is running blk-mq or bio-based. Let's make it obvious with the module version that the blk-mq conversion is a major change. Future bio-based versions can increment to 0.10 in a fork if revisions occur. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-21 18:22:35 -07:00
Jens Axboe	be7837e89d	NVMe: fail pci initialization if the device doesn't have any BARs The PCI init of NVMe doesn't check for valid bars before proceeding to map and use BAR 0. If the device is hosed (or firmware is), then we should catch this case and give up early. This fixes a: [ 1662.035778] WARNING: CPU: 0 PID: 4 at arch/x86/mm/ioremap.c:63 __ioremap_check_ram+0xa7/0xc0() and later badness on such a device. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-20 11:10:06 -07:00
Jens Axboe	2c30540b38	NVMe: add ->exit_hctx() hook If we do teardown and setup of the queue and block related parts of the driver, then we should clear nvmeq->hctx once we kill the hardware queue. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-20 11:10:03 -07:00
Jens Axboe	e32efbfc35	NVMe: make setup work for devices that don't do INTx The setup/probe part currently relies on INTx being there and working, that's not always the case. For devices that don't advertise INTx, enable a single MSIx vector early on and disable it again before we ask for our full range of queue vecs. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-19 12:46:17 -07:00
Al Viro	b583043e99	kill f_dentry uses Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:25 -05:00
Jens Axboe	b352172976	Merge branch 'master' into for-3.19/drivers	2014-11-18 19:43:46 -07:00
Jens Axboe	1397688953	NVMe: enable IO stats by default Before the blk-mq conversion they were on by default, we should not change behavior there. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-18 08:45:31 -07:00
Jens Axboe	6dcc0cf6cb	NVMe: nvme_submit_async_admin_req() must use atomic rq allocation We are called for async event notification issues, and the nvmeq lock is already held. If we fail the request allocation, we'll just retry next time. Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-18 08:21:18 -07:00
Jens Axboe	9d135bb8c2	NVMe: replace blk_put_request() with blk_mq_free_request() No point in using blk_put_request(), since we know we are blk-mq. This only makes sense in core code where we could be dealing with either legacy or blk-mq drivers. Additionally, use blk_mq_free_hctx_request() for the request completion fast path, where we already know the mapping from request to hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-17 10:43:42 -07:00
Weijie Yang	c406515239	zram: avoid kunmap_atomic() of a NULL pointer zram could kunmap_atomic() a NULL pointer in a rare situation: a zram page becomes a full-zeroed page after a partial write io. The current code doesn't handle this case and performs kunmap_atomic() on a NULL pointer, which panics the kernel. This patch fixes this issue. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang.kh@gmail.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-11-13 16:17:05 -08:00
Philipp Reisner	e805b983d3	drbd: Remove an useless copy of kernel_setsockopt() Old backward-compat cruft Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:43 -07:00
Philipp Reisner	9581f97a68	drbd: Fix state change in case of connection timeout A connection timeout affects all volumes of a resource! Under the following conditions: A resource with multiple volumes AND ko-count >=1 AND a write request triggers the timeout (ko-count * timeout) DRBD's internal state gets confused. That in turn may lead to very miss leading follow up failures. E.g. "BUG: scheduling while atomic" CC: stable@kernel.org # v3.17 Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:41 -07:00
Lars Ellenberg	3b9d35d744	drbd: merge_bvec_fn: properly remap bvm->bi_bdev This was not noticed for many years. Affects operation if md raid is used a backing device for DRBD. CC: stable@kernel.org # v3.2+ Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:39 -07:00
Lars Ellenberg	ff8bd88b73	drbd: fix resync throttling initialization If for some reason DRBD resync was the only activity on a backend device, drbd_rs_c_min_rate_throttle() would mistakenly decide that it is still initialization time, and keep throttling the resync. This patch explicitly initializes ->rs_last_events to the current backend event counters, and drops the rs_last_events == 0 from the throttle condition. Reported-by: Mikhail Sugakov <msugakov@amazon.de> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:37 -07:00
Philipp Reisner	a88215312c	drbd: fix race between role change and handshake Symptoms: If DRBD was "cleanly shut down" (all in sync, both Secondary before disconnect, identical data generation uuids), and then one side was promoted during the next connection handshake, the role change could confuse the handshake. The Primary would get stuck in WFBitmapS, the Secondary would log unexpected cstate (Connected) in receive_bitmap and get stuck in WFBitmapT. Fix: The test in is_valid_soft_transition wrong. It works because the not allowed actions (promote/attach) do not touch the cstate. The previous condition failed to demand a cstate change in one clause. In order to avoid deadlocks give up the state_mutex while waiting for the transient state to go away. Conflicts: drbd/drbd_state.c drbd/drbd_state.h drbd/drbd_wrappers.h Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:35 -07:00
Andreas Gruenbacher	f221f4bcc5	drbd: Only use drbd_msg_put_info() in drbd_nl.c Avoid generic netlink calls in other parts of the code base. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:33 -07:00
Andreas Gruenbacher	179e20b8df	drbd: Minor cleanups . Update comments . drbd_set_{in,out_of}_sync(): Remove unused parameters . Move common code into adm_del_resource() . Redefine ERR_MINOR_EXISTS -> ERR_MINOR_OR_VOLUME_EXISTS Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:30 -07:00
kbuild test robot	a64e6bb4db	NVMe: __nvme_submit_admin_cmd() can be static drivers/block/nvme-core.c:865:5: sparse: symbol '__nvme_submit_admin_cmd' was not declared. Should it be static? Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 09:27:01 -07:00
Dan Carpenter	9f173b3384	NVMe: blk_mq_alloc_request() returns error pointers We recently converted this to blk_mq but the error checks have to be updated to check for IS_ERR() instead of NULL. Fixes: `a4aea5623d` ('NVMe: Convert to blk-mq') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-05 13:47:03 -07:00
Matias Bjørling	a4aea5623d	NVMe: Convert to blk-mq This converts the NVMe driver to a blk-mq request-based driver. The NVMe driver is currently bio-based and implements queue logic within itself. By using blk-mq, a lot of these responsibilities can be moved and simplified. The patch is divided into the following blocks: * Per-command data and cmdid have been moved into the struct request field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id maintenance are now handled by blk-mq through the rq->tag field. * The logic for splitting bio's has been moved into the blk-mq layer. The driver instead notifies the block layer about limited gap support in SG lists. * blk-mq handles timeouts and is reimplemented within nvme_timeout(). This both includes abort handling and command cancelation. * Assignment of nvme queues to CPUs are replaced with the blk-mq version. The current blk-mq strategy is to assign the number of mapped queues and CPUs to provide synergy, while the nvme driver assign as many nvme hw queues as possible. This can be implemented in blk-mq if needed. * NVMe queues are merged with the tags structure of blk-mq. * blk-mq takes care of setup/teardown of nvme queues and guards invalid accesses. Therefore, RCU-usage for nvme queues can be removed. * IO tracing and accounting are handled by blk-mq and therefore removed. * Queue suspension logic is replaced with the logic from the block layer. Contributions in this patch from: Sam Bradshaw <sbradshaw@micron.com> Jens Axboe <axboe@fb.com> Keith Busch <keith.busch@intel.com> Robert Nelson <rlnelson@google.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Jens Axboe <axboe@fb.com> Updated for new ->queue_rq() prototype. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:18:52 -07:00
Keith Busch	9dbbfab7d5	NVMe: Do not over allocate for discard requests Discard requests are often for very large ranges. The discard size is not representative of the data transfer size so we don't need to allocate for such a large prp list. This patch requests allocating only enough for the memory needed for the data transfer and saves a little over 8k of memory per max discard request. Signed-off-by: Keith Busch <keith.busch@intel.com> Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:18:37 -07:00
Keith Busch	9e60352cf8	NVMe: Do not open disks that are being deleted It is possible the block layer will request to open a block device after the driver deleted it. Subsequent releases will cause a double free, or the disk's private_data is pointing to freed memory. This patch protects the driver's freed disks from being opened and accessed: the nvme namespaces are freed only when the device's refcount is 0, so at that moment there were no active openers and no more should be allowed, and it is safe to clear the disk's private_data that is about to be freed. Signed-off-by: Keith Busch <keith.busch@intel.com> Reported-by: Henry Chow <henry.chow@oracle.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:18:32 -07:00
Keith Busch	5940c8578f	NVMe: Clear QUEUE_FLAG_STACKABLE The nvme namespace request_queue's flags are initialized to QUEUE_FLAG_DEFAULT, which currently sets QUEUE_FLAG_STACKABLE. The device-mapper indicates this flag means the block driver is requset based, though this driver is bio-based and problems will occur if an nvme namespace is used with a request based dm device. This patch clears the stackable flag. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:18:10 -07:00
Keith Busch	387caa5a2b	NVMe: Fix device probe waiting on kthread If we ever do parallel device probing, we need to wake up all processes waiting for nvme kthread to start, not just one. This is currently serialized so the bug is not reachable today, but fixing this anyway in the hopes we implement parallel or asynchronous probe in the future. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:10 -07:00
Keith Busch	7963e52181	NVMe: Passthrough IOCTL for IO commands The NVME_IOCTL_SUBMIT_IO only works for IO commands with block data transfers and isn't usable for other NVMe commands like flush, data set management, or any sort of vendor unique command. The NVME_IOCTL_ADMIN_CMD, however, can easily be modified to accept arbitrary IO commands in addition to arbitrary admin commands without breaking backward compatibility. This patch just adds a new IOCTL to distinguish if the driver should submit the command on an IO or Admin queue. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:09 -07:00
Keith Busch	1b9dbf7fe0	NVMe: Add revalidate_disk callback This adds a callback to revalidate the disk and change its block size and capacity if needed. Before, a user would have to remove + rescan an entire device if they changed the logical block size using an NVMe Format or other vendor specific command; now they can just run something that issues the BLKRRPART IOCTL, like # hdparm -z /dev/nvmeXnY This can also be used in response to the 1.2 Spec's Namespace Attribute Change asynchronous event. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:09 -07:00
Keith Busch	7be50e93fb	NVMe: Fix nvmeq waitqueue entry initialization We need to update the nvme queue's wait_queue_t entry during each initialization since the nvme_thread may be ended and restarted when the device is reset. If a device reset occurs during a large amount of buffered IO, it would take a lot longer to complete the outstanding requests due to the 1 second polling instead of waking up as completions occur. Fixes: `b9afca3efb` Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:09 -07:00
Keith Busch	b4ff9c8ddb	NVMe: Translate NVMe status to errno This returns a more appropriate error for the "capacity exceeded" status. In case other NVMe statuses have a better errno, this patch adds a convience function to translate an NVMe status code to an errno for IO commands, defaulting to the current -EIO. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:09 -07:00
Keith Busch	695a4fe79f	NVMe: Fix SG_IO status values We've only been setting the sg_io_hdr status values on SCSI commands that require an nvme command to complete the translation. The fields in the struct are output parameters, so we have to set them, otherwise user space will see whatever was in memory from before. In the case of compat SG_IO, this would reveal kernel memory. This fixes the issue by initializing the sg_io_hdr with successful status. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:09 -07:00
Keith Busch	e179729a82	NVMe: Remove duplicate compat SG_IO code We can return -ENOIOCTLCMD and the ioctl will be handled by fs/compat_ioctl.c instead. This removes a lot of duplicate code in the nvme driver. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:09 -07:00
Keith Busch	a96d4f5c2d	NVMe: Reference count pci device If an nvme device is removed but user space has an open reference, the nvme driver would have been holding an invalid reference to its pci device. You may get a general protection fault on x86 h/w when the driver uses that reference in dma_map_sg(), as is done in nvme_map_user_pages() from the IOCTL interface. This patch fixes the fault by taking a reference on the pci device and holding it even after device removal until all opens on the nvme device are closed. Signed-off-by: Keith Busch <keith.busch@intel.com> Reported-by: Nilesh Choudhury <nilesh.choudhury@oracle.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:09 -07:00
Andreea-Cristina Bernat	062261be4e	nvme: Replace rcu_assign_pointer() with RCU_INIT_POINTER() The use of "rcu_assign_pointer()" is NULLing out the pointer. According to RCU_INIT_POINTER()'s block comment: "1. This use of RCU_INIT_POINTER() is NULLing out the pointer" it is better to use it instead of rcu_assign_pointer() because it has a smaller overhead. The following Coccinelle semantic patch was used: @@ @@ - rcu_assign_pointer + RCU_INIT_POINTER (..., NULL) Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:08 -07:00
Sam Bradshaw	5905535610	NVMe: Correctly handle IOCTL_SUBMIT_IO when cpus > online queues nvme_submit_io_cmd() uses smp_processor_id() to pick an IO queue index. This patch fixes the case where there are more cpus from which the ioctl call can originate than online queues, which can happen when a device supports or was allocated fewer interrupt vectors than exist cpu cores. Thanks to Keith Busch for the implementation suggestion. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:08 -07:00
Keith Busch	302c6727e5	NVMe: Fix filesystem sync deadlock on removal This changes the order of deleting the gendisks so it happens after the nvme IO queues are freed. If a device is removed while a filesystem has associated dirty data, the removal will wait on these to complete before proceeding from del_gendisk, which could have caused deadlock before. The implication of this is that an orderly removal of a responsive device won't necessarily wait for dirty data to be written, but we are not guaranteed the device is even going to respond at this point either. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:08 -07:00
Keith Busch	f435c2825b	NVMe: Call nvme_free_queue directly Rather than relying on call_rcu, this patch directly frees the nvme_queue's memory after ensuring no readers exist. Some arch specific dma_free_coherent implementations may not be called from a call_rcu's soft interrupt context, hence the change. Signed-off-by: Keith Busch <keith.busch@intel.com> Reported-by: Matthew Minter <matthew_minter@xyratex.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:08 -07:00
Dan McLeran	2484f40780	NVMe: Add shutdown timeout as module parameter. The current implementation hard-codes the shutdown timeout to 2 seconds. Some devices take longer than this to complete a normal shutdown. Changing the shutdown timeout to a module parameter with a default timeout of 5 seconds. Signed-off-by: Dan McLeran <daniel.mcleran@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:08 -07:00
Keith Busch	7c1b245038	NVMe: Skip orderly shutdown on failed devices Rather than skipping shutdown only for devices that have been removed, skip the orderly shutdown on failed devices to avoid the long timeout handling that inevitably happens when deleting queues on such a device. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:08 -07:00
Keith Busch	a67394790a	NVMe: Whitespace fixes Fixing tabs inadvertently converted to spaces. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:08 -07:00
Keith Busch	c81f49758a	NVMe: Use pci_stop_and_remove_bus_device_locked() Race conditions are theoretically possible between the NVMe PCI device removal and the generic PCI bus rescan and device removal that can be triggered via sysfs. To avoid those race conditions make the NVMe code use pci_stop_and_remove_bus_device_locked(). Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:07 -07:00
Keith Busch	badc34d415	NVMe: Handling devices incapable of I/O This is a minor refactor for handling devices that are incapable of IO. The driver previously used special error codes to know that IO queues are unavailable, but we have an online queue count now. This also fixes an issue where the driver successfully sets the queue count, but either is unable to allocate an IO queue or the device can't create one for some reason. If the driver can successfully enable the device and get responses to admin commands, the driver will bring up a character device for managment but not create block devices. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:07 -07:00
Dan McLeran	01079522f9	NVMe: Change nvme_enable_ctrl to set EN and manage CC thru ctrl_config. Change the behavior of nvme_enable_ctrl to set EN. Clear CC.SH for both nvme_enable_ctrl and nvme_disable_ctrl. Remove reading of the CC register and manage the state in dev->ctrl_config. Signed-off-by: Dan McLeran <daniel.mcleran@intel.com> [removed an unwanted write to CC] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:07 -07:00
Keith Busch	1d09062460	NVMe: Mismatched host/device page size support Adds support for devices with max page size smaller than the host's. In the case we encounter such a host/device combination, the driver will split a page into as many PRP entries as necessary for the device's page size capabilities. If the device's reported minimum page size is greater than the host's, the driver will not attempt to enable the device and return an error instead. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:07 -07:00
Keith Busch	6fccf9383b	NVMe: Async event request Submits NVMe asynchronous event requests, one event up to the controller maximum or number of possible different event types (8), whichever is smaller. Events successfully returned by the controller are logged. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:07 -07:00
Joe Perches	4d51abf9bc	block: Use dma_zalloc_coherent Use the zeroing function instead of dma_alloc_coherent & memset(,0,) Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 13:17:07 -07:00
Greg Kroah-Hartman	a8a93c6f99	Merge branch 'platform/remove_owner' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux into driver-core-next Remove all .owner fields from platform drivers	2014-11-03 19:53:56 -08:00
Linus Torvalds	ce1928da84	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "There is a GFP flag fix from Mike Christie, an error code fix from Jan, and fixes for two unnecessary allocations (kmalloc and workqueue) from Ilya. All are well tested. Ilya has one other fix on the way but it didn't get tested in time" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: eliminate unnecessary allocation in process_one_ticket() rbd: Fix error recovery in rbd_obj_read_sync() libceph: use memalloc flags for net IO rbd: use a single workqueue for all devices	2014-11-03 15:04:26 -08:00
Linus Torvalds	53429290a0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc Pull sparc update from David Miller: "Two changes: 1) It makes no sense to execute a VTOC partition table request in the Sun virtual block device driver and fail to load if it doesn't succeed because a) we don't use the result at all and b) it won't succeed if there is an EFI partition on the disk, for example. We read the partition table via the normal means in the block layer anyways, so this is really completely useless, so just remove it. From Dwight Engen. 2) Hook up new bpf system call" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sunvdc: don't call VD_OP_GET_VTOC sparc: Hook up bpf system call.	2014-10-31 15:00:48 -07:00
Dwight Engen	85b0c6e62c	sunvdc: don't call VD_OP_GET_VTOC The VD_OP_GET_VTOC operation will succeed only if the vdisk backend has a VTOC label, otherwise it will fail. In particular, it will return error 48 (ENOTSUP) if the disk has an EFI label. VTOC disk labels are already handled by directly reading the disk in block/partitions/sun.c (enabled by CONFIG_SUN_PARTITION which defaults to y on SPARC). Since port->label is unused in the driver, remove the call and the field. Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-31 15:49:45 -04:00
Jan Kara	a8d4205623	rbd: Fix error recovery in rbd_obj_read_sync() When we fail to allocate page vector in rbd_obj_read_sync() we just basically ignore the problem and continue which will result in an oops later. Fix the problem by returning proper error. CC: Yehuda Sadeh <yehuda@inktank.com> CC: Sage Weil <sage@inktank.com> CC: ceph-devel@vger.kernel.org CC: stable@vger.kernel.org Coverity-id: 1226882 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2014-10-30 13:11:51 +03:00
Ilya Dryomov	f5ee37bd31	rbd: use a single workqueue for all devices Using one queue per device doesn't make much sense given that our workfn processes "devices" and not "requests". Switch to a single workqueue for all devices. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-30 13:11:25 +03:00
Linus Torvalds	a7ca10f263	Merge branch 'akpm' (incoming from Andrew Morton) Merge misc fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) mm/balloon_compaction: fix deflation when compaction is disabled sh: fix sh770x SCIF memory regions zram: avoid NULL pointer access in concurrent situation mm/slab_common: don't check for duplicate cache names ocfs2: fix d_splice_alias() return code checking mm: rmap: split out page_remove_file_rmap() mm: memcontrol: fix missed end-writeback page accounting mm: page-writeback: inline account_page_dirtied() into single caller lib/bitmap.c: fix undefined shift in __bitmap_shift_{left\|right}() drivers/rtc/rtc-bq32k.c: fix register value memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock kernel/kmod: fix use-after-free of the sub_info structure drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc mm, thp: fix collapsing of hugepages on madvise drivers: of: add return value to of_reserved_mem_device_init() mm: free compound page with correct order gcov: add ARM64 to GCOV_PROFILE_ALL fsnotify: next_i is freed during fsnotify_unmount_inodes. mm/compaction.c: avoid premature range skip in isolate_migratepages_range ...	2014-10-29 16:38:48 -07:00
Weijie Yang	5a99e95b8d	zram: avoid NULL pointer access in concurrent situation There is a rare NULL pointer bug in mem_used_total_show() and mem_used_max_store() in concurrent situation, like this: zram is not initialized, process A is a mem_used_total reader which runs periodically, while process B try to init zram. process A process B access meta, get a NULL value init zram, done init_done() is true access meta->mem_pool, get a NULL pointer BUG This patch fixes this issue. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-29 16:33:15 -07:00
Jens Axboe	74c450521d	blk-mq: add a 'list' parameter to ->queue_rq() Since we have the notion of a 'last' request in a chain, we can use this to have the hardware optimize the issuing of requests. Add a list_head parameter to queue_rq that the driver can use to temporarily store hw commands for issue when 'last' is true. If we are doing a chain of requests, pass in a NULL list for the first request to force issue of that immediately, then batch the remainder for deferred issue until the last request has been sent. Instead of adding yet another argument to the hot ->queue_rq path, encapsulate the passed arguments in a blk_mq_queue_data structure. This is passed as a constant, and has been tested as faster than passing 4 (or even 3) args through ->queue_rq. Update drivers for the new ->queue_rq() prototype. There are no functional changes in this patch for drivers - if they don't use the passed in list, then they will just queue requests individually like before. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-29 11:14:52 -06:00
Jan Kara	31f9690e6e	null_blk: Cleanup error recovery in null_add_dev() When creation of queues fails in init_driver_queues(), we free the queues. But null_add_dev() doesn't test for this failure and continues with the setup leading to strange consequences, likely oops. Fix the problem by testing whether init_driver_queues() failed and do proper error cleanup. Coverity-id: 1148005 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-22 07:59:25 -06:00
Christoph Hellwig	34b48db66e	block: remove artifical max_hw_sectors cap Set max_sectors to the value the drivers provides as hardware limit by default. Linux had proper I/O throttling for a long time and doesn't rely on a artifically small maximum I/O size anymore. By not limiting the I/O size by default we remove an annoying tuning step required for most Linux installation. Note that both the user, and if absolutely required the driver can still impose a limit for FS requests below max_hw_sectors_kb. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-21 14:02:54 -06:00
Wolfram Sang	8294adb90b	block: drop owner assignment from platform_drivers A platform_driver does not need to set an owner, it will be populated by the driver core. Signed-off-by: Wolfram Sang <wsa@the-dreams.de>	2014-10-20 16:20:18 +02:00
Linus Torvalds	e75437fb93	Merge branch 'for-3.18/drivers' of git://git.kernel.dk/linux-block Pull block layer driver update from Jens Axboe: "This is the block driver pull request for 3.18. Not a lot in there this round, and nothing earth shattering. - A round of drbd fixes from the linbit team, and an improvement in asender performance. - Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and hd from Michael Opdenacker. - Disable entropy collection from flash devices by default, from Mike Snitzer. - A small collection of xen blkfront/back fixes from Roger Pau Monné and Vitaly Kuznetsov" * 'for-3.18/drivers' of git://git.kernel.dk/linux-block: block: disable entropy contributions for nonrot devices xen, blkfront: factor out flush-related checks from do_blkif_request() xen-blkback: fix leak on grant map error path xen/blkback: unmap all persistent grants when frontend gets disconnected rsxx: Remove deprecated IRQF_DISABLED block: hd: remove deprecated IRQF_DISABLED drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks drbd: compute the end before rb_insert_augmented() drbd: Add missing newline in resync progress display in /proc/drbd drbd: reduce lock contention in drbd_worker drbd: Improve asender performance drbd: Get rid of the WORK_PENDING macro drbd: Get rid of the __no_warn and __cond_lock macros drbd: Avoid inconsistent locking warning drbd: Remove superfluous newline from "resync_extents" debugfs entry. drbd: Use consistent names for all the bi_end_io callbacks drbd: Use better variable names	2014-10-18 12:12:45 -07:00
Linus Torvalds	d3dc366bba	Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups. - blk-mq timeout updates and fixes from Christoph. - Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used. - blk integrity updates and fixes from Martin and Gu Zheng. - Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code should scale better for flush intensive workloads on blk-mq. - Improve the error printing, from Rob Elliott. - Backing device improvements and cleanups from Tejun. - Fixup of a misplaced rq_complete() tracepoint from Hannes. - Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence. - Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory. - Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me. - Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups). We now just use a single queue and limited depth for that" * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits) block: Remove REQ_KERNEL blk-mq: allocate cpumask on the home node bio-integrity: remove the needless fail handle of bip_slab creating block: include func name in __get_request prints block: make blk_update_request print prefix match ratelimited prefix blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio block: fix alignment_offset math that assumes io_min is a power-of-2 blk-mq: Make bt_clear_tag() easier to read blk-mq: fix potential hang if rolling wakeup depth is too high block: add bioset_create_nobvec() block: use bio_clone_fast() in blk_rq_prep_clone() block: misplaced rq_complete tracepoint sd: Honor block layer integrity handling flags block: Replace strnicmp with strncasecmp block: Add T10 Protection Information functions block: Don't merge requests if integrity flags differ block: Integrity checksum flag block: Relocate bio integrity flags block: Add a disk flag to block integrity profile block: Add prefix to block integrity profile flags ...	2014-10-18 11:53:51 -07:00
Linus Torvalds	0e6e58f941	One cc: stable commit, the rest are a series of minor cleanups which have been sitting in MST's tree during my vacation. I changed a function name and made one trivial change, then they spent two days in linux-next. Thanks, Rusty. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUQFBQAAoJENkgDmzRrbjxJRIP/1yCQRElQewxURSmJelyqCdU 0mHYB0R9Mf3tfre1xnofqs2lWeSMc/4ptKHsVR6pupoztSwnz7HsLHfEFvFJh4mj KsaqYElxkNxTcfyHwLjyJS0/J6tG1tYypXGiimTBS0bvFHL3XZdimVgJ6WvX+gO7 YSaDEX8/EqCERafslS5+gKJlz3drDOnCZCe9y4BDSmsvl2k7bkpSxIn8vsR6jIC0 c5JpUy6QVF+3XA/J932M7yRs+xpqxNoUWiyY3ar9o3CtQAaQB0ZAetSxY6hTfvVc GlNFzCifdsaQwsl2SVsE2h6tWaRhtMtcGWQuhHThIPyIf8XxhYyBRY2FLo70LMz1 eqtwy6F/Bg/nzUsdee4PZBMeoKHlAEL12RpsEKgfUoLzj16Aqa8ll+Agbglbkw8G f3d2FwzKAlpY5NwHETC1wYy52PJ3efqksRWuhokmYpxNSbHJS/lsiJOE7272/4Qr MtXuvRmo22tf34XFd5y7zqWjgZ58eeFOqQWi/K+6ZgpqVOvikjrXXKEuiVdjO0ZD kTVR/sQKiR+79rzENk80XBhWaMveECNXF1TiZ/3MmURkmEOBRQMxRQ20BX3exvna AJ/WVA5DcfXZc1yyqknE1NLGrvSBMJENH13x2QPwrqNWAryOOKuF1VKKIwWlDw5j vtx5nXiJa8YYdxI2TJCN =JK6x -----END PGP SIGNATURE----- Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull virtio updates from Rusty Russell: "One cc: stable commit, the rest are a series of minor cleanups which have been sitting in MST's tree during my vacation. I changed a function name and made one trivial change, then they spent two days in linux-next" * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (25 commits) virtio-rng: refactor probe error handling virtio_scsi: drop scan callback virtio_balloon: enable VQs early on restore virtio_scsi: fix race on device removal virito_scsi: use freezable WQ for events virtio_net: enable VQs early on restore virtio_console: enable VQs early on restore virtio_scsi: enable VQs early on restore virtio_blk: enable VQs early on restore virtio_scsi: move kick event out from virtscsi_init virtio_net: fix use after free on allocation failure 9p/trans_virtio: enable VQs early virtio_console: enable VQs early virtio_blk: enable VQs early virtio_net: enable VQs early virtio: add API to enable VQs early virtio_net: minor cleanup virtio-net: drop config_mutex virtio_net: drop config_enable virtio-blk: drop config_mutex ...	2014-10-18 10:25:09 -07:00
Linus Torvalds	6b04908166	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is the long-awaited discard support for RBD (Guangliang Zhao, Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's (Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and performance and debugging improvements (Yan, Zheng, John Spray), and a smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) ceph: fix divide-by-zero in __validate_layout() rbd: rbd workqueues need a resque worker libceph: ceph-msgr workqueue needs a resque worker ceph: fix bool assignments libceph: separate multiple ops with commas in debugfs output libceph: sync osd op definitions in rados.h libceph: remove redundant declaration ceph: additional debugfs output ceph: export ceph_session_state_name function ceph: include the initial ACL in create/mkdir/mknod MDS requests ceph: use pagelist to present MDS request data libceph: reference counting pagelist ceph: fix llistxattr on symlink ceph: send client metadata to MDS ceph: remove redundant code for max file size verification ceph: remove redundant io_iter_advance() ceph: move ceph_find_inode() outside the s_mutex ceph: request xattrs if xattr_version is zero rbd: set the remaining discard properties to enable support rbd: use helpers to handle discard for layered images correctly ...	2014-10-15 06:46:01 +02:00
Michael S. Tsirkin	6d62c37f19	virtio_blk: enable VQs early on restore virtio spec requires drivers to set DRIVER_OK before using VQs. This is set automatically after restore returns, virtio block violated this rule on restore by restarting queues, which might in theory cause the VQ to be used directly within restore. To fix, call virtio_device_ready before using starting queues. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2014-10-15 10:25:07 +10:30
Michael S. Tsirkin	7a11370e5e	virtio_blk: enable VQs early virtio spec requires drivers to set DRIVER_OK before using VQs. This is set automatically after probe returns, virtio block violated this rule by calling add_disk, which causes the VQ to be used directly within probe. To fix, call virtio_device_ready before using VQs. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2014-10-15 10:25:02 +10:30
Michael S. Tsirkin	1f54b0c055	virtio-blk: drop config_mutex config_mutex served two purposes: prevent multiple concurrent config change handlers, and synchronize access to config_enable flag. Since commit `dbf2576e37` workqueue: make all workqueues non-reentrant all workqueues are non-reentrant, and config_enable is now gone. Get rid of the unnecessary lock. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2014-10-15 10:24:57 +10:30
Michael S. Tsirkin	cc74f71934	virtio_blk: drop config_enable Now that virtio core ensures config changes don't arrive during probing, drop config_enable flag in virtio blk. On removal, flush is now sufficient to guarantee that no change work is queued. This help simplify the driver, and will allow setting DRIVER_OK earlier without losing config change notifications. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2014-10-15 10:24:56 +10:30
Ilya Dryomov	792c3a9149	rbd: rbd workqueues need a resque worker Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups under memory pressure - we sit on the memory reclaim path. Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Tested-by: Micha Krause <micha@krausam.de> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:05 -07:00
Josh Durgin	b76f82398c	rbd: set the remaining discard properties to enable support max_discard_sectors must be set for the queue to support discard. Operations implementing discard for rbd zero data, so report that. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:37 +04:00
Josh Durgin	d3246fb0da	rbd: use helpers to handle discard for layered images correctly Only allocate two osd ops for discard requests, since the preallocation hint is only added for regular writes. Use rbd_img_obj_request_fill() to recreate the original write or discard osd operations, isolating that logic to one place, and change the assert in rbd_osd_req_create_copyup() to accept discard requests as well. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:36 +04:00
Josh Durgin	3b434a2aff	rbd: extract a method for adding object operations rbd_img_request_fill() creates a ceph_osd_request and has logic for adding the appropriate osd ops to it based on the request type and image properties. For layered images, the original rbd_obj_request is resent with a copyup operation in front, using a new ceph_osd_request. The logic for adding the original operations should be the same as when first sending them, so move it to a helper function. op_type only needs to be checked once, so create a helper for that as well and call it outside the loop in rbd_img_request_fill(). Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:35 +04:00
Josh Durgin	1c220881e3	rbd: make discard trigger copy-on-write Discard requests are a form of write, so they should go through the same process as plain write requests and trigger copy-on-write for layered images. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:34 +04:00
Josh Durgin	d0265de7c3	rbd: tolerate -ENOENT for discard operations Discard may try to delete an object from a non-layered image that does not exist. If this occurs, the image already has no data in that range, so change the result to success. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:33 +04:00
Josh Durgin	bef95455a4	rbd: fix snapshot context reference count for discards Discards take a reference to the snapshot context of an image when they are created. This reference needs to be cleaned up when the request is done just as it is for regular writes. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:32 +04:00
Josh Durgin	3c5df89367	rbd: read image size for discard check safely In rbd_img_request_fill() the image size is only checked to determine whether we can truncate an object instead of zeroing it for discard requests. Take rbd_dev->header_rwsem while reading the image size, and move this read into the discard check, so that non-discard ops don't need to take the semaphore in this function. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:32 +04:00
Guangliang Zhao	90e98c5229	rbd: initial discard bits from Guangliang Zhao This patch add the discard support for rbd driver. There are three types operation in the driver: 1. The objects would be removed if they completely contained within the discard range. 2. The objects would be truncated if they partly contained within the discard range, and align with their boundary. 3. Others would be zeroed. A discard request from blkdev_issue_discard() is defined which REQ_WRITE and REQ_DISCARD both marked and no data, so we must check the REQ_DISCARD first when getting the request type. This resolve: http://tracker.ceph.com/issues/190 [ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up commits by Josh Durgin for refinements and fixes which weren't folded in to preserve authorship. ] Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:31 +04:00
Guangliang Zhao	6d2940c881	rbd: extend the operation type It could only handle the read and write operations now, extend it for the coming discard support. Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:30 +04:00
Guangliang Zhao	c622d22615	rbd: skip the copyup when an entire object writing It need to copyup the parent's content when layered writing, but an entire object write would overwrite it, so skip it. Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:29 +04:00
Ilya Dryomov	70d045f660	rbd: add img_obj_request_simple() helper To clarify the conditions and make it easier to add new ones. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-10-14 21:03:28 +04:00
Josh Durgin	4e752f0ab0	rbd: access snapshot context and mapping size safely These fields may both change while the image is mapped if a snapshot is created or deleted or the image is resized. They are guarded by rbd_dev->header_rwsem, so hold that while reading them, and store a local copy to refer to outside of the critical section. The local copy will stay consistent since the snapshot context is reference counted, and the mapping size is just a u64. This prevents torn loads from giving us inconsistent values. Move reading header.snapc into the caller of rbd_img_request_create() so that we only need to take the semaphore once. The read-only caller, rbd_parent_request_create() can just pass NULL for snapc, since the snapshot context is only relevant for writes. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:27 +04:00
Ilya Dryomov	7dd440c9e0	rbd: do not return -ERANGE on auth failures Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL sink, thus propagating rbd::get_id cls method errors. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:26 +04:00
Linus Torvalds	77c688ac87	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The big thing in this pile is Eric's unmount-on-rmdir series; we finally have everything we need for that. The final piece of prereqs is delayed mntput() - now filesystem shutdown always happens on shallow stack. Other than that, we have several new primitives for iov_iter (Matt Wilcox, culled from his XIP-related series) pushing the conversion to ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c cleanups and fixes (including the external name refcounting, which gives consistent behaviour of d_move() wrt procfs symlinks for long and short names alike) and assorted cleanups and fixes all over the place. This is just the first pile; there's a lot of stuff from various people that ought to go in this window. Starting with unionmount/overlayfs mess... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits) fs/file_table.c: Update alloc_file() comment vfs: Deduplicate code shared by xattr system calls operating on paths reiserfs: remove pointless forward declaration of struct nameidata don't need that forward declaration of struct nameidata in dcache.h anymore take dname_external() into fs/dcache.c let path_init() failures treated the same way as subsequent link_path_walk() fix misuses of f_count() in ppp and netlink ncpfs: use list_for_each_entry() for d_subdirs walk vfs: move getname() from callers to do_mount() gfs2_atomic_open(): skip lookups on hashed dentry [infiniband] remove pointless assignments gadgetfs: saner API for gadgetfs_create_file() f_fs: saner API for ffs_sb_create_file() jfs: don't hash direct inode [s390] remove pointless assignment of ->f_op in vmlogrdr ->open() ecryptfs: ->f_op is never NULL android: ->f_op is never NULL nouveau: __iomem misannotations missing annotation in fs/file.c fs: namespace: suppress 'may be used uninitialized' warnings ...	2014-10-13 11:28:42 +02:00
Linus Torvalds	052db7ec86	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc Pull sparc updates from David Miller: 1) Move to 4-level page tables on sparc64 and support up to 53-bits of physical addressing. Kernel static image BSS size reduced by several megabytes. 2) M6/M7 cpu support, from Allan Pais. 3) Move to sparse IRQs, handle hypervisor TLB call errors more gracefully, and add T5 perf_event support. From Bob Picco. 4) Recognize cdroms and compute geometry from capacity in virtual disk driver, also from Allan Pais. 5) Fix memset() return value on sparc32, from Andreas Larsson. 6) Respect gfp flags in dma_alloc_coherent on sparc32, from Daniel Hellstrom. 7) Fix handling of compound pages in virtual disk driver, from Dwight Engen. 8) Fix lockdep warnings in LDC layer by moving IRQ requesting to ldc_alloc() from ldc_bind(). 9) Increase boot string length to 1024 bytes, from Dave Kleikamp. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: (31 commits) sparc64: Fix lockdep warnings on reboot on Ultra-5 sparc64: Increase size of boot string to 1024 bytes sparc64: Kill unnecessary tables and increase MAX_BANKS. sparc64: sparse irq sparc64: Adjust vmalloc region size based upon available virtual address bits. sparc64: Increase MAX_PHYS_ADDRESS_BITS to 53. sparc64: Use kernel page tables for vmemmap. sparc64: Fix physical memory management regressions with large max_phys_bits. sparc64: Adjust KTSB assembler to support larger physical addresses. sparc64: Define VA hole at run time, rather than at compile time. sparc64: Switch to 4-level page tables. sparc64: Fix reversed start/end in flush_tlb_kernel_range() sparc64: Add vio_set_intr() to enable/disable Rx interrupts vio: fix reuse of vio_dring slot sunvdc: limit each sg segment to a page sunvdc: compute vdisk geometry from capacity sunvdc: add cdrom and v1.1 protocol support sparc: VIO protocol version 1.6 sparc64: Fix hibernation code refrence to PAGE_OFFSET. sparc64: Move request_irq() from ldc_bind() to ldc_alloc() ...	2014-10-11 20:36:34 -04:00
Linus Torvalds	81ae31d782	xen: features and fixes for 3.18-rc0 - Add pvscsi frontend and backend drivers. - Remove _PAGE_IOMAP PTE flag, freeing it for alternate uses. - Try and keep memory contiguous during PV memory setup (reduces SWIOTLB usage). - Allow front/back drivers to use threaded irqs. - Support large initrds in PV guests. - Fix PVH guests in preparation for Xen 4.5 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQEcBAABAgAGBQJUNonmAAoJEFxbo/MsZsTRHAQH/inCjpCT+pkvTB0YAVfVvgMI gUogT8G+iB2MuCNpMffGIt8TAVXwcVtnOLH9ABH3IBVehzgipIbIiVEM9YhjrYvU 1rgIKBpmZqSpjDHoIHpdHeCH67cVnRzA/PyoxZWLxPNmQ0t6bNf9yeAcCXK9PfUc 7EAblUDmPGSx9x/EUnOKNNaZSEiUJZHDBXbMBLllk1+5H1vfKnpFCRGMG0IrfI44 KVP2NX9Gfa05edMZYtH887FYyjFe2KNV6LJvE7+w7h2Dy0yIzf7y86t0l4n8gETb plvEUJ/lu9RYzTiZY/RxgBFYVTV59EqT45brSUtoe2Jcp8GSwiHslTHdfyFBwSo= =gw4d -----END PGP SIGNATURE----- Merge tag 'stable/for-linus-3.18-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull Xen updates from David Vrabel: "Features and fixes: - Add pvscsi frontend and backend drivers. - Remove _PAGE_IOMAP PTE flag, freeing it for alternate uses. - Try and keep memory contiguous during PV memory setup (reduces SWIOTLB usage). - Allow front/back drivers to use threaded irqs. - Support large initrds in PV guests. - Fix PVH guests in preparation for Xen 4.5" * tag 'stable/for-linus-3.18-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (22 commits) xen: remove DEFINE_XENBUS_DRIVER() macro xen/xenbus: Remove BUG_ON() when error string trucated xen/xenbus: Correct the comments for xenbus_grant_ring() x86/xen: Set EFER.NX and EFER.SCE in PVH guests xen: eliminate scalability issues from initrd handling xen: sync some headers with xen tree xen: make pvscsi frontend dependant on xenbus frontend arm{,64}/xen: Remove "EXPERIMENTAL" in the description of the Xen options xen-scsifront: don't deadlock if the ring becomes full x86: remove the Xen-specific _PAGE_IOMAP PTE flag x86/xen: do not use _PAGE_IOMAP PTE flag for I/O mappings x86: skip check for spurious faults for non-present faults xen/efi: Directly include needed headers xen-scsiback: clean up a type issue in scsiback_make_tpg() xen-scsifront: use GFP_ATOMIC under spin_lock MAINTAINERS: Add xen pvscsi maintainer xen-scsiback: Add Xen PV SCSI backend driver xen-scsifront: Add Xen PV SCSI frontend driver xen: Add Xen pvSCSI protocol description xen/events: support threaded irqs for interdomain event channels ...	2014-10-11 20:29:01 -04:00
Sergey Senozhatsky	015254daf1	zram: use notify_free to account all free notifications `notify_free' device attribute accounts the number of slot free notifications and internally represents the number of zram_free_page() calls. Slot free notifications are sent only when device is used as a swap device, hence `notify_free' is used only for swap devices. Since `f4659d8e62` (zram: support REQ_DISCARD) ZRAM handles yet another one free notification (also via zram_free_page() call) -- REQ_DISCARD requests, which are sent by a filesystem, whenever some data blocks are discarded. However, there is no way to know the number of notifications in the latter case. Use `notify_free' to account the number of pages freed by zram_bio_discard() and zram_slot_free_notify(). Depending on usage scenario `notify_free' represents: a) the number of pages freed because of slot free notifications, which is equal to the number of swap_slot_free_notify() calls, so there is no behaviour change b) the number of pages freed because of REQ_DISCARD notifications Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-09 22:26:03 -04:00
Minchan Kim	461a8eee6a	zram: report maximum used memory Normally, zram user could get maximum memory usage zram consumed via polling mem_used_total with sysfs in userspace. But it has a critical problem because user can miss peak memory usage during update inverval of polling. For avoiding that, user should poll it with shorter interval(ie, 0.0000000001s) with mlocking to avoid page fault delay when memory pressure is heavy. It would be troublesome. This patch adds new knob "mem_used_max" so user could see the maximum memory usage easily via reading the knob and reset it via "echo 0 > /sys/block/zram0/mem_used_max". Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: <juno.choi@lge.com> Cc: <seungho1.park@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjennings@variantweb.net> Reviewed-by: David Horner <ds2horner@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-09 22:26:02 -04:00
Minchan Kim	9ada9da957	zram: zram memory size limitation Since zram has no control feature to limit memory usage, it makes hard to manage system memrory. This patch adds new knob "mem_limit" via sysfs to set up the a limit so that zram could fail allocation once it reaches the limit. In addition, user could change the limit in runtime so that he could manage the memory more dynamically. Initial state is no limit so it doesn't break old behavior. [akpm@linux-foundation.org: fix typo, per Sergey] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: <juno.choi@lge.com> Cc: <seungho1.park@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: David Horner <ds2horner@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-09 22:26:02 -04:00
Minchan Kim	722cdc1723	zsmalloc: change return value unit of zs_get_total_size_bytes zs_get_total_size_bytes returns a amount of memory zsmalloc consumed with byte unit but zsmalloc operates page unit rather than byte unit so let's change the API so benefit we could get is that reduce unnecessary overhead (ie, change page unit with byte unit) in zsmalloc. Since return type is pages, "zs_get_total_pages" is better than "zs_get_total_size_bytes". Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: <juno.choi@lge.com> Cc: <seungho1.park@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: David Horner <ds2horner@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-09 22:26:02 -04:00
Al Viro	8e3fb059ae	rsxx debugfs inanity check with the author of that horror... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-09 02:39:07 -04:00
Linus Torvalds	28596c9722	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull "trivial tree" updates from Jiri Kosina: "Usual pile from trivial tree everyone is so eagerly waiting for" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Remove MN10300_PROC_MN2WS0038 mei: fix comments treewide: Fix typos in Kconfig kprobes: update jprobe_example.c for do_fork() change Documentation: change "&" to "and" in Documentation/applying-patches.txt Documentation: remove obsolete pcmcia-cs from Changes Documentation: update links in Changes Documentation: Docbook: Fix generated DocBook/kernel-api.xml score: Remove GENERIC_HAS_IOMAP gpio: fix 'CONFIG_GPIO_IRQCHIP' comments tty: doc: Fix grammar in serial/tty dma-debug: modify check_for_stack output treewide: fix errors in printk genirq: fix reference in devm_request_threaded_irq comment treewide: fix synchronize_rcu() in comments checkstack.pl: port to AArch64 doc: queue-sysfs: minor fixes init/do_mounts: better syntax description MIPS: fix comment spelling powerpc/simpleboot: fix comment ...	2014-10-07 21:16:26 -04:00
David Vrabel	95afae4814	xen: remove DEFINE_XENBUS_DRIVER() macro The DEFINE_XENBUS_DRIVER() macro looks a bit weird and causes sparse errors. Replace the uses with standard structure definitions instead. This is similar to pci and usb device registration. Signed-off-by: David Vrabel <david.vrabel@citrix.com>	2014-10-06 10:27:57 +01:00
Mike Snitzer	b277da0a8a	block: disable entropy contributions for nonrot devices Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set QUEUE_FLAG_NONROT. Historically, all block devices have automatically made entropy contributions. But as previously stated in commit `e2e1a148` ("block: add sysfs knob for turning off disk entropy contributions"): - On SSD disks, the completion times aren't as random as they are for rotational drives. So it's questionable whether they should contribute to the random pool in the first place. - Calling add_disk_randomness() has a lot of overhead. There are more reliable sources for randomness than non-rotational block devices. From a security perspective it is better to err on the side of caution than to allow entropy contributions from unreliable "random" sources. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-04 10:55:32 -06:00
Jens Axboe	7b7b7f7e02	Merge branch 'stable/for-jens-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.18/drivers Konrad writes: This pull has two fixes and one cleanup. Nothing earthshattering.	2014-10-01 14:37:25 -06:00
Arianna Avanzini	0f1ca65ee5	xen, blkfront: factor out flush-related checks from do_blkif_request() This commit factors out some checks related to the request insertion path, which can be done in an function instead of by itself. Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-10-01 16:32:39 -04:00
Roger Pau Monné	61cecca865	xen-blkback: fix leak on grant map error path Fix leaking a page when a grant mapping has failed. CC: stable@vger.kernel.org Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-and-Tested-by: Tao Chen <boby.chen@huawei.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-10-01 16:32:31 -04:00
Vitaly Kuznetsov	12ea729645	xen/blkback: unmap all persistent grants when frontend gets disconnected blkback does not unmap persistent grants when frontend goes to Closed state (e.g. when blkfront module is being removed). This leads to the following in guest's dmesg: [ 343.243825] xen:grant_table: WARNING: g.e. 0x445 still in use! [ 343.243825] xen:grant_table: WARNING: g.e. 0x42a still in use! ... When load module -> use device -> unload module sequence is performed multiple times it is possible to hit BUG() condition in blkfront module: [ 343.243825] kernel BUG at drivers/block/xen-blkfront.c:954! [ 343.243825] invalid opcode: 0000 [#1] SMP [ 343.243825] Modules linked in: xen_blkfront(-) ata_generic pata_acpi [last unloaded: xen_blkfront] ... [ 343.243825] Call Trace: [ 343.243825] [<ffffffff814111ef>] ? unregister_xenbus_watch+0x16f/0x1e0 [ 343.243825] [<ffffffffa0016fbf>] blkfront_remove+0x3f/0x140 [xen_blkfront] ... [ 343.243825] RIP [<ffffffffa0016aae>] blkif_free+0x34e/0x360 [xen_blkfront] [ 343.243825] RSP <ffff88001eb8fdc0> We don't need to keep these grants if we're disconnecting as frontend might already forgot about them. Solve the issue by moving xen_blkbk_free_caches() call from xen_blkif_free() to xen_blkif_disconnect(). Now we can see the following: [ 928.590893] xen:grant_table: WARNING: g.e. 0x587 still in use! [ 928.591861] xen:grant_table: WARNING: g.e. 0x372 still in use! ... [ 929.592146] xen:grant_table: freeing g.e. 0x587 [ 929.597174] xen:grant_table: freeing g.e. 0x372 ... Backend does not keep persistent grants any more, reconnect works fine. CC: stable@vger.kernel.org Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-10-01 16:32:23 -04:00
Michael Opdenacker	baf378126b	rsxx: Remove deprecated IRQF_DISABLED This removes the use of the IRQF_DISABLED flag from drivers/block/rsxx/core.c It's a NOOP since 2.6.35 and it will be removed one day. Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Acked-by Philip Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-01 14:07:39 -06:00
Michael Opdenacker	fc2021fb9b	block: hd: remove deprecated IRQF_DISABLED This patch removes the use of the IRQF_DISABLED flag from drivers/block/hd.c It's a NOOP since 2.6.35 and it will be removed one day. This also removes a related comment which is obsolete too. Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-01 08:16:07 -06:00
Dwight Engen	d0aedcd4f1	vio: fix reuse of vio_dring slot vio_dring_avail() will allow use of every dring entry, but when the last entry is allocated then dr->prod == dr->cons which is indistinguishable from the ring empty condition. This causes the next allocation to reuse an entry. When this happens in sunvdc, the server side vds driver begins nack'ing the messages and ends up resetting the ldc channel. This problem does not effect sunvnet since it checks for < 2. The fix here is to just never allocate the very last dring slot so that full and empty are not the same condition. The request start path was changed to check for the ring being full a bit earlier, and to stop the blk_queue if there is no space left. The blk_queue will be restarted once the ring is only half full again. The number of ring entries was increased to 512 which matches the sunvnet and Solaris vdc drivers, and greatly reduces the frequency of hitting the ring full condition and the associated blk_queue stop/starting. The checks in sunvent were adjusted to account for vio_dring_avail() returning 1 less. Orabug: 19441666 OraBZ: 14983 Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-30 14:37:35 -07:00
Dwight Engen	5eed69ffd2	sunvdc: limit each sg segment to a page ldc_map_sg() could fail its check that the number of pages referred to by the sg scatterlist was <= the number of cookies. This fixes the issue by doing a similar thing to the xen-blkfront driver, ensuring that the scatterlist will only ever contain a segment count <= port->ring_cookies, and each segment will be page aligned, and <= page size. This ensures that the scatterlist is always mappable. Orabug: 19347817 OraBZ: 15945 Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-30 14:37:35 -07:00
Allen Pais	de5b73f084	sunvdc: compute vdisk geometry from capacity The LDom diskserver doesn't return reliable geometry data. In addition, the types for all fields in the vio_disk_geom are u16, which were being truncated in the cast into the u8's of the Linux struct hd_geometry. Modify vdc_getgeo() to compute the geometry from the disk's capacity in a manner consistent with xen-blkfront::blkif_getgeo(). Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-30 14:37:35 -07:00
Allen Pais	9bce21828d	sunvdc: add cdrom and v1.1 protocol support Interpret the media type from v1.1 protocol to support CDROM/DVD. For v1.0 protocol, a disk's size continues to be calculated from the geometry returned by the vdisk server. The geometry returned by the server can be less than the actual number of sectors available in the backing image/device due to the rounding in the division used to compute the geometry in the vdisk server. In v1.1 protocol a disk's actual size in sectors is returned during the handshake. Use this size when v1.1 protocol is negotiated. Since this size will always be larger than the former geometry computed size, disks created under v1.0 will be forwards compatible to v1.1, but not vice versa. Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-30 14:37:34 -07:00
Christoph Hellwig	c8a446ad69	blk-mq: rename blk_mq_end_io to blk_mq_end_request Now that we've changed the driver API on the submission side use the opportunity to fix up the name on the completion side to fit into the general scheme. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	e2490073cd	blk-mq: call blk_mq_start_request from ->queue_rq When we call blk_mq_start_request from the core blk-mq code before calling into ->queue_rq there is a racy window where the timeout handler can hit before we've fully set up the driver specific part of the command. Move the call to blk_mq_start_request into the driver so the driver can start the request only once it is fully set up. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	bf57229745	blk-mq: remove REQ_END Pass an explicit parameter for the last request in a batch to ->queue_rq instead of using a request flag. Besides being a cleaner and non-stateful interface this is also required for the next patch, which fixes the blk-mq I/O submission code to not start a time too early. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Jens Axboe	6d11fb454b	Merge branch 'for-linus' into for-3.18/core Moving patches from for-linus to 3.18 instead, pull in this changes that will go to Linus today.	2014-09-22 11:57:32 -06:00
Lai Jiangshan	e9f05b4cfe	drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks The original code are the same as RB_DECLARE_CALLBACKS(). CC: Michel Lespinasse <walken@google.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-18 09:00:17 -06:00
Lai Jiangshan	82cfb90bc9	drbd: compute the end before rb_insert_augmented() Commit `98683650` "Merge branch 'drbd-8.4_ed6' into for-3.8-drivers-drbd-8.4_ed6" switches to the new augment API, but the new API requires that the tree is augmented before rb_insert_augmented() is called, which is missing. So we add the augment-code to drbd_insert_interval() when it travels the tree up to down before rb_insert_augmented(). See the example in include/linux/interval_tree_generic.h or Documentation/rbtree.txt. drbd_insert_interval() may cancel the insertion when traveling, in this case, the just added augment-code does nothing before cancel since the @this node is already in the subtrees in this case. CC: Michel Lespinasse <walken@google.com> CC: stable@kernel.org # v3.10+ Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-18 09:00:16 -06:00
Linus Torvalds	645cc09381	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A small collection of fixes for the current rc series. This contains: - Two small blk-mq patches from Rob Elliott, cleaning up error case at init time. - A fix from Ming Lei, fixing SG merging for blk-mq where QUEUE_FLAG_SG_NO_MERGE is the default. - A dev_t minor lifetime fix from Keith, fixing an issue where a minor might be reused before all references to it were gone. - Fix from Alan Stern where an unbalanced queue bypass caused SCSI some headaches when it does a series of add/del on devices without fully registrering the queue. - A fix from me for improving the scaling of tag depth in blk-mq if we are short on memory" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: scale depth and rq map appropriate if low on memory Block: fix unbalanced bypass-disable in blk_register_queue block: Fix dev_t minor allocation lifetime blk-mq: cleanup after blk_mq_init_rq_map failures blk-mq: pass along blk_mq_alloc_tag_set return values blk-merge: fix blk_recount_segments	2014-09-13 09:39:55 -07:00
Jens Axboe	b207892b06	Merge branch 'for-linus' into for-3.18/core A bit of churn on the for-linus side that would be nice to have in the core bits for 3.18, so pull it in to catch us up and make forward progress easier. Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/scsi_ioctl.c	2014-09-11 09:31:18 -06:00
Philipp Reisner	590001c229	drbd: Add missing newline in resync progress display in /proc/drbd Was broken in 2010 with commit `4b0715f096` Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Lars Ellenberg	729e8b87ba	drbd: reduce lock contention in drbd_worker The worker may now dequeue work items in batches. This should reduce lock contention during busy periods. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Lars Ellenberg	abde9cc6a5	drbd: Improve asender performance Shorten receive path in the asender thread. Reduces CPU utilisation of asender when receiving packets, and with that increases IOPs. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Andreas Gruenbacher	b47a06d105	drbd: Get rid of the WORK_PENDING macro This macro doesn't add any value; just use test_bit() instead. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Andreas Gruenbacher	d1b8085356	drbd: Get rid of the __no_warn and __cond_lock macros These macros can easily be replaced with its definition. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Andreas Gruenbacher	8d4ba3f0fa	drbd: Avoid inconsistent locking warning request_timer_fn() takes resource->req_lock via the device and releases it via the connection. Avoid this as it is confusing static code checkers. Reported-by: "Dan Carpenter" <dan.carpenter@oracle.com> Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Philipp Marek	f0c21e6228	drbd: Remove superfluous newline from "resync_extents" debugfs entry. See "drbd/resources//volumes//resync_extents". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Andreas Gruenbacher	ed15b79509	drbd: Use consistent names for all the bi_end_io callbacks Now they follow the _endio naming sheme. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Andreas Gruenbacher	11f8b2b69d	drbd: Use better variable names Rename local variable 'ds' to 'disk_state' or 'data_size'. 'dgs' to 'digest_size' Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-11 08:41:29 -06:00
Wei Yongjun	255939e783	rbd: fix error return code in rbd_dev_device_setup() Fix to return -ENOMEM from the workqueue alloc error handling case instead of 0, as done elsewhere in this function. Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>	2014-09-10 11:59:06 +04:00
Ilya Dryomov	58d1362b50	rbd: avoid format-security warning inside alloc_workqueue() drivers/block/rbd.c: In function ‘rbd_dev_device_setup’: drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security] Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-09-10 11:59:06 +04:00
Robert Elliott	dc501dc0d9	blk-mq: pass along blk_mq_alloc_tag_set return values Two of the blk-mq based drivers do not pass back the return value from blk_mq_alloc_tag_set, instead just returning -ENOMEM. blk_mq_alloc_tag_set returns -EINVAL if the number of queues or queue depth is bad. -ENOMEM implies that retrying after freeing some memory might be more successful, but that won't ever change in the -EINVAL cases. Change the null_blk and mtip32xx drivers to pass along the return value. Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-02 15:39:03 -06:00
Linus Torvalds	10f3291a1d	Merge branch 'akpm' (fixes from Andrew Morton) Merge patches from Andrew Morton: "22 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits) kexec: purgatory: add clean-up for purgatory directory Documentation/kdump/kdump.txt: add ARM description flush_icache_range: export symbol to fix build errors tools: selftests: fix build issue with make kselftests target ocfs2: quorum: add a log for node not fenced ocfs2: o2net: set tcp user timeout to max value ocfs2: o2net: don't shutdown connection when idle timeout ocfs2: do not write error flag to user structure we cannot copy from/to x86/purgatory: use approprate -m64/-32 build flag for arch/x86/purgatory drivers/rtc/rtc-s5m.c: re-add support for devices without irq specified xattr: fix check for simultaneous glibc header inclusion kexec: remove CONFIG_KEXEC dependency on crypto kexec: create a new config option CONFIG_KEXEC_FILE for new syscall x86,mm: fix pte_special versus pte_numa hugetlb_cgroup: use lockdep_assert_held rather than spin_is_locked mm/zpool: use prefixed module loading zram: fix incorrect stat with failed_reads lib: turn CONFIG_STACKTRACE into an actual option. mm: actually clear pmd_numa before invalidating memblock, memhotplug: fix wrong type in memblock_find_in_range_node(). ...	2014-08-29 16:28:29 -07:00
Chao Yu	0cf1e9d6c3	zram: fix incorrect stat with failed_reads Since we allocate a temporary buffer in zram_bvec_read to handle partial page operations in commit `924bd88d70` ("Staging: zram: allow partial page operations"), our ->failed_reads value may be incorrect as we do not increase its value when failing to allocate the temporary buffer. Let's fix this issue and correct the annotation of failed_reads. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-08-29 16:28:16 -07:00
Joe Lawrence	a492f07545	block,scsi: fixup blk_get_request dead queue scenarios The blk_get_request function may fail in low-memory conditions or during device removal (even if __GFP_WAIT is set). To distinguish between these errors, modify the blk_get_request call stack to return the appropriate ERR_PTR. Verify that all callers check the return status and consider IS_ERR instead of a simple NULL pointer check. For consistency, make a similar change to the blk_mq_alloc_request leg of blk_get_request. It may fail if the queue is dead, or the caller was unwilling to wait. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-28 10:03:46 -06:00
Joe Lawrence	eb571eeade	block,scsi: verify return pointer from blk_get_request The blk-core dead queue checks introduce an error scenario to blk_get_request that returns NULL if the request queue has been shutdown. This affects the behavior for __GFP_WAIT callers, who should verify the return value before dereferencing. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-26 15:20:23 -06:00
Geert Uytterhoeven	336ec13734	paride/pcd: Fix grammar Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-08-26 09:35:57 +02:00
Dmitry Monakhov	aeac318194	brd: add ram disk visibility option Currenly ram disk is not visiable inside /proc/partitions. This was done for compatibility reasons here: `53978d0a7a`. But some utilities expect disk presents in /proc/partitions. Let's add module's option and let's administrator chose visibility behaviour. By default, old behaviour preserved. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:42:01 -05:00
Michal Simek	ffb5db73eb	block: systemace: Remove .owner field for driver There is no need to init .owner field. Based on the patch from Peter Griffin <peter.griffin@linaro.org> "mmc: remove .owner field for drivers using module_platform_driver" This patch removes the superflous .owner field for drivers which use the module_platform_driver API, as this is overriden in platform_driver_register anyway." Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:54 -05:00
Linus Torvalds	a11c5c9ef6	PCI changes for the v3.17 merge window (part 2): Miscellaneous - Remove DEFINE_PCI_DEVICE_TABLE macro use (Benoit Taine) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJT7PyAAAoJEFmIoMA60/r8kjQQALr/8oEfZoVcjgCb7waWOr25 hUTnrI6GBIAh/50hoBiPq0ouPCAKVv66+CUhuhFkLP7oJz+rMU0B9hfUvdLfmCpH 7ppaallkllT9nPFIr7h5RUWLXsoQyuHmCYmSrUCcnlT2LPgU0dN72YWElLisEM6Z Pldg3933xyIQaCWviHjGEjWb7NvC+JY4pTkV5iyqGgU8Ale/eFYtLLSfdBEjIbGv VDirYZmKELYeuncZPrTAsp4IENRMZn702wwDakMSODVMEWtJB5h4yrBawqQDlFP5 9ztIX6n9p9zkdVKbYZlx/Xwv6SYEnYXLxauVQMSO3Nck7Z10R5Ud+5uuCg/6mWH8 AQI4UV5bbJcg7zHgocTG9XLFLFPoPtD2JT6k6UT1LeUAiAOqcSzhRO+/qJBmJOWZ Zv+EHXPlxBrl0zNifut6ZQrY17teuItVtmha70a/9W3PjnIx3KecqLcTwdTvDsOY IAyH8WMZrBKpPpsczSmfE93i2Z1QRS91HEAOeSMxl/98dcDTdllYZS7spjoDll2f xmpGDbpriLSCu2XsGHfTC9RbqA7CyuFlHggJSQDkT/5Esli0sCs7eweTuK3RVvPu t6bUHK3yElb6x9qMZhb5q6l72wSMlGMishTdaxEHmqrEA8PtaIFodmVX2T/Zel5n GHN6bysPqDItNR2v/3JX =jJGu -----END PGP SIGNATURE----- Merge tag 'pci-v3.17-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull DEFINE_PCI_DEVICE_TABLE removal from Bjorn Helgaas: "Part two of the PCI changes for v3.17: - Remove DEFINE_PCI_DEVICE_TABLE macro use (Benoit Taine) It's a mechanical change that removes uses of the DEFINE_PCI_DEVICE_TABLE macro. I waited until later in the merge window to reduce conflicts, but it's possible you'll still see a few" * tag 'pci-v3.17-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use	2014-08-14 18:10:33 -06:00
Linus Torvalds	d429a3639c	Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: "Nothing out of the ordinary here, this pull request contains: - A big round of fixes for bcache from Kent Overstreet, Slava Pestov, and Surbhi Palande. No new features, just a lot of fixes. - The usual round of drbd updates from Andreas Gruenbacher, Lars Ellenberg, and Philipp Reisner. - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei has taken it one step further and added support for actually using more than one queue. - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to compliment the the default behavior of adding to the tail of the queue. From Douglas Gilbert" * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits) bcache: Drop unneeded blk_sync_queue() calls bcache: add mutex lock for bch_is_open bcache: Correct printing of btree_gc_max_duration_ms bcache: try to set b->parent properly bcache: fix memory corruption in init error path bcache: fix crash with incomplete cache set bcache: Fix more early shutdown bugs bcache: fix use-after-free in btree_gc_coalesce() bcache: Fix an infinite loop in journal replay bcache: fix crash in bcache_btree_node_alloc_fail tracepoint bcache: bcache_write tracepoint was crashing bcache: fix typo in bch_bkey_equal_header bcache: Allocate bounce buffers with GFP_NOWAIT bcache: Make sure to pass GFP_WAIT to mempool_alloc() bcache: fix uninterruptible sleep in writeback thread bcache: wait for buckets when allocating new btree root bcache: fix crash on shutdown in passthrough mode bcache: fix lockdep warnings on shutdown bcache allocator: send discards with correct size bcache: Fix to remove the rcu_sched stalls. ...	2014-08-14 09:10:21 -06:00
Linus Torvalds	8d2d441ac4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is a lot of refactoring and hardening of the libceph and rbd code here from Ilya that fix various smaller bugs, and a few more important fixes with clone overlap. The main fix is a critical change to the request_fn handling to not sleep that was exposed by the recent mutex changes (which will also go to the 3.16 stable series). Yan Zheng has several fixes in here for CephFS fixing ACL handling, time stamps, and request resends when the MDS restarts. Finally, there are a few cleanups from Himangi Saraogi based on Coccinelle" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits) libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly rbd: remove extra newlines from rbd_warn() messages rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC rbd: rework rbd_request_fn() ceph: fix kick_requests() ceph: fix append mode write ceph: fix sizeof(struct tYpO *) typo ceph: remove redundant memset(0) rbd: take snap_id into account when reading in parent info rbd: do not read in parent info before snap context rbd: update mapping size only on refresh rbd: harden rbd_dev_refresh() and callers a bit rbd: split rbd_dev_spec_update() into two functions rbd: remove unnecessary asserts in rbd_dev_image_probe() rbd: introduce rbd_dev_header_info() rbd: show the entire chain of parent images ceph: replace comma with a semicolon rbd: use rbd_segment_name_free() instead of kfree() ceph: check zero length in ceph_sync_read() ceph: reset r_resend_mds after receiving -ESTALE ...	2014-08-13 17:43:29 -06:00
Benoit Taine	9baa3c34ac	PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use We should prefer `struct pci_device_id` over `DEFINE_PCI_DEVICE_TABLE` to meet kernel coding style guidelines. This issue was reported by checkpatch. A simplified version of the semantic patch that makes this change is as follows (http://coccinelle.lip6.fr/): // <smpl> @@ identifier i; declarer name DEFINE_PCI_DEVICE_TABLE; initializer z; @@ - DEFINE_PCI_DEVICE_TABLE(i) + const struct pci_device_id i[] = z; // </smpl> [bhelgaas: add semantic patch] Signed-off-by: Benoit Taine <benoit.taine@lip6.fr> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>	2014-08-12 12:15:14 -06:00
Joe Perches	a5bbf6160c	block: use pci_zalloc_consistent Remove the now unnecessary memset too. Signed-off-by: Joe Perches <joe@perches.com> Mike Miller <mike.miller@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-08-08 15:57:28 -07:00
Ilya Dryomov	9584d50826	rbd: remove extra newlines from rbd_warn() messages rbd_warn() string should be a single line - rbd_warn() appends \n. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-08-07 16:01:09 +04:00
Ilya Dryomov	7a716aac01	rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC Now that rbd_img_request_create() is called from work functions, no need to use GFP_ATOMIC. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-08-07 16:00:52 +04:00
Ilya Dryomov	bc1ecc65a2	rbd: rework rbd_request_fn() While it was never a good idea to sleep in request_fn(), commit `34c6bc2c91` ("locking/mutexes: Add extra reschedule point") made it a bad idea. mutex_lock() since 3.15 may reschedule before putting task on the mutex wait queue, which for tasks in !TASK_RUNNING state means block forever. request_fn() may be called with !TASK_RUNNING on the way to schedule() in io_schedule(). Offload request handling to a workqueue, one per rbd device, to avoid calling blocking primitives from rbd_request_fn(). Fixes: http://tracker.ceph.com/issues/8818 Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Tested-by: Eric Eastman <eric0e@aol.com> Tested-by: Greg Wilson <greg.wilson@keepertech.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-08-07 14:56:20 +04:00
Weijie Yang	d2d5e762c8	zram: replace global tb_lock with fine grain lock Currently, we use a rwlock tb_lock to protect concurrent access to the whole zram meta table. However, according to the actual access model, there is only a small chance for upper user to access the same table[index], so the current lock granularity is too big. The idea of optimization is to change the lock granularity from whole meta table to per table entry (table -> table[index]), so that we can protect concurrent access to the same table[index], meanwhile allow the maximum concurrency. With this in mind, several kinds of locks which could be used as a per-entry lock were tested and compared: Test environment: x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04, kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO. iozone test: iozone -t 4 -R -r 16K -s 200M -I +Z (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------------- Initial write 1381094 1425435 1422860 1423075 1421521 Rewrite 1529479 1641199 1668762 1672855 1654910 Read 8468009 11324979 11305569 11117273 10997202 Re-read 8467476 11260914 11248059 11145336 10906486 Reverse Read 6821393 8106334 8282174 8279195 8109186 Stride read 7191093 8994306 9153982 8961224 9004434 Random read 7156353 8957932 9167098 8980465 8940476 Mixed workload 4172747 5680814 5927825 5489578 5972253 Random write 1483044 1605588 1594329 1600453 1596010 Pwrite 1276644 1303108 1311612 1314228 1300960 Pread 4324337 4632869 4618386 4457870 4500166 To enhance the possibility of access the same table[index] concurrently, set zram a small disksize(10MB) and let threads run with large loop count. fio test: fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4 --filename=/dev/zram0 --name=seq-write --rw=write --stonewall --name=seq-read --rw=read --stonewall --name=seq-readwrite --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall (10MB zram raw block device, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------- seq-write 933789 999357 1003298 995961 1001958 seq-read 5634130 6577930 6380861 6243912 6230006 seq-rw 1405687 1638117 1640256 1633903 1634459 rand-rw 1386119 1614664 1617211 `1609267` 1612471 All the optimization methods show a higher performance than the base, however, it is hard to say which method is the most appropriate. On the other hand, zram is mostly used on small embedded system, so we don't want to increase any memory footprint. This patch pick the bit_spinlock method, pack object size and page_flag into an unsigned long table.value, so as to not increase any memory overhead on both 32-bit and 64-bit system. On the third hand, even though different kinds of locks have different performances, we can ignore this difference, because: if zram is used as zram swapfile, the swap subsystem can prevent concurrent access to the same swapslot; if zram is used as zram-blk for set up filesystem on it, the upper filesystem and the page cache also prevent concurrent access of the same block mostly. So we can ignore the different performances among locks. Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-08-06 18:01:23 -07:00
Minchan Kim	023b409f9d	zram: use size_t instead of u16 Some architectures (eg, hexagon and PowerPC) could use PAGE_SHIFT of 16 or more. In these cases u16 is not sufficiently large to represent a compressed page's size so use size_t. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-08-06 18:01:23 -07:00
Sergey Senozhatsky	a830eff749	zram: remove unused SECTOR_SIZE define Drop SECTOR_SIZE define, because it's not used. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-08-06 18:01:22 -07:00
Sergey Senozhatsky	cb8f2eec3c	zram: rename struct `table' to` zram_table_entry' Andrew Morton has recently noted that `struct table' actually represents table entry and, thus, should be renamed. Rename to `zram_table_entry'. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-08-06 18:01:22 -07:00
Ilya Dryomov	4d9b67cddd	rbd: take snap_id into account when reading in parent info If we are mapping a snapshot, we must read in the parent_overlap value of that snapshot instead of that of the base image. Not doing so may in particular result in us returning zeros instead of user data: # cat overlap-snap.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null echo "Base image" dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null \| xxd rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd snap create bar@snap BAR_DEV=$(rbd map bar@snap) echo "Snapshot" dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null \| xxd rbd resize --allow-shrink --size 4 bar echo "Snapshot after base image resize" dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null \| xxd # ./overlap-snap.sh Base image 0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6. Snapshot 0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6. Resizing image: 100% complete...done. Snapshot after base image resize 0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........ Even though bar@snap is taken with the old bar parent_overlap (8M), reads from bar@snap beyond the new bar parent_overlap (4M) return zeroes. Fix it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:16:50 +04:00
Ilya Dryomov	e8f59b595d	rbd: do not read in parent info before snap context Currently rbd_dev_v2_header_info() reads in parent info before the snap context is read in. This is wrong, because we may need to look at the the parent_overlap value of the snapshot instead of that of the base image, for example when mapping a snapshot - see next commit. (When mapping a snapshot, all we got is its name and we need the snap context to translate that name into an id to know which parent info to look for.) The approach taken here is to make sure rbd_dev_v2_parent_info() is called after the snap context has been read in. The other approach would be to add a parent_overlap field to struct rbd_mapping and maintain it the same way rbd_mapping::size is maintained. The reason I chose the first approach is that the value of keeping around both base image values and the actual mapping values is unclear to me. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:16:14 +04:00
Ilya Dryomov	5ff1108ccc	rbd: update mapping size only on refresh There is no sense in trying to update the mapping size before it's even been set. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:58 +04:00
Ilya Dryomov	52bb1f9bed	rbd: harden rbd_dev_refresh() and callers a bit Recently discovered watch/notify problems showed that we really can't ignore errors in anything refresh related. Alas, currently there is not much we can do in response to those errors, except print warnings. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:44 +04:00
Ilya Dryomov	0407759971	rbd: split rbd_dev_spec_update() into two functions rbd_dev_spec_update() has two modes of operation, with nothing in common between them. Split it into two functions, one for each mode and make our expectations more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:35 +04:00
Ilya Dryomov	7626eb7d82	rbd: remove unnecessary asserts in rbd_dev_image_probe() spec->image_id assert doesn't buy us much and image_format is asserted in rbd_dev_header_name() and rbd_dev_header_info() anyway. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:18 +04:00
Ilya Dryomov	a720ae0901	rbd: introduce rbd_dev_header_info() A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:00 +04:00
Ilya Dryomov	ff96128fb0	rbd: show the entire chain of parent images Make /sys/bus/rbd/devices/<id>/parent show the entire chain of parent images. While at it, kernel sprintf() doesn't return negative values, casting to unsigned long long is no longer necessary and there is no good reason to split into multiple sprintf() calls. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:14:27 +04:00
Himangi Saraogi	7d5079aa8b	rbd: use rbd_segment_name_free() instead of kfree() Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather than kfree. The helper rbd_segment_name_free does the job here. Its position is shifted above the calling function. The Coccinelle semantic patch that detects this change is as follows: // <smpl> @@ expression x,E,c; @@ x = $kmem_cache_alloc\\|kmem_cache_zalloc\\|kmem_cache_alloc_node$(c,...) ... when != x = E when != &x ?-kfree(x) +kmem_cache_free(c,x) // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-07-24 11:57:27 +04:00
Minchan Kim	b4c5c60920	zram: avoid lockdep splat by revalidate_disk Sasha reported lockdep warning [1] introduced by [2]. It could be fixed by doing disk revalidation out of the init_lock. It's okay because disk capacity change is protected by init_lock so that revalidate_disk always sees up-to-date value so there is no race. [1] https://lkml.org/lkml/2014/7/3/735 [2] zram: revalidate disk after capacity change Fixes `2e32baea46` ("zram: revalidate disk after capacity change"). Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Alexander E. Patrakov" <patrakov@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> CC: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-07-23 15:10:54 -07:00
Dan Carpenter	bf0d6e4a11	drbd: silence underflow warning in read_in_block() My static checker warns that "data_size" could be negative and underflow the limit check. The code looks suspicious but I don't know if it is a real bug. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:23 +02:00
Lars Ellenberg	1e39152fea	drbd: implicitly truncate cpu-mask Don't error out with misleading "out of memory" if the cpu-mask has more bits set than there are CPUs. Just truncate to nr_cpu_ids implicitly. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:22 +02:00
Lars Ellenberg	193cb00ce3	drbd: drop spurious parameters from _drbd_md_sync_page_io size is always 4096, page is always device->md_io.page. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:22 +02:00
Lars Ellenberg	f5b90b6bf0	drbd: resync should only lock out specific ranges During resync, if we need to block some specific incoming write because of active resync requests to that same range, we potentially caused all new application writes (to "cold" activity log extents) to block until this one request has been processed. Improve the do_submit() logic to * grab all incoming requests to some "incoming" list * process this list - move aside requests that are blocked by resync - prepare activity log transactions, - commit transactions and submit corresponding requests - if there are remaining requests that only wait for activity log extents to become free, stop the fast path (mark activity log as "starving") - iterate until no more requests are waiting for the activity log, but all potentially remaining requests are only blocked by resync * only then grab new incoming requests That way, very busy IO on currently "hot" activity log extents cannot starve scattered IO to "cold" extents. And blocked-by-resync requests are processed once resync traffic on the affected region has ceased, without blocking anything else. The only blocking mode left is when we cannot start requests to "cold" extents because all currently "hot" extents are actually used. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:21 +02:00
Lars Ellenberg	cc356f85bb	drbd: debugfs: add per device data_gen_id The data generation identifiers used to be exposed via sysfs at /sys/block/drbdX/drbd/meta_data/data_gen_id (out-of-tree), for advanced policy scripting. Bring that information over to debugfs. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:21 +02:00
Lars Ellenberg	3d299f48a9	drbd: debugfs: add per connection oldest requests Information of former /sys/block/drbdX/drbd/oldest_requests is already with higher detail in these files: debugfs/drbd/resource/$name/in_flight_summary, debugfs/drbd/resource/$name/volumes/$vnr/oldest_requests This patch adds debugfs/drbd/resource/$name/connections/peer/oldest_requests Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:20 +02:00
Lars Ellenberg	b44e1184fc	drbd: debugfs: add version tag to debugfs files Make the first line of debugfs files a version number, starting now with "v: 0". If we change content of presentation, we will bump that. Monitoring or diagnostic scritps that may parse these files can then easily know when they need to be reviewed. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:19 +02:00
Lars Ellenberg	54e6fc38e8	drbd: debugfs: add per volume oldest_requests Show oldest requests * pending master bio completion and, * if different, local disk bio completion. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:19 +02:00
Lars Ellenberg	944410e97c	drbd: debugfs: add callback_history Add a per-connection worker thread callback_history with timing details, call site and callback function. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:18 +02:00
Lars Ellenberg	f418815f7a	drbd: debugfs: Add in_flight_summary * Add details about pending meta data operations to in_flight_summary. * Report number of requests waiting for activity log transactions. * timing details of peer_requests to in_flight_summary. * FLUSH details DRBD devides the incoming request stream into "epochs", in which peers are allowed to re-order writes independendly. These epochs are separated by P_BARRIER on the replication link. Such barrier packets, depending on configuration, may cause the receiving side to drain the lower level device request queues and call blkdev_issue_flush(). This is known to be an other major source of latency in DRBD. Track timing details of calls to blkdev_issue_flush(), and add them to in_flight_summary. * data socket stats To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD, it is useful to see network buffer stats along with the timing details of requests, peer requests, and meta data IO. * pending bitmap IO timing details to in_flight_summary. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:17 +02:00
Lars Ellenberg	4a521cca9b	drbd: debugfs: deal with destructor racing with open of debugfs file Try to close the race between open() and debugfs_remove_recursive() from inside an object destructor. Once open succeeds, the object should stay around. Open should not succeed if the object has already reached its destructor. This may be overkill, but to make that happen, we check for existence of a parent directory, "stale-ness" of "this" dentry, and serialize kref_get_unless_zero() on the outermost object relevant for this file with d_delete() on this dentry (using the parent's i_mutex). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:17 +02:00
Lars Ellenberg	db1866ffee	drbd: debugfs: add in_flight_summary data To help diagnosing "high latency" or "hung" IO situations on DRBD, present per drbd resource group a summary of operations currently in progress. First item is a list of oldest drbd_request objects waiting for various things: * still being prepared * waiting for activity log transaction * waiting for local disk * waiting to be sent * waiting for peer acknowledgement ("receive ack", "write ack") * waiting for peer epoch acknowledgement ("barrier ack") Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:16 +02:00
Lars Ellenberg	4d3d5aa83a	drbd: debugfs: add basic hierarchy Add new debugfs hierarchy /sys/kernel/debug/ drbd/ resources/ $resource_name/connections/peer/$volume_number/ $resource_name/volumes/$volume_number/ minors/$minor_number -> ../resources/$resource_name/volumes/$volume_number/ Followup commits will populate this hierarchy with files containing statistics, diagnostic information and some attribute data. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:16 +02:00
Lars Ellenberg	4ce4926683	drbd: track details of bitmap IO Track start and submit time of bitmap operations, and add pending bitmap IO contexts to a new pending_bitmap_io list. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:15 +02:00
Lars Ellenberg	c5a2c1509e	drbd: register peer requests on read_ee early Initialize peer_request with timestamp and proper empty list head. Add peer_request to list early, so debugfs can find this request and report it as "preparing", even if we sleep before we actually submit it. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:14 +02:00
Lars Ellenberg	21ae5d7f95	drbd: track timing details of peer_requests To be able to present timing details in debugfs, we need to track preparation/submit times of peer requests. Track peer request flags early, before they are put on the epoch_entry lists. Waiting for activity log transactions may be a major latency factor. We want to be able to present the peer_request state accurately in debugfs, and what it is waiting for. Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO. Set it only after calling drbd_al_begin_io(), clear it as soon as we call drbd_al_complete_io(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:14 +02:00
Lars Ellenberg	ad3fee7900	drbd: improve throttling decisions of background resynchronisation Background resynchronisation does some "side-stepping", or throttles itself, if it detects application IO activity, and the current resync rate estimate is above the configured "cmin-rate". What was not detected: if there is no application IO, because it blocks on activity log transactions. Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests, and count non-zero as application IO activity. This counter is exposed at proc_details level 2 and above. Also make sure to release the currently locked resync extent if we side-step due to such voluntary throttling. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:13 +02:00
Lars Ellenberg	7753a4c17f	drbd: add caching oldest request pointers for replication stages A request that is to be shipped to the peer goes through a few stages: - queued - sent, waiting for ack - ack received, waiting for "barrier ack", which is re-order epoch being closed on the peer by acknowledging a "cache flush" equivalent on the lower level device. In the later two stages, depending on protocol, we may have already completed this request to the upper layers, so it won't be found anymore on device->pending_master_completion[] lists. Track the oldest request yet to be sent (req_next), the oldest not yet acknowledged (req_ack_pending) and the oldest "still waiting for something from the peer" (req_not_net_done), doing short list walks on the transfer log to find the next pending one whenever such a request makes progress. Now we have a fast way to look up the oldest requests, don't do a transfer log walk every time. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:12 +02:00
Lars Ellenberg	844a6ae735	drbd: add lists to find oldest pending requests Adding requests to per-device fifo lists as soon as possible after allocating them leaves a simple list_first_entry_or_null() to find the oldest request, regardless what it is still waiting for. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:12 +02:00
Lars Ellenberg	e5f891b223	drbd: gather detailed timing statistics for drbd_requests Record (in jiffies) how much time a request spends in which stages. Followup commits will use and present this additional timing information so we can better locate and tackle the root causes of latency spikes, or present the backlog for asynchronous replication. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:11 +02:00
Lars Ellenberg	e37d2438d8	drbd: track meta data IO intent, start and submit time For diagnostic purposes, track intent, start time and latest submit time of meta data IO. Move separate members from struct drbd_device into the embeded struct drbd_md_io. s/md_io_(page\|in_use)/md_io.\1/ Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:10 +02:00
Lars Ellenberg	a8ba0d6069	drbd: fix drbd_destroy_device reference count updates drbd_destroy_device means to give up reference counts on the connection(s) reachable via the peer_device(s). It must not do that by iterating via device->resource->connections, resource and connections may have already been disassociated by drbd_free_resource, and we'd leak connection refs. Instead, iterate via device->peer_devices->connection. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:10 +02:00
Lars Ellenberg	c2258ffc56	drbd: poison free'd device, resource and connection structs Now that we have additional asynchronous kref_get/kref_put via debugfs, make sure we catch access after free. Poison struct drbd_device, drbd_connection and drbd_resource before kfree() with 0xfd, 0xfc, and 0xf2, respectively. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:09 +02:00
Lars Ellenberg	45d2933c92	drbd: also keep track of trim -> zero-out fallback peer_requests To be able to find and present such zero-out fallback peer_requests in debugfs, we add those to "active_ee", once that list drained. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:09 +02:00
Lars Ellenberg	b9ed7080d7	drbd: consistently use list_add_tail for peer_request tracking Keep the epoch entry lists (active_ee, read_ee, sync_ee, ...) consistently "oldest first". That way finding the oldest not yet successfully processed request is simply list_first_entry_or_null. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:08 +02:00
Lars Ellenberg	41d9f7cd5b	drbd: drop drbd_md_flush The only user of drbd_md_flush was bm_rw(), and it is always followed by either a drbd_md_sync(), or an al_write_transaction(), which, if so configured, both end up submiting a FLUSH\|FUA request anyways. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:07 +02:00
Lars Ellenberg	15e26f6a3c	drbd: add drbd_queue_work_if_unqueued helper We sometimes do if (list_empty(&w.list)) drbd_queue_work(&q, &w.list); Removal (list_del_init) may happen outside all locks, after all pending work entries have been moved to an on-stack local work list. For not dynamically allocated, but embeded, work structs, we must avoid to re-add until it really was removed. Move that list_empty check inside the spin_lock(&q->q_lock) within the helper function, and change to list_empty_careful(). This may have been the reason for a list_add corruption inside drbd_queue_work(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:07 +02:00
Lars Ellenberg	7f34f61490	drbd: drbd_rs_number_requests: fix unit mismatch in comparison We try to limit the number of "in-flight" resync requests. One condition for that is the amount of requested data should not exceed half of what can be covered by our "max-buffers" setting. However we compared number of 4k pages with number of in-flight 512 Byte sectors, and this extra throttle triggered much earlier than intended. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:06 +02:00
Lars Ellenberg	f88c5d90cc	drbd: cosmetic: change all printk(level, ...) to pr_<level>(...) Cosmetic change only. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:05 +02:00
Lars Ellenberg	2f2abeae3c	drbd: clear CRASHED_PRIMARY only after successful resync If we lost a disk during the first resync after primary crash, we could have prematurely cleared the CRASHED_PRIMARY flag. Testing on C_CONNECTED is not what we meant there, but testing for both peers to become D_UP_TO_DATE. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:05 +02:00
Lars Ellenberg	506afb6248	drbd: improve resync request throttling due to sendbuf size If we throttle resync because the socket sendbuffer is filling up, tell TCP about it, so it may expand the sendbuffer for us. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:04 +02:00
Joe Perches	659b2e3bb8	block: Convert last uses of __FUNCTION__ to __func__ Just about all of these have been converted to __func__, so convert the last uses. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:04 +02:00
Monam Agarwal	ccdd6a93ee	drivers/block: Use RCU_INIT_POINTER(x, NULL) in drbd/drbd_state.c This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:03 +02:00
Lars Ellenberg	0c066bc39e	drbd: short-circuit in maybe_pull_ahead If we already "pulled ahead", we can short-circuit, and avoid logging the same messages over and over again. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:02 +02:00
Lars Ellenberg	08d0dabf48	drbd: application writes may set-in-sync in protocol != C If "dirty" blocks are written to during resync, that brings them in-sync. By explicitly requesting write-acks during resync even in protocol != C, we now can actually respect this. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:02 +02:00
Philipp Reisner	5d0b17f1a2	drbd: New net configuration option socket-check-timeout In setups involving a DRBD-proxy and connections that experience a lot of buffer-bloat it might be necessary to set ping-timeout to an unusual high value. By default DRBD uses the same value to wait if a newly established TCP-connection is stable. Since the DRBD-proxy is usually located in the same data center such a long wait time may hinder DRBD's connect process. In such setups socket-check-timeout should be set to at least to the round trip time between DRBD and DRBD-proxy. I.e. in most cases to 1. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:01 +02:00
Philipp Reisner	4920e37a9e	drbd: Limit the time we are waiting for the first packet on an accepted socket Before the patch 'drbd: Keep the listening socket open while trying to connect to the peer' the newly created socket inherited the receive timeout from the listen socket. The listen socket had a receive timeout of connect-intervall +- 30% random jitter. The real issue is that after the mentioned patch we had no timeout at all. Now use 4 times the ping-timeout. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:00 +02:00
Lars Ellenberg	aaaba34576	drbd: implement csums-after-crash-only Checksum based resync trades CPU cycles for network bandwidth, in situations where we expect much of the to-be-resynced blocks to be actually identical on both sides already. In a "network hickup" scenario, it won't help: all to-be-resynced blocks will typically be different. The use case is for the resync of potentially different blocks after crash recovery -- the crash recovery had marked larger areas (those covered by the activity log) as need-to-be-resynced, just in case. Most of those blocks will be identical. This option makes it possible to configure checksum based resync, but only actually use it for the first resync after primary crash. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:35:00 +02:00
Lars Ellenberg	6a8d68b187	drbd: don't implicitly resize Diskless node beyond end of device During handshake, we compare backend sizes, and user set limits, and agree on what device size we are going to expose. We remember that last-agreed-size in our meta data. But if we come up diskless, we have to accept what the peer presents us with. We used to accept the peers maximum potential capacity (backend size), which is wrong, and could lead to IO errors due to access beyond end of device. Instead, we need to accept the peer's current size. Unless that is communicated as 0, in which case we accept the backend size, or the user set limit, if set. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:59 +02:00
Lars Ellenberg	a5655dac75	drbd: fix bogus resync stats in /proc/drbd We intentionally do not serialize /proc/drbd access with internal state changes or statistic updates. Because of that, cat /proc/drbd may race with resync just being finished, still see the sync state, and find information about number of blocks still to go, but then find the total number of blocks within this resync has just been reset to 0 when accessing it. This now produces bogus numbers in the resync speed estimates. Fix by accessing all relevant data only once, and fixing it up if "still to go" happens to be more than "total". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:59 +02:00
Andreas Gruenbacher	caa3db0e14	drbd: Remove unnecessary/unused code Get rid of dump_stack() debug statements. There is no point whatsoever in registering and unregistering a reboot notifier that doesn't do anything. The intention was to switch to an "emergency read-only" mode, so we won't have to resync the full activity log just because we had been Primary before the reboot. Once we have that implemented, we may re-introduce the reboot notifier. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:58 +02:00
Lars Ellenberg	8ce953aa39	drbd: silence -Wmissing-prototypes warnings Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:57 +02:00
Lars Ellenberg	3e0c78d346	drbd: drop wrong debugging aid The textual representation of resync extents in /proc/drbd presented with proc_details >= 3 was wrong, it used bitnumbers as bitmasks. It was not particularly useful either, and I doubt anyone has even tried to look at it in the last few years. Drop it. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:57 +02:00
Lars Ellenberg	4dd726f029	drbd: get rid of drbd_queue_work_front The last user was al_write_transaction, if called with "delegate", and the last user to call it with "delegate = true" was the receiver thread, which has no need to delegate, but can call it himself. Finally drop the delegate parameter, drop the extra w_al_write_transaction callback, and drop drbd_queue_work_front. Do not (yet) change dequeue_work_item to dequeue_work_batch, though. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:56 +02:00
Lars Ellenberg	ac0acb9e39	drbd: use drbd_device_post_work() in more places This replaces the md_sync_work member of struct drbd_device by a new MD_SYNC "work bit" in device->flags. This replaces the resync_start_work member of struct drbd_device by a new RS_START "work bit" in device->flags. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:55 +02:00
Lars Ellenberg	e334f55095	drbd: make sure disk cleanup happens in worker context The recent fix to put_ldev() (correct ordering of access to local_cnt and state.disk; memory barrier in __drbd_set_state) guarantees that the cleanup happens exactly once. However it does not yet guarantee that the cleanup happens from worker context, the last put_ldev() may still happen from atomic context, which must not happen: blkdev_put() may sleep. Fix this by scheduling the cleanup to the worker instead, using a couple more bits in device->flags and a new helper, drbd_device_post_work(). Generalized the "resync progress" work to cover these new work bits. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:55 +02:00
Lars Ellenberg	ba3c6fb87d	drbd: close race when detaching from disk BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 IP: bd_release+0x21/0x70 Process drbd_w_t7146 Call Trace: close_bdev_exclusive drbd_free_ldev [drbd] drbd_ldev_destroy [drbd] w_after_state_ch [drbd] Race probably went like this: state.disk = D_FAILED ... first one to hit zero during D_FAILED: put_ldev() /* ----------------> 0 / i = atomic_dec_return() if (i == 0) if (state.disk == D_FAILED) schedule_work(go_diskless) / 1 <------ / get_ldev_if_state() go_diskless() do_some_pre_cleanup() corresponding put_ldev(): force_state(D_DISKLESS) / 0 <------ / i = atomic_dec_return() if (i == 0) atomic_inc() / ---------> 1 / state.disk = D_DISKLESS schedule_work(after_state_ch) / execution pre-empted by IRQ ? / after_state_ch() put_ldev() i = atomic_dec_return() / 0 / if (i == 0) if (state.disk == D_DISKLESS) if (state.disk == D_DISKLESS) drbd_ldev_destroy() drbd_ldev_destroy(); Trying to fix this by checking the disk state before* the atomic_dec_return(), which implies memory barriers, and by inserting extra memory barriers around the state assignment in __drbd_set_state(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:54 +02:00
Lars Ellenberg	2ed912e9d3	drbd: explicitly submit meta data requests with REQ_NOIDLE For some reason we have assumed NOIDLE was implied by one of the other flags we set. It is not (anymore?). Explicitly set REQ_NOIDLE for synchronous meta data updates, or we can seriously starve random writes when using CFQ. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:54 +02:00
Lars Ellenberg	720979fb90	drbd: move set_disk_ro() to after we persisted the new role This probably does not have any real life impact, but we should first persist any potentially new UUID and other meta data flags, as well as our new role, before we allow/disallow write access. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:53 +02:00
Lars Ellenberg	123ff122ad	drbd: trigger tcp_push_pending_frames() for PING and PING_ACK This should reduce latency for such in-DRBD-protocol "pings", and may help reduce spurious disconnect/reconnect cycles due to "PingAck did not arrive in time." Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:52 +02:00
Lars Ellenberg	66ce6dbce2	drbd: re-add lost conf_mutex protection in drbd_set_role The conf_update mutex used to be held while clearing the net_conf->discard_my_data flag inside drbd_set_role. It was moved into drbd_adm_set_role with drbd: allow parallel promote/demote actions but then replaced at that location by the newly introduced adm_mutex with drbd: Fix a potential deadlock in drbdsetup, introduce resource->adm_mutex And I simply forgot to put it back in at the original location. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:52 +02:00
Lars Ellenberg	fcb096740a	drbd: stop the meta data sync timer before open coded meta data sync If we re-write all meta data due to resize, we have open-coded write-out of our meta data super block. Stop the md_sync_timer, it would just trigger scary but in this case spurious "timer expired" messages. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:51 +02:00
Lars Ellenberg	5ab7d2c005	drbd: fix resync finished detection This fixes one recent regresion, and one long existing bug. The bug: drbd_try_clear_on_disk_bm() assumed that all "count" bits have to be accounted in the resync extent corresponding to the start sector. Since we allow application requests to cross our "extent" boundaries, this assumption is no longer true, resulting in possible misaccounting, scary messages ("BAD! sector=12345s enr=6 rs_left=-7 rs_failed=0 count=58 cstate=..."), and potentially, if the last bit to be cleared during resync would reside in previously misaccounted resync extent, the resync would never be recognized as finished, but would be "stalled" forever, even though all blocks are in sync again and all bits have been cleared... The regression was introduced by drbd: get rid of atomic update on disk bitmap works For an "empty" resync (rs_total == 0), we must not "finish" the resync on the SyncSource before the SyncTarget knows all relevant information (sync uuid). We need to wait for the full round-trip, the SyncTarget will then explicitly notify us. Also for normal, non-empty resyncs (rs_total > 0), the resync-finished condition needs to be tested before the schedule() in wait_for_work, or it is likely to be missed. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:50 +02:00
Lars Ellenberg	a80ca1ae81	drbd: fix a race stopping the worker thread We may implicitly call drbd_send() from inside wait_for_work(), via maybe_send_barrier(). If the "stop" signal was send just before that, drbd_send() would call flush_signals(), and we would run an unbounded schedule() afterwards. Fix: check for thread_state == RUNNING before we schedule() Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:50 +02:00
Lars Ellenberg	c7a58db4e9	drbd: get rid of atomic update on disk bitmap works Just trigger the occasional lazy bitmap write-out during resync from the central wait_for_work() helper. Previously, during resync, bitmap pages would be written out separately, synchronously, one at a time, at least 8 times each (every 512 bytes worth of bitmap cleared). Now we trigger "merge friendly" bulk write out of all cleared pages every two seconds during resync, and once the resync is finished. Most pages will be written out only once. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 18:34:49 +02:00
Lars Ellenberg	70df70927b	drbd: allow write-ordering policy to be bumped up again Previously, once you disabled flushes as a means of enforcing write-ordering, you'd need to detach/re-attach to enable them again. Allow drbdsetup disk-options to re-enable previously disabled write-ordering policy options at runtime. While at it fix RCU in drbd_bump_write_ordering() max_allowed_wo() uses rcu_dereference, therefore it must be called within rcu_read_lock()/rcu_read_unlock() Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 15:22:22 +02:00
Lars Ellenberg	44a4d55184	drbd: refactor use of first_peer_device() Reduce the number of calls to first_peer_device(). Instead, call first_peer_device() just once to assign a local variable peer_device. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 15:22:22 +02:00
Lars Ellenberg	35b5ed5bba	drbd: reduce number of spinlock drop/re-aquire cycles Instead of dropping and re-aquiring the spinlock around the submit, just remember that we want to submit, and do that only once we have dropped the spinlock for good. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 15:22:21 +02:00
Philipp Reisner	28995af5cf	drbd: rename drbd_free_bc() to drbd_free_ldev() Since the member of drbd_device is called ldev Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 15:22:21 +02:00
Philipp Reisner	8fe39aac05	drbd: device->ldev is not guaranteed on an D_ATTACHING disk Some parts of the code assumed that get_ldev_if_state(device, D_ATTACHING) is sufficient to access the ldev member of the device object. That was wrong. ldev may not be there or might be freed at any time if the device has a disk state of D_ATTACHING. bm_rw() Documented that drbd_bm_read() is only called from drbd_adm_attach. drbd_bm_write() is only called when a reference is held, and it is documented that a caller has to hold a reference before calling drbd_bm_write() drbd_bm_write_page() Use get_ldev() instead of get_ldev_if_state(device, D_ATTACHING) drbd_bmio_set_n_write() No longer use get_ldev_if_state(device, D_ATTACHING). All callers hold a reference to ldev now. drbd_bmio_clear_n_write() All callers where holding a reference of ldev anyways. Remove the misleading get_ldev_if_state(device, D_ATTACHING) drbd_reconsider_max_bio_size() Removed the get_ldev_if_state(device, D_ATTACHING). All callers now pass a struct drbd_backing_dev* when they have a proper reference, or a NULL pointer. Before this fix, the receiver could trigger a NULL pointer deref when in drbd_reconsider_max_bio_size() drbd_bump_write_ordering() Used get_ldev_if_state(device, D_ATTACHING) with the wrong assumption. Remove it, and allow the caller to pass in a struct drbd_backing_dev* when the caller knows that accessing this bdev is safe. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 15:22:20 +02:00
Philipp Reisner	e952658020	drbd: Move write_ordering from connection to resource Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2014-07-10 15:22:19 +02:00
Lars Ellenberg	bbc1c5e8ad	drbd: fix regression 'out of mem, failed to invoke fence-peer helper' Since linux kernel 3.13, kthread_run() internally uses wait_for_completion_killable(). We sometimes may use kthread_run() while we still have a signal pending, which we used to kick our threads out of potentially blocking network functions, causing kthread_run() to mistake that as a new fatal signal and fail. Fix: flush_signals() before kthread_run(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-10 11:06:03 +02:00
Ilya Dryomov	fbba11b3be	rbd: do not leak image_id in rbd_dev_v2_parent_info() image_id is leaked if the parent happens to have been recorded already. Fix it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:46 +04:00
Ilya Dryomov	76756a51e2	rbd: use rbd_obj_watch_request_helper() helper Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix rbd_dev_header_unwatch_sync() to destroy watch_request structures before queuing watch-remove message while at it. This mistake slipped into commit `b30a01f2a3` ("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()") and could lead to "image still in use" errors on image removal. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:45 +04:00
Ilya Dryomov	bb040aa03c	rbd: add rbd_obj_watch_request_helper() helper In the past, rbd_dev_header_watch_sync() used to handle both watch and unwatch requests and was entangled and leaky. Commit `b30a01f2a3` ("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()") split it into two separate functions. This commit cleanly abstracts the common bits, relying on the fixed rbd_obj_request_wait(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:45 +04:00
Ilya Dryomov	71c20a066f	rbd: rbd_obj_request_wait() should cancel the request if interrupted rbd_obj_request_wait() should cancel the underlying OSD request if interrupted. Otherwise libceph will hold onto it indefinitely, causing assert failures or leaking the original object request. This also adds an rbd wrapper around ceph_osdc_cancel_request() to match rbd_obj_request_submit() and rbd_obj_request_wait(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:45 +04:00
Minchan Kim	2e32baea46	zram: revalidate disk after capacity change Alexander reported mkswap on /dev/zram0 is failed if other process is opening the block device file. Step is as follows, 0. Reset the unused zram device. 1. Use a program that opens /dev/zram0 with O_RDWR and sleeps until killed. 2. While that program sleeps, echo the correct value to /sys/block/zram0/disksize. 3. Verify (e.g. in /proc/partitions) that the disk size is applied correctly. It is. 4. While that program still sleeps, attempt to mkswap /dev/zram0. This fails: mkswap: error: swap area needs to be at least 40 KiB When I investigated, the size get by ioctl(fd, BLKGETSIZE64, xxx) on mkswap to get a size of blockdev was zero although zram0 has right size by 2. The reason is zram didn't revalidate disk after changing capacity so that size of blockdev's inode is not uptodate until all of file is close. This patch should fix the BUG. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Alexander E. Patrakov <patrakov@gmail.com> Tested-by: Alexander E. Patrakov <patrakov@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-07-03 09:21:53 -07:00
Ming Lei	6a27b656fc	block: virtio-blk: support multi virt queues per virtio-blk device Firstly this patch supports more than one virtual queues for virtio-blk device. Secondly this patch maps the virtual queue to blk-mq's hardware queue. With this approach, both scalability and performance can be improved. Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:51:03 -06:00
Linus Torvalds	3493860c76	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A small collection of fixes/changes for the current series. This contains: - Removal of dead code from Gu Zheng. - Revert of two bad fixes that went in earlier in this round, marking things as __init that were not purely used from init. - A fix for blk_mq_start_hw_queue() using the __blk_mq_run_hw_queue(), which could place us wrongly. Make it use the non __ variant, which handles cases where we are called from the wrong CPU set. From me. - A fix for drbd, which allocates discard requests without room for the SCSI payload. From Lars Ellenberg. - A fix for user-after-free in the blkcg code from Tejun. - Addition of limiting gaps in SG lists, if the hardware needs it. This is the last pre-req patch for blk-mq to enable the full NVMe conversion. Could wait until 3.17, but it's simple enough so would be nice to have everything we need for the NVMe port in the 3.17 release. From me" * 'for-linus' of git://git.kernel.dk/linux-block: drbd: fix NULL pointer deref in blk_add_request_payload blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue() block: add support for limiting gaps in SG lists bio: remove unused macro bip_vec_idx() Revert "block: add __init to elv_register" Revert "block: add __init to blkcg_policy_register" blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t floppy: format block0 read error message properly	2014-06-26 13:06:13 -07:00
Lars Ellenberg	54ed4ed8f9	drbd: fix NULL pointer deref in blk_add_request_payload Discards don't have any payload. But the scsi layer still expects a bio_vec it can use internally, see sd_setup_discard_cmnd() and blk_add_request_payload(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-25 09:53:47 -06:00
Ilya Dryomov	9638556a27	rbd: handle parent_overlap on writes correctly The following check in rbd_img_obj_request_submit() rbd_dev->parent_overlap <= obj_request->img_offset allows the fall through to the non-layered write case even if both parent_overlap and obj_request->img_offset belong to the same RADOS object. This leads to data corruption, because the area to the left of parent_overlap ends up unconditionally zero-filled instead of being populated with parent data. Suppose we want to write 1M to offset 6M of image bar, which is a clone of foo@snap; object_size is 4M, parent_overlap is 5M: rbd_data.<id>.0000000000000001 ---------------------\|----------------------\|------------ \| should be copyup'ed \| should be zeroed out \| write ... ---------------------\|----------------------\|------------ 4M 5M 6M parent_overlap obj_request->img_offset 4..5M should be copyup'ed from foo, yet it is zero-filled, just like 5..6M is. Given that the only striping mode kernel client currently supports is chunking (i.e. stripe_unit == object_size, stripe_count == 1), round parent_overlap up to the next object boundary for the purposes of the overlap check. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-06-23 12:55:37 +04:00
Linus Torvalds	f1d702487b	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A smaller collection of fixes for the block core that would be nice to have in -rc2. This pull request contains: - Fixes for races in the wait/wakeup logic used in blk-mq from Alexander. No issues have been observed, but it is definitely a bit flakey currently. Alternatively, we may drop the cyclic wakeups going forward, but that needs more testing. - Some cleanups from Christoph. - Fix for an oops in null_blk if queue_mode=1 and softirq completions are used. From me. - A fix for a regression caused by the chunk size setting. It inadvertently used max_hw_sectors instead of max_sectors, which is incorrect, and causes hangs on btrfs multi-disk setups (where hw sectors apparently isn't set). From me. - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a recent addition as well, but it actually breaks blk-mq which relies on strict scheduling. If the workqueue power_efficient mode is turned on, this breaks blk-mq. From Matias. - null_blk module parameter description fix from Mike" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: bitmap tag: fix races in bt_get() function blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt blk-mq: bitmap tag: fix races on shared ::wake_index fields block: blk_max_size_offset() should check ->max_sectors null_blk: fix softirq completions for queue_mode == 1 blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue blk-mq: properly drain stopped queues block: remove WQ_POWER_EFFICIENT from kblockd null_blk: fix name and description of 'queue_mode' module parameter block: remove elv_abort_queue and blk_abort_flushes	2014-06-19 17:56:43 -10:00
Jens Axboe	e64d468773	Merge branch 'for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-linus	2014-06-18 10:30:22 -07:00
Jiri Kosina	1c65df3d7b	floppy: format block0 read error message properly In case reading of block 0 fails, line without trailing newline is printed causing dmesg to look horrible. Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-06-18 13:44:18 +02:00
Jens Axboe	d891fa7087	null_blk: fix softirq completions for queue_mode == 1 Only blk-mq completions have payload attached to the request, for request_fn mode we have stored it in req->special. This fixes an oops with queue_mode=1 and softirq completions. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-16 11:40:25 -06:00
Linus Torvalds	b55b390202	Merge git://git.infradead.org/users/willy/linux-nvme Pull NVMe update from Matthew Wilcox: "Mostly bugfixes again for the NVMe driver. I'd like to call out the exported tracepoint in the block layer; I believe Keith has cleared this with Jens. We've had a few reports from people who're really pounding on NVMe devices at scale, hence the timeout changes (and new module parameters), hotplug cpu deadlock, tracepoints, and minor performance tweaks" [ Jens hadn't seen that tracepoint thing, but is ok with it - it will end up going away when mq conversion happens ] * git://git.infradead.org/users/willy/linux-nvme: (22 commits) NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. NVMe: Use Log Page constants in SCSI emulation NVMe: Define Log Page constants NVMe: Fix hot cpu notification dead lock NVMe: Rename io_timeout to nvme_io_timeout NVMe: Use last bytes of f/w rev SCSI Inquiry NVMe: Adhere to request queue block accounting enable/disable NVMe: Fix nvme get/put queue semantics NVMe: Delete NVME_GET_FEAT_TEMP_THRESH NVMe: Make admin timeout a module parameter NVMe: Make iod bio timeout a parameter NVMe: Prevent possible NULL pointer dereference NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) NVMe: Update data structures for NVMe 1.2 NVMe: Enable BUILD_BUG_ON checks NVMe: Update namespace and controller identify structures to the 1.1a spec NVMe: Flush with data support NVMe: Configure support for block flush NVMe: Add tracepoints NVMe: Protect against badly formatted CQEs ...	2014-06-15 15:58:03 -10:00
Dan McLeran	b8e080847a	NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. This patch contains several fixes for Scsi START_STOP_UNIT. The previous code did not account for signed vs. unsigned arithmetic which resulted in an invalid lowest power state caculation when the device only supports 1 power state. The code for Power Condition == 2 (Idle) was not following the spec. The spec calls for setting the device to specific power states, depending upon Power Condition Modifier, without accounting for the number of power states supported by the device. The code for Power Condition == 3 (Standby) was using a hard-coded '0' which is replaced with the macro POWER_STATE_0. Signed-off-by: Dan McLeran <daniel.mcleran@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-13 13:11:00 -04:00
Matthew Wilcox	ef351b97de	NVMe: Use Log Page constants in SCSI emulation The nvme-scsi file defined its own Log Page constant. Use the newly-defined one from the header file instead. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-13 10:54:21 -04:00
Keith Busch	f3db22feb5	NVMe: Fix hot cpu notification dead lock There is a potential dead lock if a cpu event occurs during nvme probe since it registered with hot cpu notification. This fixes the race by having the module register with notification outside of probe rather than have each device register. The actual work is done in a scheduled work queue instead of in the notifier since assigning IO queues has the potential to block if the driver creates additional queues. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-13 10:43:34 -04:00
Linus Torvalds	6d87c225f5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This has a mix of bug fixes and cleanups. Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only. Naturally there are several other cleanups mixed in for good measure" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) rbd: only set disk to read-only once rbd: move calls that may sleep out of spin lock range rbd: add ioctl for rbd ceph: use truncate_pagecache() instead of truncate_inode_pages() ceph: include time stamp in every MDS request rbd: fix ida/idr memory leak rbd: use reference counts for image requests rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() rbd: make sure we have latest osdmap on 'rbd map' libceph: add ceph_monc_wait_osdmap() libceph: mon_get_version request infrastructure libceph: recognize poolop requests in debugfs ceph: refactor readpage_nounlock() to make the logic clearer mds: check cap ID when handling cap export message ceph: remember subtree root dirfrag's auth MDS ceph: introduce ceph_fill_fragtree() ceph: handle cap import atomically ceph: pre-allocate ceph_cap struct for ceph_add_cap() ceph: update inode fields according to issued caps rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO ...	2014-06-12 23:06:23 -07:00
Mike Snitzer	54ae81cd5a	null_blk: fix name and description of 'queue_mode' module parameter 'use_mq' is not the name of the module parameter, 'queue_mode' is. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-11 15:34:44 -06:00
Linus Torvalds	23d4ed53b7	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "Final small batch of fixes to be included before -rc1. Some general cleanups in here as well, but some of the blk-mq fixes we need for the NVMe conversion and/or scsi-mq. The pull request contains: - Support for not merging across a specified "chunk size", if set by the driver. Some NVMe devices perform poorly for IO that crosses such a chunk, so we need to support it generically as part of request merging avoid having to do complicated split logic. From me. - Bump max tag depth to 10Ki tags. Some scsi devices have a huge shared tag space. Before we failed with EINVAL if a too large tag depth was specified, now we truncate it and pass back the actual value. From me. - Various blk-mq rq init fixes from me and others. - A fix for enter on a dying queue for blk-mq from Keith. This is needed to prevent oopsing on hot device removal. - Fixup for blk-mq timer addition from Ming Lei. - Small round of performance fixes for mtip32xx from Sam Bradshaw. - Minor stack leak fix from Rickard Strandqvist. - Two __init annotations from Fabian Frederick" * 'for-linus' of git://git.kernel.dk/linux-block: block: add __init to blkcg_policy_register block: add __init to elv_register block: ensure that bio_add_page() always accepts a page for an empty bio blk-mq: add timer in blk_mq_start_request blk-mq: always initialize request->start_time block: blk-exec.c: Cleaning up local variable address returnd mtip32xx: minor performance enhancements blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init() blk-mq: don't allow queue entering for a dying queue blk-mq: bump max tag depth to 10K tags block: add blk_rq_set_block_pc() block: add notion of a chunk size for request merging	2014-06-11 08:41:17 -07:00
Josh Durgin	22001f619f	rbd: only set disk to read-only once rbd_open(), called every time the device is opened, calls set_device_ro(). There's no reason to set the device read-only or read-write every time it is opened. Just do this once during device setup, using set_disk_ro() instead because the struct block_device isn't available to us there. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-06-10 18:09:25 -07:00
Josh Durgin	77f33c0373	rbd: move calls that may sleep out of spin lock range get_user() and set_disk_ro() may allocate memory, leading to a potential deadlock if theye are called while a spin lock is held. Move the acquisition and release of rbd_dev->lock from rbd_ioctl() into rbd_ioctl_set_ro(), so it can occur between get_user() and set_disk_ro(). Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-06-10 18:09:25 -07:00
Guangliang Zhao	131fd9f6fc	rbd: add ioctl for rbd When running the following commands: [root@ceph0 mnt]# blockdev --setro /dev/rbd1 [root@ceph0 mnt]# blockdev --getro /dev/rbd1 0 The block setro didn't take effect, it is because the rbd doesn't support ioctl of block driver. This resolves: http://tracker.ceph.com/issues/6265 Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-06-10 18:09:24 -07:00
Hani Benhabiles	04cfac4e40	nbd: zero from and len fields in NBD_CMD_DISCONNECT. Len field is already set to zero, but not the from field which is sent as 0xfffffffffffffe00. This makes no sense, and may cause confuse server implementations doing sanity checks (qemu-nbd is an example.) Signed-off-by: Hani Benhabiles <hani@linux.com> Cc: Paul Clements <paul.clements@us.sios.com> Cc: Paul Clements <Paul.Clements@steeleye.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-06 16:08:18 -07:00
Sam Bradshaw	f45c40a92d	mtip32xx: minor performance enhancements This patch adds the following: 1) Compiler hinting in the fast path. 2) A prefetch of port->flags to eliminate moderate cpu stalling later in mtip_hw_submit_io(). 3) Eliminate a redundant rq_data_dir(). 4) Reorder members of driver_data to eliminate false cacheline sharing between irq_workers_active and unal_qdepth. With some workload and topology configurations, I'm seeing ~1.5% throughput improvement in small block random read benchmarks as well as improved latency std. dev. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Add include of <linux/prefetch.h> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 13:28:48 -06:00
Jens Axboe	f27b087b81	block: add blk_rq_set_block_pc() With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC, and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly. Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 07:57:37 -06:00
Ilya Dryomov	ffe312cf31	rbd: fix ida/idr memory leak ida_destroy() needs to be called on module exit to release ida caches. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-06-06 09:30:00 +08:00
Alex Elder	0f2d5be792	rbd: use reference counts for image requests Each image request contains a reference count, but to date it has not actually been used. (I think this was just an oversight.) A recent report involving rbd failing an assertion shed light on why and where we need to use these reference counts. Every OSD request associated with an object request uses rbd_osd_req_callback() as its callback function. That function will call a helper function (dependent on the type of OSD request) that will set the object request's "done" flag if the object request if appropriate. If that "done" flag is set, the object request is passed to rbd_obj_request_complete(). In rbd_obj_request_complete(), requests are processed in sequential order. So if an object request completes before one of its predecessors in the image request, the completion is deferred. Otherwise, if it's a completing object's "turn" to be completed, it is passed to rbd_img_obj_end_request(), which records the result of the operation, accumulates transferred bytes, and so on. Next, the successor to this request is checked and if it is marked "done", (deferred) completion processing is performed on that request, and so on. If the last object request in an image request is completed, rbd_img_request_complete() is called, which (typically) destroys the image request. There is a race here, however. The instant an object request is marked "done" it can be provided (by a thread handling completion of one of its predecessor operations) to rbd_img_obj_end_request(), which (for the last request) can then lead to the image request getting torn down. And this can happen before that object has itself entered rbd_img_obj_end_request(). As a result, once it does enter that function, the image request (and even the object request itself) may have been freed and become invalid. All that's necessary to avoid this is to properly count references to the image requests. We tear down an image request's object requests all at once--only when the entire image request has completed. So there's no need for an image request to count references for its object requests. However, we don't want an image request to go away until the last of its object requests has passed through rbd_img_obj_callback(). In other words, we don't want rbd_img_request_complete() to necessarily result in the image request being destroyed, because it may get called before we've finished processing on all of its object requests. So the fix is to add a reference to an image request for each of its object requests. The reference can be viewed as representing an object request that has not yet finished its call to rbd_img_obj_callback(). That is emphasized by getting the reference right after assigning that as the image object's callback function. The corresponding release of that reference is done at the end of rbd_img_obj_callback(), which every image object request passes through exactly once. Cc: stable@vger.kernel.org Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-06-06 09:29:59 +08:00
Ilya Dryomov	b30a01f2a3	rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() osd_request, along with r_request and r_reply messages attached to it are leaked in __rbd_dev_header_watch_sync() if the requested image doesn't exist. This is because lingering requests are special and get an extra ref in the reply path. Fix it by unregistering linger request on the error path and split __rbd_dev_header_watch_sync() into two functions to make it maintainable. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-06-06 09:29:59 +08:00
Ilya Dryomov	30ba1f0202	rbd: make sure we have latest osdmap on 'rbd map' Given an existing idle mapping (img1), mapping an image (img2) in a newly created pool (pool2) fails: $ ceph osd pool create pool1 8 8 $ rbd create --size 1000 pool1/img1 $ sudo rbd map pool1/img1 $ ceph osd pool create pool2 8 8 $ rbd create --size 1000 pool2/img2 $ sudo rbd map pool2/img2 rbd: sysfs write failed rbd: map failed: (2) No such file or directory This is because client instances are shared by default and we don't request an osdmap update when bumping a ref on an existing client. The fix is to use the mon_get_version request to see if the osdmap we have is the latest, and block until the requested update is received if it's not. Fixes: http://tracker.ceph.com/issues/8184 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-06-06 09:29:58 +08:00
Duan Jiong	461f758ac0	rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>	2014-06-06 09:29:51 +08:00
Linus Torvalds	00170fdd08	Merge branch 'akpm' (patchbomb from Andrew) into next Merge misc updates from Andrew Morton: - a few fixes for 3.16. Cc'ed to stable so they'll get there somehow. - various misc fixes and cleanups - most of the ocfs2 queue. Review is slow... - most of MM. The MM queue is pretty huge this time, but not much in the way of feature work. - some tweaks under kernel/ - printk maintenance work - updates to lib/ - checkpatch updates - tweaks to init/ * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (276 commits) fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul init/main.c: remove an ifdef kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND init/main.c: add initcall_blacklist kernel parameter init/main.c: don't use pr_debug() fs/binfmt_flat.c: make old_reloc() static fs/binfmt_elf.c: fix bool assignements fs/efs: convert printk(KERN_DEBUG to pr_debug fs/efs: add pr_fmt / use __func__ fs/efs: convert printk to pr_foo() scripts/checkpatch.pl: device_initcall is not the only __initcall substitute checkpatch: check stable email address checkpatch: warn on unnecessary void function return statements checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar); checkpatch: add warning for kmalloc/kzalloc with multiply checkpatch: warn on #defines ending in semicolon checkpatch: make --strict a default for files in drivers/net and net/ checkpatch: always warn on missing blank line after variable declaration block checkpatch: fix wildcard DT compatible string checking ...	2014-06-04 16:55:13 -07:00
Weijie Yang	38515c7339	zram: correct offset usage in zram_bio_discard We want to skip the physical block(PAGE_SIZE) which is partially covered by the discard bio, so we check the remaining size and subtract it if there is a need to goto the next physical block. The current offset usage in zram_bio_discard is incorrect, it will cause its upper filesystem breakdown. Consider the following scenario: On some architecture or config, PAGE_SIZE is 64K for example, filesystem is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a offset = 4K and size=72K, normally, it should not really discard any physical block as it partially cover two physical blocks. However, with the current offset usage, it will discard the second physical block and free its memory, which will cause filesystem breakdown. This patch corrects the offset usage in zram_bio_discard. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-04 16:54:13 -07:00
Matthew Wilcox	96f8d8e096	brd: return -ENOSPC rather than -ENOMEM on page allocation failure brd is effectively a thinly provisioned device. Thinly provisioned devices return -ENOSPC when they can't write a new block. -ENOMEM is an implementation detail that callers shouldn't know. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-04 16:54:02 -07:00
Matthew Wilcox	a72132c31d	brd: add support for rw_page() Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-04 16:54:02 -07:00
Jens Axboe	0e62f51f87	blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameter We currently pass in the hardware queue, and get the tags from there. But from scsi-mq, with a shared tag space, it's a lot more convenient to pass in the blk_mq_tags instead as the hardware queue isn't always directly available. So instead of having to re-map to a given hardware queue from rq->mq_ctx, just pass in the tags structure. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-04 10:23:49 -06:00
Matthew Wilcox	bd67608a61	NVMe: Rename io_timeout to nvme_io_timeout It's positively immoral to have a global variable called 'io_timeout'. Keep the module parameter called io_timeout, though. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 23:04:30 -04:00
Keith Busch	dedf4b1561	NVMe: Use last bytes of f/w rev SCSI Inquiry After skipping right-padded spaces, use the last four bytes of the firmware revision when reporting the Inquiry Product Revision. These are generally more indicative to what is running. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 22:59:01 -04:00
Sam Bradshaw	b4e75cbf13	NVMe: Adhere to request queue block accounting enable/disable Recently, a new sysfs control "iostats" was added to selectively enable or disable io statistics collection for request queues. This patch hooks that control. IO statistics collection is rather expensive on large, multi-node machines with drives pushing millions of iops. Having the ability to disable collection if not needed can improve throughput significantly. As a data point, on a quad E5-4640, I see more than 50% throughput improvement when io statistics accounting is disabled during heavily multi-threaded small block random read benchmarks where device performance is in the million iops+ range. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 22:58:54 -04:00
Keith Busch	a51afb5433	NVMe: Fix nvme get/put queue semantics The routines to get and lock nvme queues required the caller to "put" or "unlock" them even if getting one returned NULL. This patch fixes that. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 22:58:34 -04:00
Matthew Wilcox	de672b9748	NVMe: Delete NVME_GET_FEAT_TEMP_THRESH This define isn't used, and any code that wanted to use it should use NVME_FEAT_TEMP_THRESH instead. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 22:58:27 -04:00
Keith Busch	9d43cf646e	NVMe: Make admin timeout a module parameter Signed-off-by: Keith Busch <keith.busch@intel.com> [made admin_timeout static] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 22:57:56 -04:00
Keith Busch	61e4ce086d	NVMe: Make iod bio timeout a parameter This was originally set to 4 times the IO timeout, but that was when the IO timeout was 5 seconds instead of 30. 20 seconds for total time to failure seemed more reasonable than 2 minutes for most, but other users have requested to make this a module parameter instead. Signed-off-by: Keith Busch <keith.busch@intel.com> [renamed the module parameter to retry_time] [made retry_time static] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 22:56:43 -04:00
Linus Torvalds	c84a1e32ee	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull scheduler updates from Ingo Molnar: "The main scheduling related changes in this cycle were: - various sched/numa updates, for better performance - tree wide cleanup of open coded nice levels - nohz fix related to rq->nr_running use - cpuidle changes and continued consolidation to improve the kernel/sched/idle.c high level idle scheduling logic. As part of this effort I pulled cpuidle driver changes from Rafael as well. - standardized idle polling amongst architectures - continued work on preparing better power/energy aware scheduling - sched/rt updates - misc fixlets and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits) sched/numa: Decay ->wakee_flips instead of zeroing sched/numa: Update migrate_improves/degrades_locality() sched/numa: Allow task switch if load imbalance improves sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice() sched: Initialize rq->age_stamp on processor start sched, nohz: Change rq->nr_running to always use wrappers sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance() sched: Use clamp() and clamp_val() to make sys_nice() more readable sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups() sched/numa: Fix initialization of sched_domain_topology for NUMA sched: Call select_idle_sibling() when not affine_sd sched: Simplify return logic in sched_read_attr() sched: Simplify return logic in sched_copy_attr() sched: Fix exec_start/task_hot on migrated tasks arm64: Remove TIF_POLLING_NRFLAG metag: Remove TIF_POLLING_NRFLAG sched/idle: Make cpuidle_idle_call() void sched/idle: Reflow cpuidle_idle_call() sched/idle: Delay clearing the polling bit ...	2014-06-03 14:00:15 -07:00
Santosh Y	6808c5fb7f	NVMe: Prevent possible NULL pointer dereference kmalloc() used by the nvme_alloc_iod() to allocate memory for 'iod' can fail. So check the return value. Signed-off-by: Santosh Y <santosh.sy@samsung.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 16:43:30 -04:00
Indraneel Mukherjee	4131f2fcdc	NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) In GetLogPage the buffer size passed to device is a 0's based value. Signed-off-by: Indraneel M <indraneel.m@samsung.com> Reported-by: Shiro Itou <shiro.itou@outlook.com> Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-06-03 16:42:41 -04:00
Linus Torvalds	80081ec309	Merge branch 'for-3.16/drivers' of git://git.kernel.dk/linux-block into next Pull block driver changes from Jens Axboe: "Now that the core bits are in, here's the pull request for the driver related changes for 3.16. Nothing out of the ordinary here, mostly business as usual. There are a few pulls of for-3.16/core into this branch, which were done when the blk-mq was modified after the mtip32xx conversion was put in. The pull request contains: - skd and cciss converted to use pci_enable_msix_exact(). From Alexander Gordeev. - A few mtip32xx fixes from Asai @ Micron. - The conversion of mtip32xx from make_request_fn to blk-mq, and a later small fix for that conversion on quiescing for non-queued IO. From me. - A fix for bsg to use an exported function to check whether this driver is request based or not. Needed updating for blk-mq, which is request based, but does not have a request_fn hook. From me. - Small floppy bug fix from Jiri. - A series of cleanups for the cdrom uniform layer from Joe Perches. Gets rid of various old ugly macros, making the code conform more to the modern coding style. - A series of patches for drbd from the drbd crew (Lars Ellenberg and Philipp Reisner). - A use-after-free fix for null_blk from Ming Lei. - Also from Ming Lei is a performance patch for virtio-blk, which can net us a 3x win on kvm platforms where world notification is expensive. - Ming Lei also fixed a stall issue in virtio-blk, due to a race between queue start/stop and resource limits. - A small batch of fixes for xen-blk{back,front} from Olaf Hering and Valentin Priescu" * 'for-3.16/drivers' of git://git.kernel.dk/linux-block: (54 commits) block: virtio_blk: don't hold spin lock during world switch xen-blkback: defer freeing blkif to avoid blocking xenwatch xen blkif.h: fix comment typo in discard-alignment xen/blkback: disable discard feature if requested by toolstack xen-blkfront: remove type check from blkfront_setup_discard floppy: do not corrupt bio.bi_flags when reading block 0 mtip32xx: move error handling to service thread virtio_blk: fix race between start and stop queue mtip32xx: stop block hardware queues before quiescing IO mtip32xx: blk_mq_init_queue() returns an ERR_PTR mtip32xx: convert to use blk-mq cdrom: Remove unnecessary prototype for cdrom_get_disc_info cdrom: Remove unnecessary prototype for cdrom_mrw_exit cdrom: Remove cdrom_count_tracks prototype cdrom: Remove cdrom_get_next_writeable prototype cdrom: Remove cdrom_get_last_written prototype cdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype cdrom: Remove unnecessary sanitize_format prototype cdrom: Remove unnecessary check_for_audio_disc prototype cdrom: Remove prototype for open_for_data ...	2014-06-02 13:57:01 -07:00
Linus Torvalds	425553209b	PCI changes for the v3.16 merge window: Enumeration - Notify driver before and after device reset (Keith Busch) - Use reset notification in NVMe (Keith Busch) NUMA - Warn if we have to guess host bridge node information (Myron Stowe) - Work around AMD Fam15h BIOSes that fail to provide _PXM (Suravee Suthikulpanit) - Clean up and mark early_root_info_init() as deprecated (Suravee Suthikulpanit) Driver binding - Add "driver_override" for force specific binding (Alex Williamson) - Fail "new_id" addition for devices we already know about (Bandan Das) Resource management - Support BAR sizes up to 8GB (Nikhil Rao, Alan Cox) - Don't move IORESOURCE_PCI_FIXED resources (Bjorn Helgaas) - Mark SBx00 HPET BAR as IORESOURCE_PCI_FIXED (Bjorn Helgaas) - Fail safely if we can't handle BARs larger than 4GB (Bjorn Helgaas) - Reject BAR above 4GB if dma_addr_t is too small (Bjorn Helgaas) - Don't convert BAR address to resource if dma_addr_t is too small (Bjorn Helgaas) - Don't set BAR to zero if dma_addr_t is too small (Bjorn Helgaas) - Don't print anything while decoding is disabled (Bjorn Helgaas) - Don't add disabled subtractive decode bus resources (Bjorn Helgaas) - Add resource allocation comments (Bjorn Helgaas) - Restrict 64-bit prefetchable bridge windows to 64-bit resources (Yinghai Lu) - Assign i82875p_edac PCI resources before adding device (Yinghai Lu) PCI device hotplug - Remove unnecessary "dev->bus" test (Bjorn Helgaas) - Use PCI_EXP_SLTCAP_PSN define (Bjorn Helgaas) - Fix rphahp endianess issues (Laurent Dufour) - Acknowledge spurious "cmd completed" event (Rajat Jain) - Allow hotplug service drivers to operate in polling mode (Rajat Jain) - Fix cpqphp possible NULL dereference (Rickard Strandqvist) MSI - Replace pci_enable_msi_block() by pci_enable_msi_exact() (Alexander Gordeev) - Replace pci_enable_msix() by pci_enable_msix_exact() (Alexander Gordeev) - Simplify populate_msi_sysfs() (Jan Beulich) Virtualization - Add Intel Patsburg (X79) root port ACS quirk (Alex Williamson) - Mark RTL8110SC INTx masking as broken (Alex Williamson) Generic host bridge driver - Add generic PCI host controller driver (Will Deacon) Freescale i.MX6 - Use new clock names (Lucas Stach) - Drop old IRQ mapping (Lucas Stach) - Remove optional (and unused) IRQs (Lucas Stach) - Add support for MSI (Lucas Stach) - Fix imx6_add_pcie_port() section mismatch warning (Sachin Kamat) Renesas R-Car - Add gen2 device tree support (Ben Dooks) - Use new OF interrupt mapping when possible (Lucas Stach) - Add PCIe driver (Phil Edworthy) - Add PCIe MSI support (Phil Edworthy) - Add PCIe device tree bindings (Phil Edworthy) Samsung Exynos - Remove unnecessary OOM messages (Jingoo Han) - Fix add_pcie_port() section mismatch warning (Sachin Kamat) Synopsys DesignWare - Make MSI ISR shared IRQ aware (Lucas Stach) Miscellaneous - Check for broken config space aliasing (Alex Williamson) - Update email address (Ben Hutchings) - Fix Broadcom CNB20LE unintended sign extension (Bjorn Helgaas) - Fix incorrect vgaarb conditional in WARN_ON() (Bjorn Helgaas) - Remove unnecessary __ref annotations (Bjorn Helgaas) - Add arch/x86/kernel/quirks.c to MAINTAINERS PCI file patterns (Bjorn Helgaas) - Fix use of uninitialized MPS value (Bjorn Helgaas) - Tidy x86/gart messages (Bjorn Helgaas) - Fix return value from pci_user_{read,write}_config_() (Gavin Shan) - Turn pcibios_penalize_isa_irq() into a weak function (Hanjun Guo) - Remove unused serial device IDs (Jean Delvare) - Use designated initialization in PCI_VDEVICE (Mark Rustad) - Fix powerpc NULL dereference in pci_root_buses traversal (Mike Qiu) - Configure MPS on ARM (Murali Karicheri) - Remove unnecessary includes of <linux/init.h> (Paul Gortmaker) - Move Open Firmware devspec attribute to PCI common code (Sebastian Ott) - Use pdev->dev.groups for attribute creation on s390 (Sebastian Ott) - Remove pcibios_add_platform_entries() (Sebastian Ott) - Add new ID for Intel GPU "spurious interrupt" quirk (Thomas Jarosch) - Rename pci_is_bridge() to pci_has_subordinate() (Yijing Wang) - Add and use new pci_is_bridge() interface (Yijing Wang) - Make pci_bus_add_device() void (Yijing Wang) DMA API - Clarify physical/bus address distinction in docs (Bjorn Helgaas) - Fix typos in docs (Emilio López) - Update dma_pool_create ()and dma_pool_alloc() descriptions (Gioh Kim) - Change dma_declare_coherent_memory() CPU address to phys_addr_t (Bjorn Helgaas) - Pass GAPSPCI_DMA_BASE CPU & bus address to dma_declare_coherent_memory() (Bjorn Helgaas) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJTjMaQAAoJEFmIoMA60/r8XncQAKX7cD6btXCZnrcYo7inseyp 3rwOlrsNkWyHqSj/RqqzE1NY6L1h5G2uliI6xg1SKenuHPDcosm5d8FYO0ORKiUs xrqBkmZJHXN7fck//tJwsTXiYh5u42RO8QWbvZVr5UqXe40LyaMHMh9Y7VarrU/o sM2ADzFKagv1qMQ13nmYxqT+Zl+CqpimyLP+ep6Nfqxi6ils+KJ6b9SKYqrpqE6t Mcq2K5ShqU5SaYub1JIXLcQ9XylID+t1M9+cwixcs7a87HbJiktfkGqQvQJoUIuK Q5U+abcIGk4vfOnDCctSnoRyrcbTAZ/vqfo0vpX22TokESjwrD8hFOX5HPOFtD+4 wIDbYurW/8oBhLRaJ0uTPzSH8bXjXTynAwxHZgIrEur5908eECKQ/WiFCxyrovvv r4ThAN0FaobllEr0XOFESOzDNSt/ME00WWI7+puAJ/KJkFEtcXt9othLmLmvLz8H 2GWXrm/aOR0WUO7foGUxI3bXYlDN6NbSKpfuZsLAi2VAyJJ6L6yVSo/fT0X07e3z qRy9LOohuiwIKv/I4F2SEq2REfGGsnkrJBoeQi/oBZDcBy1Lsi7P9LWIERhLQEM+ Hm+30lC/f326nI3hoyThj2k2xxZOQzCIvrt658xP4qd9Zfe1bvCH58FF8K62CoOd p8XAf7Sl6v6YUodUrT/t =km55 -----END PGP SIGNATURE----- Merge tag 'pci-v3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into next Pull PCI changes from Bjorn Helgaas: "Enumeration - Notify driver before and after device reset (Keith Busch) - Use reset notification in NVMe (Keith Busch) NUMA - Warn if we have to guess host bridge node information (Myron Stowe) - Work around AMD Fam15h BIOSes that fail to provide _PXM (Suravee Suthikulpanit) - Clean up and mark early_root_info_init() as deprecated (Suravee Suthikulpanit) Driver binding - Add "driver_override" for force specific binding (Alex Williamson) - Fail "new_id" addition for devices we already know about (Bandan Das) Resource management - Support BAR sizes up to 8GB (Nikhil Rao, Alan Cox) - Don't move IORESOURCE_PCI_FIXED resources (Bjorn Helgaas) - Mark SBx00 HPET BAR as IORESOURCE_PCI_FIXED (Bjorn Helgaas) - Fail safely if we can't handle BARs larger than 4GB (Bjorn Helgaas) - Reject BAR above 4GB if dma_addr_t is too small (Bjorn Helgaas) - Don't convert BAR address to resource if dma_addr_t is too small (Bjorn Helgaas) - Don't set BAR to zero if dma_addr_t is too small (Bjorn Helgaas) - Don't print anything while decoding is disabled (Bjorn Helgaas) - Don't add disabled subtractive decode bus resources (Bjorn Helgaas) - Add resource allocation comments (Bjorn Helgaas) - Restrict 64-bit prefetchable bridge windows to 64-bit resources (Yinghai Lu) - Assign i82875p_edac PCI resources before adding device (Yinghai Lu) PCI device hotplug - Remove unnecessary "dev->bus" test (Bjorn Helgaas) - Use PCI_EXP_SLTCAP_PSN define (Bjorn Helgaas) - Fix rphahp endianess issues (Laurent Dufour) - Acknowledge spurious "cmd completed" event (Rajat Jain) - Allow hotplug service drivers to operate in polling mode (Rajat Jain) - Fix cpqphp possible NULL dereference (Rickard Strandqvist) MSI - Replace pci_enable_msi_block() by pci_enable_msi_exact() (Alexander Gordeev) - Replace pci_enable_msix() by pci_enable_msix_exact() (Alexander Gordeev) - Simplify populate_msi_sysfs() (Jan Beulich) Virtualization - Add Intel Patsburg (X79) root port ACS quirk (Alex Williamson) - Mark RTL8110SC INTx masking as broken (Alex Williamson) Generic host bridge driver - Add generic PCI host controller driver (Will Deacon) Freescale i.MX6 - Use new clock names (Lucas Stach) - Drop old IRQ mapping (Lucas Stach) - Remove optional (and unused) IRQs (Lucas Stach) - Add support for MSI (Lucas Stach) - Fix imx6_add_pcie_port() section mismatch warning (Sachin Kamat) Renesas R-Car - Add gen2 device tree support (Ben Dooks) - Use new OF interrupt mapping when possible (Lucas Stach) - Add PCIe driver (Phil Edworthy) - Add PCIe MSI support (Phil Edworthy) - Add PCIe device tree bindings (Phil Edworthy) Samsung Exynos - Remove unnecessary OOM messages (Jingoo Han) - Fix add_pcie_port() section mismatch warning (Sachin Kamat) Synopsys DesignWare - Make MSI ISR shared IRQ aware (Lucas Stach) Miscellaneous - Check for broken config space aliasing (Alex Williamson) - Update email address (Ben Hutchings) - Fix Broadcom CNB20LE unintended sign extension (Bjorn Helgaas) - Fix incorrect vgaarb conditional in WARN_ON() (Bjorn Helgaas) - Remove unnecessary __ref annotations (Bjorn Helgaas) - Add arch/x86/kernel/quirks.c to MAINTAINERS PCI file patterns (Bjorn Helgaas) - Fix use of uninitialized MPS value (Bjorn Helgaas) - Tidy x86/gart messages (Bjorn Helgaas) - Fix return value from pci_user_{read,write}_config_() (Gavin Shan) - Turn pcibios_penalize_isa_irq() into a weak function (Hanjun Guo) - Remove unused serial device IDs (Jean Delvare) - Use designated initialization in PCI_VDEVICE (Mark Rustad) - Fix powerpc NULL dereference in pci_root_buses traversal (Mike Qiu) - Configure MPS on ARM (Murali Karicheri) - Remove unnecessary includes of <linux/init.h> (Paul Gortmaker) - Move Open Firmware devspec attribute to PCI common code (Sebastian Ott) - Use pdev->dev.groups for attribute creation on s390 (Sebastian Ott) - Remove pcibios_add_platform_entries() (Sebastian Ott) - Add new ID for Intel GPU "spurious interrupt" quirk (Thomas Jarosch) - Rename pci_is_bridge() to pci_has_subordinate() (Yijing Wang) - Add and use new pci_is_bridge() interface (Yijing Wang) - Make pci_bus_add_device() void (Yijing Wang) DMA API - Clarify physical/bus address distinction in docs (Bjorn Helgaas) - Fix typos in docs (Emilio López) - Update dma_pool_create ()and dma_pool_alloc() descriptions (Gioh Kim) - Change dma_declare_coherent_memory() CPU address to phys_addr_t (Bjorn Helgaas) - Pass GAPSPCI_DMA_BASE CPU & bus address to dma_declare_coherent_memory() (Bjorn Helgaas)" * tag 'pci-v3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (92 commits) MAINTAINERS: Add generic PCI host controller driver PCI: generic: Add generic PCI host controller driver PCI: imx6: Add support for MSI PCI: designware: Make MSI ISR shared IRQ aware PCI: imx6: Remove optional (and unused) IRQs PCI: imx6: Drop old IRQ mapping PCI: imx6: Use new clock names i82875p_edac: Assign PCI resources before adding device ARM/PCI: Call pcie_bus_configure_settings() to set MPS PCI: imx6: Fix imx6_add_pcie_port() section mismatch warning PCI: Make pci_bus_add_device() void PCI: exynos: Fix add_pcie_port() section mismatch warning PCI: Introduce new device binding path using pci_dev.driver_override PCI: rcar: Add gen2 device tree support PCI: cpqphp: Fix possible null pointer dereference PCI: rcar: Add R-Car PCIe device tree bindings PCI: rcar: Add MSI support for PCIe PCI: rcar: Add Renesas R-Car PCIe driver PCI: Fix return value from pci_user_{read,write}_config_*() PCI: exynos: Remove unnecessary OOM messages ...	2014-06-02 12:15:19 -07:00
Linus Torvalds	681a289548	Merge branch 'for-3.16/core' of git://git.kernel.dk/linux-block into next Pull block core updates from Jens Axboe: "It's a big(ish) round this time, lots of development effort has gone into blk-mq in the last 3 months. Generally we're heading to where 3.16 will be a feature complete and performant blk-mq. scsi-mq is progressing nicely and will hopefully be in 3.17. A nvme port is in progress, and the Micron pci-e flash driver, mtip32xx, is converted and will be sent in with the driver pull request for 3.16. This pull request contains: - Lots of prep and support patches for scsi-mq have been integrated. All from Christoph. - API and code cleanups for blk-mq from Christoph. - Lots of good corner case and error handling cleanup fixes for blk-mq from Ming Lei. - A flew of blk-mq updates from me: * Provide strict mappings so that the driver can rely on the CPU to queue mapping. This enables optimizations in the driver. * Provided a bitmap tagging instead of percpu_ida, which never really worked well for blk-mq. percpu_ida relies on the fact that we have a lot more tags available than we really need, it fails miserably for cases where we exhaust (or are close to exhausting) the tag space. * Provide sane support for shared tag maps, as utilized by scsi-mq * Various fixes for IO timeouts. * API cleanups, and lots of perf tweaks and optimizations. - Remove 'buffer' from struct request. This is ancient code, from when requests were always virtually mapped. Kill it, to reclaim some space in struct request. From me. - Remove 'magic' from blk_plug. Since we store these on the stack and since we've never caught any actual bugs with this, lets just get rid of it. From me. - Only call part_in_flight() once for IO completion, as includes two atomic reads. Hopefully we'll get a better implementation soon, as the part IO stats are now one of the more expensive parts of doing IO on blk-mq. From me. - File migration of block code from {mm,fs}/ to block/. This includes bio.c, bio-integrity.c, bounce.c, and ioprio.c. From me, from a discussion on lkml. That should describe the meat of the pull request. Also has various little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong, Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam Bradshaw" * 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits) blk-mq: push IPI or local end_io decision to __blk_mq_complete_request() blk-mq: remember to start timeout handler for direct queue block: ensure that the timer is always added blk-mq: blk_mq_unregister_hctx() can be static blk-mq: make the sysfs mq/ layout reflect current mappings blk-mq: blk_mq_tag_to_rq should handle flush request block: remove dead code in scsi_ioctl:blk_verify_command blk-mq: request initialization optimizations block: add queue flag for disabling SG merging block: remove 'magic' from struct blk_plug blk-mq: remove alloc_hctx and free_hctx methods blk-mq: add file comments and update copyright notices blk-mq: remove blk_mq_alloc_request_pinned blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request blk-mq: remove blk_mq_wait_for_tags blk-mq: initialize request in __blk_mq_alloc_request blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request blk-mq: add helper to insert requests from irq context blk-mq: remove stale comment for blk_mq_complete_request() blk-mq: allow non-softirq completions ...	2014-06-02 09:29:34 -07:00
Linus Torvalds	5d70dacd4e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k into next Pull m68k updates from Geert Uytterhoeven: "Highlights: - support for running kernels in fast TT-RAM instead of slow ST-RAM on Atari - multi-platform EARLY_PRINTK - better support for machines with lots of RAM (think ARAnyM), and for running kernels larger than 4 MiB (think multi-platform)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k/hp300: Convert printk to pr_foo() m68k/apollo: Convert printk to pr_foo() m68k/amiga: Convert printk(foo to pr_foo() m68k: Increase initial mapping to 8 or 16 MiB if possible m68k: Update defconfigs for v3.15-rc2 m68k/atari: fix SCC initialization for debug console m68k/mvme16x: Adopt common boot console m68k: Multi-platform EARLY_PRINTK m68k: Toward platform agnostic framebuffer debug logging m68k/atari - atari_scsi: use correct virt/phys translation for DMA buffer m68k/atari - ataflop: use correct virt/phys translation for DMA buffer m68k/atari - atafb: convert allocation of fb ram to new interface m68k/atari - stram: alloc ST-RAM pool even if kernel not in ST-RAM	2014-06-02 08:03:34 -07:00
Ming Lei	e8edca6f7f	block: virtio_blk: don't hold spin lock during world switch Firstly, it isn't necessary to hold lock of vblk->vq_lock when notifying hypervisor about queued I/O. Secondly, virtqueue_notify() will cause world switch and it may take long time on some hypervisors(such as, qemu-arm), so it isn't good to hold the lock and block other vCPUs. On arm64 quad core VM(qemu-kvm), the patch can increase I/O performance a lot with VIRTIO_RING_F_EVENT_IDX enabled: - without the patch: 14K IOPS - with the patch: 34K IOPS fio script: [global] direct=1 bsrange=4k-4k timeout=10 numjobs=4 ioengine=libaio iodepth=64 filename=/dev/vdc group_reporting=1 [f1] rw=randread Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org # 3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:19:39 -06:00
Jens Axboe	f89ca16646	Merge branch 'for-3.16/core' into for-3.16/drivers Pulled in for the blk_mq_tag_to_rq() change, which impacts mtip32xx. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:11:50 -06:00
Jens Axboe	879466e6a5	Merge branch 'stable/for-jens-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into for-3.16/drivers Konrad writes: Please git pull the following branch: git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git stable/for-jens-3.16 which has a bunch of fixes to the Xen block frontend and backend driver and a new parameter for Xen backend driver - an override (set by the toolstack) whether to expose the discard support (if disk of course supports it) or not.	2014-05-28 12:37:04 -06:00
Valentin Priescu	814d04e7df	xen-blkback: defer freeing blkif to avoid blocking xenwatch Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O requests to finish. If the VBD is attached to a hot-swappable disk, then xenwatch can hang for a long period of time, stalling other watches. INFO: task xenwatch:39 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782 ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040 ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040 Call Trace: [<ffffffff810b0782>] ? irq_to_desc+0x12/0x20 [<ffffffff8128f761>] ? list_del+0x11/0x40 [<ffffffff8147a080>] ? wait_for_common+0x60/0x160 [<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50 [<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20 [<ffffffff8147a26a>] schedule+0x3a/0x60 [<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback] [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40 [<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback] [<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90 [<ffffffff81345cb7>] __device_release_driver+0x77/0xd0 [<ffffffff81346488>] device_release_driver+0x28/0x40 [<ffffffff813456e8>] bus_remove_device+0x78/0xe0 [<ffffffff81342c9f>] device_del+0x12f/0x1a0 [<ffffffff81342d2d>] device_unregister+0x1d/0x60 [<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback] [<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback] [<ffffffff8130ae50>] ? cmp_dev+0x60/0x60 [<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0 [<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120 [<ffffffff8130b4cb>] frontend_changed+0xb/0x10 [<ffffffff81309c82>] xenwatch_thread+0xf2/0x130 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40 [<ffffffff81309b90>] ? xenbus_directory+0x80/0x80 [<ffffffff810799d6>] kthread+0x96/0xa0 [<ffffffff81485934>] kernel_thread_helper+0x4/0x10 [<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6 [<ffffffff81485930>] ? gs_change+0x13/0x13 With this patch, when there is still pending I/O, the actual disconnect is done by the last reference holder (last pending I/O request). In this case, xenwatch doesn't block indefinitely. Signed-off-by: Valentin Priescu <priescuv@amazon.com> Reviewed-by: Steven Kady <stevkady@amazon.com> Reviewed-by: Steven Noonan <snoonan@amazon.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-05-28 14:17:32 -04:00
Olaf Hering	c926b701fe	xen/blkback: disable discard feature if requested by toolstack Newer toolstacks may provide a boolean property "discard-enable" in the backend node. Its purpose is to disable discard for file backed storage to avoid fragmentation. Recognize this setting also for physical storage. If that property exists and is false, do not advertise "feature-discard" to the frontend. Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-05-28 14:17:28 -04:00
Olaf Hering	1c8cad6c1b	xen-blkfront: remove type check from blkfront_setup_discard In its initial implementation a check for "type" was added, but only phy and file are handled. This breaks advertised discard support for other type values such as qdisk. Fix and simplify this function: If the backend advertises discard support it is supposed to implement it properly, so enable feature_discard unconditionally. If the backend advertises the need for a certain granularity and alignment then propagate both properties to the blocklayer. The discard-secure property is a boolean, update the code to reflect that. Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2014-05-28 14:17:26 -04:00
Jens Axboe	0fb662e225	Merge branch 'for-3.16/core' into for-3.16/drivers Pull in core changes (again), since we got rid of the alloc/free hctx mq_ops hooks and mtip32xx then needed updating again. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:18:51 -06:00
Christoph Hellwig	cdef54dd85	blk-mq: remove alloc_hctx and free_hctx methods There is no need for drivers to control hardware context allocation now that we do the context to node mapping in common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:18:31 -06:00
Jens Axboe	6178976500	Merge branch 'for-3.16/core' into for-3.16/drivers mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the core changes so we have a properly merged end result. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:50:26 -06:00
Jiri Kosina	6314a108ec	floppy: do not corrupt bio.bi_flags when reading block 0 Commit `41a55b4de3` ("floppy: silence warning during disk test") caused bio.bi_flags being overwritten, and its initialization to BIO_UPTODATE in bio_init() to be lost. This was unnoticed until `7b7b68bba5` ("floppy: bail out in open() if drive is not responding to block0 read"), because the error value wasn't checked for in the bio completion callback. Now we are actually looking at the error, and the loss of BIO_UPTODATE causes EIO to be wrongly passed to the callback, which confuses the FD_OPEN_SHOULD_FAIL_BIT logic. Fix this by not destroying previous value of bi_flags when setting BIO_QUIET. Cc: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-05-28 14:56:52 +02:00
Jens Axboe	f14bbe77a9	blk-mq: pass in suggested NUMA node to ->alloc_hctx() Drivers currently have to figure this out on their own, and they are missing information to do it properly. The ones that did attempt to do it, do it wrong. So just pass in the suggested node directly to the alloc function. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 12:06:53 -06:00
Keith Busch	f0d54a5417	NVMe: Implement PCIe reset notification callback Quiesce and shutdown the device prior to reset, then restart the device and resume IO after. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>	2014-05-27 11:12:29 -06:00
Ming Lei	aa0818c6ee	virtio_blk: fix race between start and stop queue When there isn't enough vring descriptor for adding to vq, blk-mq will be put as stopped state until some of pending descriptors are completed & freed. Unfortunately, the vq's interrupt may come just before blk-mq's BLK_MQ_S_STOPPED flag is set, so the blk-mq will still be kept as stopped even though lots of descriptors are completed and freed in the interrupt handler. The worst case is that all pending descriptors are freed in the interrupt handler, and the queue is kept as stopped forever. This patch fixes the problem by starting/stopping blk-mq with holding vq_lock. Cc: Jens Axboe <axboe@kernel.dk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: drivers/block/virtio_blk.c	2014-05-27 08:41:10 -06:00
Michael Schmitz	a4de73fbcf	m68k/atari - ataflop: use correct virt/phys translation for DMA buffer With the kernel running from FastRAM instead of ST-RAM, none of ST-RAM is mapped by mem_init, and DMA-addressable buffer must be mapped by ioremap. Use platform specific virt/phys translation helpers for this case. Signed-off-by: Michael Schmitz <schmitz@debian.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>	2014-05-26 22:41:24 +02:00
Ingo Molnar	65c2ce7004	Linux 3.15-rc6 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJTfR2zAAoJEHm+PkMAQRiG3noH/2s+KUge3qO2M+AmxttUo74B +npAMdbqYR3MdEiwxYZfsHcMu4Ye/IKLcrh4pydB5hI2mdjtGkH1bnmia0f1ve/c Z/a0256+W8gWp7mcUBqSNztqLPAWa7wKOqNdLjj5idr1BSj6u8im+fQ9FBh2woki 1fyYAuq/60lq4CMOKJvkA95V1Ome/jO+8tS4PguOgsCETQxCVFGurZcBbG3Mx5Y3 v+ioCqeRc6GvxPFR6YngnTZCrsLxSRT3tnO2Qy5zX7dxjIQkCEbvIckpBQv01Y3R wNUaX+2Jae207igxrEv8CjmCFnmZFuUI15aWWCy6fOS/j8bjuk6ThYJO8N4ZBM0= =2ShG -----END PGP SIGNATURE----- Merge tag 'v3.15-rc6' into sched/core, to pick up the latest fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-05-22 10:28:56 +02:00
Asai Thambi S P	9b204fbf09	mtip32xx: move error handling to service thread Move error handling to service thread, and use mtip_set_timeout() to set timeouts for HDIO_DRIVE_TASK and HDIO_DRIVE_CMD IOCTL commands. Signed-off-by: Selvan Mani <smani@micron.com> Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 11:51:30 -06:00
Ming Lei	0c29e93eae	virtio_blk: fix race between start and stop queue When there isn't enough vring descriptor for adding to vq, blk-mq will be put as stopped state until some of pending descriptors are completed & freed. Unfortunately, the vq's interrupt may come just before blk-mq's BLK_MQ_S_STOPPED flag is set, so the blk-mq will still be kept as stopped even though lots of descriptors are completed and freed in the interrupt handler. The worst case is that all pending descriptors are freed in the interrupt handler, and the queue is kept as stopped forever. This patch fixes the problem by starting/stopping blk-mq with holding vq_lock. Cc: Jens Axboe <axboe@kernel.dk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-16 09:40:31 -06:00
Jens Axboe	9acf03cfb1	mtip32xx: stop block hardware queues before quiescing IO We need to stop the block layer queues to prevent new "normal" IO from entering the driver, while we wait for existing commands to finish. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-14 08:22:56 -06:00
Dan Carpenter	a8a642ccd2	mtip32xx: blk_mq_init_queue() returns an ERR_PTR We changed this from blk_alloc_queue_node() to blk_mq_init_queue() so the check needs to be updated as well. Fixes: `ffc771b3ca` ('mtip32xx: convert to use blk-mq') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-14 08:05:44 -06:00
Jens Axboe	ffc771b3ca	mtip32xx: convert to use blk-mq This rips out timeout handling, requeueing, etc in converting it to use blk-mq instead. Acked-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-13 19:51:22 -06:00
Matthew Wilcox	21bd78bcf4	NVMe: Enable BUILD_BUG_ON checks Since _nvme_check_size() wasn't being called from anywhere, the compiler was optimising it away ... along with all the link-time build failures that would result if any of the structures were the wrong size. Call it from nvme_exit() for no particular reason. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-09 22:42:39 -04:00
Ingo Molnar	2fe5de9ce7	Merge branch 'sched/urgent' into sched/core, to avoid conflicts Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-05-07 13:15:46 +02:00
Keith Busch	53562be74b	NVMe: Flush with data support It is possible a filesystem may send a flush flagged bio with write data. There is no such composite NVMe command, so the driver sends flush and write separately. The device is allowed to execute these commands in any order, so it was possible the driver ends the bio after the write completes, but while the flush is still active. We don't want to let a filesystem believe flush succeeded before it really has; this could cause data corruption on a power loss between these events. To fix, this patch splits the flush and write into chained bios. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-05 10:54:02 -04:00
Keith Busch	a7d2ce2832	NVMe: Configure support for block flush This configures an nvme request_queue as flush capable if the device has a volatile write cache present. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-05 10:53:53 -04:00
Matthew Daley	2145e15e05	floppy: don't write kernel-only members to FDRAWCMD ioctl output Do not leak kernel-only floppy_raw_cmd structure members to userspace. This includes the linked-list pointer and the pointer to the allocated DMA space. Signed-off-by: Matthew Daley <mattd@bugfuzz.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-05-05 07:46:56 -07:00
Matthew Daley	ef87dbe761	floppy: ignore kernel-only members in FDRAWCMD ioctl input Always clear out these floppy_raw_cmd struct members after copying the entire structure from userspace so that the in-kernel version is always valid and never left in an interdeterminate state. Signed-off-by: Matthew Daley <mattd@bugfuzz.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-05-05 07:46:55 -07:00
Keith Busch	3291fa57cb	NVMe: Add tracepoints Adding tracepoints for bio_complete and block_split into nvme to help with gathering IO info using blktrace and blkparse. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-05 10:41:26 -04:00
Keith Busch	94bbac4052	NVMe: Protect against badly formatted CQEs If a misbehaving device posts a CQE with a command id < depth but for one that was never allocated, the command info will have a callback function set to NULL and we don't want to try invoking that. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-05 10:41:25 -04:00
Matthew Wilcox	27e8166c31	NVMe: Improve error messages Help people diagnose what is going wrong at initialisation time by printing out which command has gone wrong and what the device returned. Also fix the error message printed while waiting for reset. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Keith Busch <keith.busch@intel.com>	2014-05-05 10:41:25 -04:00
Matthew Wilcox	8757ad65d3	NVMe: Update copyright headers Make the copyright dates accurate and remove the final paragraph that includes the address of the FSF. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-05 10:41:25 -04:00
Ming Lei	fc27691f35	block: null_blk: fix use after free entry(cmd->ll_list) may belong to new request once end_cmd() returns, so fix the bug with the patch. Without the change, it is easy to observe oops when doing null_blk(timer) test. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-01 09:17:41 -06:00
Lars Ellenberg	ec4a340789	drbd: use list_first_entry_or_null in first_peer_device/first_connection If there are no peer_devices or connections, I'd rather have NULL than some "arbitrary" address pretending to point to a struct. Helps to avoid hard to debug symptoms, in case we ever try to use and dereference a drbd_connection or drbd_peer_device where we in fact don't have any connection at all. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 13:46:56 -06:00
Philipp Reisner	babea49ebe	drbd: Allow attaching of a newly created device to any backing device A newly created device was never exposed before, i.e. has a exposed_data_uuid of 0. Then it is valid to attach to any current_uuid of a backing device (of course also to a newly created one (4)) Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 13:46:56 -06:00
Philipp Reisner	02df6fe145	drbd: Test cstate while holding req_lock In case a connection transitions into C_TIMEOUT within the timer function (request_timer_fn()) we need to make sure that the receiver thread (potentially running on a different CPU) sees the updated cstate later on. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 13:46:56 -06:00

... 6 7 8 9 10 ...

5018 Commits