linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2025-02-20 03:06:57 +07:00

Author	SHA1	Message	Date
Christoph Hellwig	8f000cac6e	nvmet-rdma: add a NVMe over Fabrics RDMA target driver This patch implements the RDMA transport for the NVMe over Fabrics target, which allows exporting NVMe over Fabrics functionality over RDMA fabrics (Infiniband, RoCE, iWARP). All NVMe logic is in the generic target and this module just provides a small glue between it and the generic code in the RDMA subsystem. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>, Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-08 08:38:49 -06:00
Sagi Grimberg	931a6de4d7	nvme-rdma.h: Add includes for nvme rdma_cm negotiation NVMe over Fabrics RDMA transport defines a connection establishment protocol over the RDMA connection manager. This header will be used by both the host and target drivers to negotiate the connection establishment parameters. Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-08 08:38:49 -06:00
Christoph Hellwig	def61eca96	nvme: add new reconnecting controller state The nvme fabric (RDMA, FC, etc...) can introduce port, link or node failures that may require a reconnect to re-establish the connection. Add a new reconnecting state that will initially be used by the RDMA driver. Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-08 08:38:49 -06:00
Sagi Grimberg	486cf9899e	blk-mq: Introduce blk_mq_reinit_tagset The new nvme-rdma driver will need to reinitialize all the tags as part of the error recovery procedure (realloc the tag memory region). Add a helper in blk-mq for it that can iterate over all requests in a tagset to make this easier. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Stephen Bates <Stephen.Bates@pmcs.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-08 08:38:49 -06:00
Matias Bjørling	d9e46d5d7c	lightnvm: make __nvm_submit_ppa static The __nvm_submit_ppa() function is not used outside lightnvm core. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	8680f165ea	lightnvm: make ppa_list const in nvm_set_rqd_list The passed by reference ppa list in nvm_set_rqd_list() is updated when multiple planes are available. In that case, each PPA plane is incremented when the device side PPA list is created. This prevents the caller to rely on the PPA list to be unmodified after a call. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	24d4a7d721	lightnvm: fix lun offset calculation for mark blk The gen_mark_blk_bad function marks the wrong block when a block is on a different channel. Fix the index calculation, so that it updates the correct block. Reported-by: Javier Gonzalez <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	855cdd2c0b	lightnvm: make rrpc_map_page call nvm_get_blk outside locks The nvm_get_blk() function is called with rlun->lock held. This is ok when the media manager implementation doesn't go out of its atomic context. However, if a media manager persists its metadata, and guarantees that the block is given to the target, this is no longer a viable approach. Therefore, clean up the flow of rrpc_map_page, and make sure that nvm_get_blk() is called without any locks acquired. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	41285fad51	lightnvm: remove _unlocked variant of [get/put]_blk The [get/put]_blk API enables targets to get ownership of blocks at runtime. This information is currently not recorded on disk, and the information is therefore lost on power failure. To restore the metadata, the [get/put]_blk must persist its metadata. In that case, we need to control the outer lock, so that we can disable them while updating the on-disk metadata. Fortunately, the _unlocked versions can be removed, which allows us to move the lock into the [get/put]_blk functions. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	8c39eddbf2	lightnvm: remove unused lists from struct rrpc_block The ->list, ->open_list, and ->closed_list lists were previously used for statistics. However, their usage have been removed, and thus these can safely be removed. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	5cd907853d	lightnvm: remove nested lock conflict with mm If a media manager tries to initialize it targets upon media manager initialization, the media manager will need to know which target types are available in LightNVM. The lists of which managers and target types are available shares the same lock. Therefore, on initialization, the nvm_lock is taken by LightNVM core, which later leads to a deadlock when target types are enumerated by the media manager. Add an exclusive lock for target types to resolve this conflict. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	b76eb20bb0	lightnvm: move target mgmt into media mgr To enable persistent block management to easily control creation and removal of targets, we move target management into the media manager. The LightNVM core continues to maintain which target types are registered, while the media manager now keeps track of its initialized targets. Two new callbacks for the media manager are introduced. create_tgt and remove_tgt. Note that remove_tgt returns 0 on successfully removing a target, and returns 1 if the target was not found. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	5e60edb7dc	lightnvm: rename gennvm and update description The generic manager should be called the general media manager, and instead of using the rather long name of "gennvm" in front of each data structures, use "gen" instead to shorten it. Update the description of the media manager as well to make the media manager purpose clearer. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	077d238999	lightnvm: remove open/close statistics for gennvm The responsibility of the media manager is not to keep track of open/closed blocks. This is better maintained within a target, that already manages this information on writes. Remove the statistics and merge the states NVM_BLK_ST_OPEN and NVM_BLK_ST_CLOSED. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	12624af26e	lightnvm: fix checkpatch terse errors A couple of small checkpatch fixups to stop it from complaining. ./drivers/lightnvm/core.c:360: WARNING: line over 80 characters ./drivers/lightnvm/core.c:360: ERROR: trailing statements should be on next line ./drivers/lightnvm/core.c:503: WARNING: Block comments use a trailing */ on a separate line Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Matias Bjørling	5114e2773a	lightnvm: remove checkpatch warning for unsigned ints Checkpatch found two incidents where the type was preferred to be written out in full. ./drivers/lightnvm/rrpc.h:184: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' ./drivers/lightnvm/rrpc.h:209: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' ./drivers/lightnvm/rrpc.c:51: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Johannes Thumshirn	58eaaf9b6c	lightnvm: Make functions not used by ouside static Mark functions not used by ouside of thier implementing file as static. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Johannes Thumshirn	6f92970219	nvme: lightnvm: make MLC num_pairs little endian According to the OpenChannel SSD interface specification the NAND flash MLC page pairing information's number of page page pairings field is the first two bytes in the MLC Page Pairing data structure. The hardware's data structure itself is little endian so annotate it as such, like the rest of lighnvm's data structures. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Javier González	5389a1dfb3	lightnvm: initialize ppa_addr in dev_to_generic_addr() The ->reserved bit is not initialized when allocated on stack. This may lead targets to misinterpret the PPA as cached. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Javier González	529435e818	lightnvm: add media manager mark_blk helper Expose media manager mark_blk() to targets, as done for the rest of the media manager callback functions. Signed-off-by: Javier González <javier@cnexlabs.com> Updated description Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Wenwei Tao	0de2415bb7	lightnvm: break the loop when rqd is not null Break the loop when rqd is not null to reduce an unnecessary schedule. Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Dan Carpenter	f98d9ca17f	nvmet: fix an error code We accidentally return zero here when ERR_PTR(-ENOMEM) is intended. Fixes: `a07b4970f4` ('nvmet: add a generic NVMe target') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:37:36 -06:00
Arnd Bergmann	1fb4704084	nvme-loop: add configfs dependency CONFIG_NVME_TARGET has a correct CONFIG_CONFIGFS_FS dependency, but the newly added NVME_TARGET_LOOP is missing this, resulting in a link failure: drivers/nvme/built-in.o: In function `nvmet_init_configfs': loop.c:(.init.text+0x2a0): undefined reference to `config_group_init' loop.c:(.init.text+0x2c0): undefined reference to `config_group_init_type_name' loop.c:(.init.text+0x318): undefined reference to `configfs_register_subsystem' drivers/nvme/built-in.o: In function `nvmet_exit_configfs': loop.c:(.exit.text+0x9c): undefined reference to `configfs_unregister_subsystem' This adds the same dependency here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `3a85a5de29` ("nvme-loop: add a NVMe loopback host driver") Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:34:45 -06:00
Yijing Wang	89b920e003	bcache: Remove redundant block_size assignment We have assigned sb->block_size before the switch, so remove the redundant one. Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Yijing Wang <wangyijing@huawei.com> Acked-by: Eric Wheeler <bcache@lists.ewheeler.net> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:34:50 -06:00
Yijing Wang	7abc70d700	bcache: update document info There is no return in continue_at(), update the documentation. Signed-off-by: Yijing Wang <wangyijing@huawei.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:34:49 -06:00
Yijing Wang	c50d4d5dd3	bcache: Remove redundant parameter for cache_alloc() Cache_sb is not used in cache_alloc, and we have copied sb info to cache->sb already, remove it. Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:34:47 -06:00
Christoph Hellwig	3a85a5de29	nvme-loop: add a NVMe loopback host driver This patch implements adds nvme-loop which allows to access local devices exported as NVMe over Fabrics namespaces. This module can be useful for easy evaluation, testing and also feature experimentation. To createa nvme-loop device you need to configure the NVMe target to export a loop port (see the nvmetcli documentaton for that) and then connect to it using nvme connect-all -t loop which requires the very latest nvme-cli version with Fabrics support. Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:30:36 -06:00
Christoph Hellwig	a07b4970f4	nvmet: add a generic NVMe target This patch introduces a implementation of NVMe subsystems, controllers and discovery service which allows to export NVMe namespaces across fabrics such as Ethernet, FC etc. The implementation conforms to the NVMe 1.2.1 specification and interoperates with NVMe over fabrics host implementations. Configuration works using configfs, and is best performed using the nvmetcli tool from http://git.infradead.org/users/hch/nvmetcli.git, which also has a detailed explanation of the required steps in the README file. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com> Signed-off-by: Anthony Knapp <anthony.j.knapp@intel.com> Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:30:33 -06:00
Sagi Grimberg	9645c1a233	block: Export blk_poll The new NVMe over fabrics target will make use of this outside from a module. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:30:31 -06:00
Sagi Grimberg	038bd4cb67	nvme: add keep-alive support Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and optional in NVMe 1.2.1 for PCIe. This patch adds periodic keep-alive sent from the host to verify that the controller is still responsive and vice-versa. The keep-alive timeout is user-defined (with keep_alive_tmo connection parameter) and defaults to 5 seconds. In order to avoid a race condition where the host sends a keep-alive competing with the target side keep-alive timeout expiration, the host adds a grace period of 10 seconds when publishing the keep-alive timeout to the target. In case a keep-alive failed (or timed out), a transport specific error recovery kicks in. For now only NVMe over Fabrics is wired up to support keep alive, but we can add PCIe support easily once controllers actually supporting it become available. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Steve Wise <swise@chelsio.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:20 -06:00
Sagi Grimberg	7b89eae29e	nvme.h: Add keep-alive opcode and identify controller attribute KAS: keep-alive support and granularity of kato in units of 100 ms nvme_admin_keep_alive opcode: 0x18 Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:18 -06:00
Christoph Hellwig	07bfcd09a2	nvme-fabrics: add a generic NVMe over Fabrics library The NVMe over Fabrics library provides an interface for both transports and the nvme core to handle fabrics specific commands and attributes independent of the underlying transport. In addition, the fabrics library adds a misc device interface that allow actually creating a fabrics controller, as we can't just autodiscover it like in the PCI case. The nvme-cli utility has been enhanced to use this interface to support fabric connect and discovery. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>, Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>, Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:16 -06:00
Christoph Hellwig	eb793e2c92	nvme.h: add NVMe over Fabrics definitions The NVMe over Fabrics specification defines a protocol interface and related extensions to NVMe that enable operation over network protocols. The NVMe over Fabrics specification has an NVMe Transport binding for each NVMe Transport. This patch adds the fabrics related definitions: - fabric specific command set and error codes - transport addressing and binding definitions - fabrics sgl extensions - controller identification fabrics enhancements - discovery log page definition Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:14 -06:00
Ming Lin	1a353d85b0	nvme: add fabrics sysfs attributes - delete_controller: This attribute allows to delete a controller. A driver is not obligated to support it (pci doesn't) so it is created only if the driver supports it. The new fabrics drivers will support it (essentialy a disconnect operation). Usage: echo > /sys/class/nvme/nvme0/delete_controller - subsysnqn: This attribute shows the subsystem nqn of the configured device. If a driver does not implement the get_subsysnqn method, the file will not appear in sysfs. - transport: This attribute shows the transport name. Added a "name" field to struct nvme_ctrl_ops. For loop, cat /sys/class/nvme/nvme0/transport loop For RDMA, cat /sys/class/nvme/nvme0/transport rdma For PCIe, cat /sys/class/nvme/nvme0/transport pcie - address: This attributes shows the controller address. The fabrics drivers that will implement get_address can show the address of the connected controller. example: cat /sys/class/nvme/nvme0/address traddr=192.168.2.2,trsvcid=1023 Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:12 -06:00
Christoph Hellwig	eb71f43557	nvme: Modify and export sync command submission for fabrics NVMe over fabrics will use __nvme_submit_sync_cmd in the the transport and require a few tweaks to it. For that we export it and add a few more paramters: 1. allow passing a queue ID to the block layer For the NVMe over Fabrics connect command we need to able to specify a queue ID that we want to send the command on. Add a qid parameter to the relevant functions to enable this behavior. 2. allow submitting at_head commands In cases where we want to (re)connect to a controller where we have inflight queued commands we want to first connect and only then allow the other queued commands to be kicked. This will prevents failures in controller resets and reconnects. 3. allow passing flags to blk_mq_allocate_request Both for Fabrics connect the the keep-alive feature in NVMe 1.2.1 we want to be able to use reserved requests. Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:11 -06:00
Christoph Hellwig	7d2e80080d	nvme: allow transitioning from NEW to LIVE state For Fabrics we're not going through an intermediate reset state (at least for now). Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:09 -06:00
Ming Lin	1f5bd336b9	blk-mq: add blk_mq_alloc_request_hctx For some protocols like NVMe over Fabrics we need to be able to send initialization commands to a specific queue. Based on an earlier patch from Christoph Hellwig <hch@lst.de>. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> [hch: disallow sleeping allocation, req_op fixes] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:07 -06:00
Bartlomiej Zolnierkiewicz	9e2d23f19e	mg_disk: fix error path in mg_probe() MG_DISK_MAJ is defined as 0 so dynamic block major number allocation is used by the driver and the assigned major number is stored in host->major. This patch fixes error path in mg_probe() to use host->major instead of using MG_DISK_MAJ. Cc: unsik Kim <donari75@gmail.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-28 11:01:27 -06:00
Lars Ellenberg	1b57e66384	drbd: correctly handle failed crypto_alloc_hash crypto_alloc_hash returns an ERR_PTR(), not NULL. Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash() on an errno in the cleanup path. Reported-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:08 -06:00
Lars Ellenberg	27ea1d876e	drbd: al_write_transaction: skip re-scanning of bitmap page pointer array For larger devices, the array of bitmap page pointers can grow very large (8000 pointers per TB of storage). For each activity log transaction, we need to flush the associated bitmap pages to stable storage. Currently, we just "mark" the respective pages while setting up the transaction, then tell the bitmap code to write out all marked pages, but skip unchanged pages. But one such transaction can affect only a small number of bitmap pages, there is no need to scan the full array of several (ten-)thousand page pointers to find the few marked ones. Instead, remember the index numbers of the few affected pages, and later only re-check those to skip duplicates and unchanged ones. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:08 -06:00
Lars Ellenberg	13c2088d41	drbd: finally report ms, not jiffies, in log message Also skip the message unless bitmap IO took longer than 5 ms. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:08 -06:00
Roland Kammerer	4e526a0046	drbd: get rid of empty statement in is_valid_state This should silence a warning about an empty statement. Thanks to Fabian Frederick <fabf@skynet.be> who sent a patch I modified to be smaller and avoids an additional indent level. Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Fabian Frederick	7e5fec3168	drbd: code cleanups without semantic changes This contains various cosmetic fixes ranging from simple typos to const-ifying, and using booleans properly. Original commit messages from Fabian's patch set: drbd: debugfs: constify drbd_version_fops drbd: use seq_put instead of seq_print where possible drbd: include linux/uaccess.h instead of asm/uaccess.h drbd: use const char * const for drbd strings drbd: kerneldoc warning fix in w_e_end_data_req() drbd: use unsigned for one bit fields drbd: use bool for peer is_ states drbd: fix typo drbd: use \| for bitmask combination drbd: use true/false for bool drbd: fix drbd_bm_init() comments drbd: introduce peer state union drbd: fix maybe_pull_ahead() locking comments drbd: use bool for growing drbd: remove redundant declarations drbd: replace if/BUG by BUG_ON Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	20004e2435	drbd: bump current uuid when resuming IO with diskless peer Scenario, starting with normal operation Connected Primary/Secondary UpToDate/UpToDate NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen) ... more failures happen, secondary loses it's disk, but eventually is able to re-establish the replication link ... Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!) We used to just resume/resent suspended requests, without bumping the UUID. Which will lead to problems later, when we want to re-attach the disk on the peer, without first disconnecting, or if we experience additional failures, because we now have diverging data without being able to recognize it. Make sure we also bump the current data generation UUID, if we notice "peer disk unknown" -> "peer disk known bad". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	31d646042d	drbd: disallow promotion during resync handshake, avoid deadlock and hard reset We already serialize connection state changes, and other, non-connection state changes (role changes) while we are establishing a connection. But if we have an established connection, then trigger a resync handshake (by primary --force or similar), until now we just had to be "lucky". Consider this sequence (e.g. deployment scenario): create-md; up; -> Connected Secondary/Secondary Inconsistent/Inconsistent then do a racy primary --force on both peers. block drbd0: drbd_sync_handshake: block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent ) block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate ) * HERE things go wrong. * block drbd0: role( Secondary -> Primary ) block drbd0: drbd_sync_handshake: block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: Becoming sync target due to disk states. block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s) drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) The problem here is that the local promotion happens before the sync handshake triggered by the remote promotion was completed. Some assumptions elsewhere become wrong, and when the expected resync handshake is then received and processed, we get stuck in a deadlock, which can only be recovered by reboot :-( Fix: if we know the peer has good data, and our own disk is present, but NOT good, and there is no resync going on yet, we expect a sync handshake to happen "soon". So reject a racy promotion with SS_IN_TRANSIENT_STATE. Result: ... as above ... block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate ) * local promotion being postponed until ... * block drbd0: drbd_sync_handshake: block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0 ... block drbd0: conn( WFBitMapT -> WFSyncUUID ) block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000 block drbd0: conn( WFSyncUUID -> SyncTarget ) * ... after the resync handshake * block drbd0: role( Secondary -> Primary ) Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	f2d3d75b66	drbd: sync_handshake: handle identical uuids with current (frozen) Primary If in a two-primary scenario, we lost our peer, freeze IO, and are still frozen (no UUID rotation) when the peer comes back as Secondary after a hard crash, we will see identical UUIDs. The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as arbitration, but that would cause the still running (but frozen) Primary to become SyncTarget (which it typically refuses), and the handshake is declined. Fix: check current roles. If we have one current primary, the Primary wins. (rule_nr = 41) Since that is a protocol change, use the newly introduced DRBD_FF_WSAME to determine if rule_nr = 41 can be applied. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	9104d31a75	drbd: introduce WRITE_SAME support We will support WRITE_SAME, if * all peers support WRITE_SAME (both in kernel and DRBD version), * all peer devices support WRITE_SAME * logical_block_size is identical on all peers. We may at some point introduce a fallback on the receiving side for devices/kernels that do not support WRITE_SAME, by open-coding a submit loop. But not yet. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	60bac04012	drbd: report sizes if rejecting too small peer disk Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	65f5be3579	drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0 Even if discard_zeroes_data != 0, if discard_zeroes_if_aligned is set, we assume we can reliably zero-out/discard using the drbd_issue_peer_discard() helper. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	af61494ad4	drbd: only restart frozen disk io when D_UP_TO_DATE When re-attaching the local backend device to a C_STANDALONE D_DISKLESS R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the backend that is being attached as D_UP_TO_DATE. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00

1 2 3 4 5 ...

602068 Commits