Now that we do no longer in-place endian-swap the bitmap, we allow
selected bitmap operations (testing bits, sometimes even settting bits)
during some bulk operations.
This caused us to hit a lot of FIXME asserts similar to
FIXME asender in drbd_bm_count_bits,
bitmap locked for 'write from resync_finished' by worker
Which now is nonsense: looking at the bitmap is perfectly legal
as long as it is not being resized.
This cosmetic patch defines some flags to describe expectations in finer
detail, so the asserts in e.g. bm_change_bits_to() can be skipped if
appropriate.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
All decisions about sync, sync direction, and wether or not to
allow a connect or attach are based on our set of UUIDs to tag a
data generation.
Log changes to the UUIDs whenever they occur,
logging "new current UUID P:Q:R:S" is more useful
than "Creating new current UUID".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The test if rs_pending_cnt == 0 was too weak. Using Test for
unacked_cnt == 0 instead. Moved that into the worker.
Since unacked_cnt gets already increased when an P_RS_DATA_REQ
comes in.
Also using a timer to make Ahead -> SyncSource -> Ahead cycles
slower...
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
See also commit from 2009-08-15
"drbd_uuid_compare(): Do not full sync in case a P_SYNC_UUID packet gets lost."
We saw cases where the History UUIDs where not as expected. So the
detection of the special case did not trigger. With the sync UUID
no longer being a random number, but deducible from the previous
bitmap UUID, the detection of this special case becomes more
reliable.
The SyncUUID now is the previous bitmap UUID + 0x1000000000000.
Rule 5a:
Cs = H1p & H1p + Offset = Bp
Connection was lost before SyncUUID Packet came through.
Corrent (peer) UUIDs:
Bp = H1p
H1p = H2p
H2p = 0
Become Sync target.
Rule 7a:
Cp = H1s & H1s + Offset = Bs
Connection was lost before SyncUUID Packet came through.
Correct (own) UUIDs:
Bs = H1s
H1s = H2s
H2s = 0
Become Sync source.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Besides removed a few lines of code, this moves the inspection
of the state from before the queuing process to after the queuing.
I.e. more closely to the actual invocation of the work.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When the sync source node replies to a P_RS_DATA_REQUEST packet
when it is already in ahead mode. I.e. those two packets
crossed each other on the wire, that may lead to diverging
bitmaps.
This never happens in a well-tuned-system. In a well-tuned-
system the resync controller has reduced the resync speed
to zero long before we got into ahead-mode.
But we have to be prepared for the not-well-tuned-system
of course as well.
Because -> diverging bitmaps = non terminating resync.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To improve the latency of IO requests during bitmap exchange,
we recently allowed writes while waiting for the bitmap, sending "set
out-of-sync" information packets for any newly dirtied bits.
We have to make sure that the new resync-uuid does not overtake
these "set oos" packets. Once the resync-uuid is received, the
sync target starts the resync process, and expects the bitmap to
only be cleared, not re-set.
If we use this protocol extension, we queue the generation and sending
of the resync-uuid on the worker, which naturally serializes with all
previously queued packets.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When we set or clear bits in a bitmap page,
also set a flag in the page->private pointer.
This allows us to skip writes of unchanged pages.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The old name is confusing: the function does not increment anything.
Also rename _inc_ap_bio_cond to inc_ap_bio_cond: there is no need for
an underscore.
Finally, make it clear that these functions return boolean values.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The FAULT_ACTIVE macro just wraps the drbd_insert_fault macro for no
apparent reason.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* C_STARTING_SYNC_S, C_STARTING_SYNC_T In these states the bitmap gets
written to disk. Locking out of app-IO is done by using the
drbd_queue_bitmap_io() and drbd_bitmap_io() functions these days.
It is no longer necessary to lock out app-IO based on the connection
state.
App-IO that may come in after the BITMAP_IO flag got cleared before the
state transition to C_SYNC_(SOURCE|TARGET) does not get mirrored, sets
a bit in the local bitmap, that is already set, therefore changes nothing.
* C_WF_BITMAP_S In this state we send updates (P_OUT_OF_SYNC packets).
With that we make sure they have the same number of bits when going
into the C_SYNC_(SOURCE|TARGET) connection state.
* C_UNCONNECTED: The receiver starts, no need to lock out IO.
* C_DISCONNECTING: in drbd_disconnect() we had a wait_event()
to wait until ap_bio_cnt reaches 0. Removed that.
* C_TIMEOUT, C_BROKEN_PIPE, C_NETWORK_FAILURE
C_PROTOCOL_ERROR, C_TEAR_DOWN: Same as C_DISCONNECTING
* C_WF_REPORT_PARAMS: IO still possible since that is still
like C_WF_CONNECTION.
And we do not need to send barriers in C_WF_BITMAP_S connection state.
Allow concurrent accesses to the bitmap when receiving the bitmap.
Everything gets ORed anyways.
A drbd_free_tl_hash() is in after_state_chg_work(). At that point
all the work items of the last connections must have been processed.
Introduced a call to drbd_free_tl_hash() into drbd_free_mdev()
for paranoia reasons.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The condition must be checked after perpare_to_wait(). The old
implementaion could loose wakeup events. Never observed in real
life.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We only issue resync requests if there is no significant application IO
going on. = Application IO has higher priority than resnyc IO.
If application IO can not be started because the resync process locked
an resync_lru entry, start the IO operations necessary to release the
lock ASAP.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
In this connection mode, the ahead node no longer replicates
application IO. The behind's disk becomes out dated.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To ease tracking of bios in some hash tables, we want it to
not cross certain boundaries (128k, used to be 32k).
We limit the maximum bio size using queue parameters.
Historically some defines and variables we use there have been named
max_segment_size, which was misguided. Rename them to max_bio_size,
and use [blk_]queue_max_hw_sectors where appropriate.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Preparation patch to be able to use the auto-throttling resync controller
for online-verify requests as well.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Convert direct reads of an inode's i_size to using i_size_read().
i_size_{read,write} use a seqcount to protect reads from accessing
incomple writes. Concurrent i_size_write()s require mutual exclussion
to protect the seqcount that is used by i_size_{read,write}. But
i_size_read() callers do not need to use additional locking.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...
If we have contention in drbd_al_begin_iod (heavy randon IO),
an administrative request to detach the disk may deadlock
for similar reasons as the recently fixed deadlock if detaching
because of IO-error.
The approach taken here is to either go through the intermediate
cleanup state D_FAILED, or first lock out application io,
don't just go directly to D_DISKLESS.
We need an additional state bit (WAS_IO_ERROR) to distinguish
the -> D_FAILED because of IO-error from other failures.
Sanitize D_ATTACHING -> D_FAILED to D_ATTACHING -> D_DISKLESS.
If only attaching, ldev may be missing still, but would be referenced
from within the after_state_ch for -> D_FAILED, potentially
dereferencing a NULL pointer.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If those messages ever get logged, clearly state that they are
actually failed ASSERTS, so our regression tests can pick them up
from the logs more easily.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Scenario:
Something (say, flush-147:0) is in drbd_al_begin_io,
holding a local_cnt, waiting for the resync to make progress.
Disk fails, worker in after_state_ch does drbd_rs_cancel_all,
then waits for local_cnt to drop to zero.
flush-147:0 is woken by drbd_rs_cancel_all, needs to write an AL
transaction, and queues that on the worker.
Deadlock.
Fix: do not wait in the worker, have put_ldev() trigger the
state change D_FAILED -> D_DISKLESS when necessary.
put_ldev() cannot do the state change directly, as it may or may not
already hold various spinlocks. We queue a short work instead.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd commit 17c854fea474a5eb3cfa12e4fb019e46debbc4ec
drbd: receiving of big packets, for payloads between 64kByte and 4GByte
introduced a new on-the-wire packet header format. We must no longer
assume either format, but use the result of whatever drbd_recv_header
has decoded.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Connections through a compressing proxy might have more bits
on the fly. 500MByte instead of 50MByte
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
There are three ways to get IO suspended:
* Loss of any access to data
* Fence-peer-handler running
* User requested to suspend IO
Track those in different bits, so that one condition clearing its
state bit does not interfere with the other two conditions.
Only when the user resumes IO he overrules all three bits.
The fact is hidden from the user, he sees only a single suspend
bit.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
addendum to baa33ae4eaa4477b60af7c434c0ddd1d182c1ae7
The race:
drbd_md_sync()
if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))
return;
==> RACE with drbd_md_mark_dirty() rearming the timer.
del_timer(&mdev->md_sync_timer);
Fixed by moving the del_timer before the test_and_clear_bit.
Additionally only rearm the timer in drbd_md_mark_dirty, if MD_DIRTY was
not already set, reduce the grace period from five to one second, and
add an ifdef'ed debuging aid to find code paths missing an explicit
drbd_md_sync, if any, as those are the only relevant ones for this race.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>