Commit Graph

317 Commits

Author SHA1 Message Date
Kent Overstreet
f38a5181d9 ceph: Convert to immutable biovecs
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
2013-11-23 22:33:52 -08:00
Linus Torvalds
6cccc7d301 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This includes both the first pile of Ceph patches (which I sent to
  torvalds@vger, sigh) and a few new patches that add support for
  fscache for Ceph.  That includes a few fscache core fixes that David
  Howells asked go through the Ceph tree.  (Thanks go to Milosz Tanski
  for putting this feature together)

  This first batch of patches (included here) had (has) several
  important RBD bug fixes, hole punch support, several different
  cleanups in the page cache interactions, improvements in the truncate
  code (new truncate mutex to avoid shenanigans with i_mutex), and a
  series of fixes in the synchronous striping read/write code.

  On top of that is a random collection of small fixes all across the
  tree (error code checks and error path cleanup, obsolete wq flags,
  etc)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
  ceph: use d_invalidate() to invalidate aliases
  ceph: remove ceph_lookup_inode()
  ceph: trivial buildbot warnings fix
  ceph: Do not do invalidate if the filesystem is mounted nofsc
  ceph: page still marked private_2
  ceph: ceph_readpage_to_fscache didn't check if marked
  ceph: clean PgPrivate2 on returning from readpages
  ceph: use fscache as a local presisent cache
  fscache: Netfs function for cleanup post readpages
  FS-Cache: Fix heading in documentation
  CacheFiles: Implement interface to check cache consistency
  FS-Cache: Add interface to check consistency of a cached object
  rbd: fix null dereference in dout
  rbd: fix buffer size for writes to images with snapshots
  libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
  rbd: fix I/O error propagation for reads
  ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  ceph: allow sync_read/write return partial successed size of read/write.
  ceph: fix bugs about handling short-read for sync read mode.
  ceph: remove useless variable revoked_rdcache
  ...
2013-09-09 09:13:22 -07:00
Tejun Heo
4d1829a59d ceph: WQ_NON_REENTRANT is meaningless and going away
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away.  Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
2013-08-09 17:55:29 -07:00
Eric Dumazet
64dc61306c net: add sk_stream_is_writeable() helper
Several call sites use the hardcoded following condition :

sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

Lets use a helper because TCP_NOTSENT_LOWAT support will change this
condition for TCP sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Alex Elder
81b36be4c5 libceph: allocate ceph message data with a slab allocator
Create a slab cache to manage ceph_msg_data structure allocation.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:36 -05:00
Alex Elder
e3d5d63804 libceph: allocate ceph messages with a slab allocator
Create a slab cache to manage ceph_msg structure allocation.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:31 -05:00
Alex Elder
a51b272e9e libceph: fix two messenger bugs
This patch makes four small changes in the ceph messenger.

While getting copyup functionality working I found two bugs in the
messenger.  Existing paths through the code did not trigger these
problems, but they're fixed here:
    - In ceph_msg_data_pagelist_cursor_init(), the cursor's
      last_piece field was being checked against the length
      supplied.  This was OK until this commit: ccba6d98 libceph:
      implement multiple data items in a message That commit changed
      the cursor init routines to allow lengths to be supplied that
      exceeded the size of the current data item. Because of this,
      we have to use the assigned cursor resid field rather than the
      provided length in determining whether the cursor points to
      the last piece of a data item.
    - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously
      catching attempts to add page data to a message if the message
      already had data assigned to it. That was OK until that same
      commit, at which point it was fine for messages to have
      multiple data items. It slipped through because that BUG_ON()
      call was present twice in that function. (You can never be too
      careful.)

In addition two other minor things are changed:
    - In ceph_msg_data_cursor_init(), the local variable "data" was
      getting assigned twice.
    - In ceph_msg_data_advance(), it was assumed that the
      type-specific advance routine would set new_piece to true
      after it advanced past the last piece. That may have been
      fine, but since we check for that case we might as well set it
      explicitly in ceph_msg_data_advance().

This resolves:
    http://tracker.ceph.com/issues/4762

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:05 -07:00
Alex Elder
90af36022a libceph: add, don't set data for a message
Change the names of the functions that put data on a pagelist to
reflect that we're adding to whatever's already there rather than
just setting it to the one thing.  Currently only one data item is
ever added to a message, but that's about to change.

This resolves:
    http://tracker.ceph.com/issues/2770

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:34 -07:00
Alex Elder
ca8b3a6917 libceph: implement multiple data items in a message
This patch adds support to the messenger for more than one data item
in its data list.

A message data cursor has two more fields to support this:
    - a count of the number of bytes left to be consumed across
      all data items in the list, "total_resid"
    - a pointer to the head of the list (for validation only)

The cursor initialization routine has been split into two parts: the
outer one, which initializes the cursor for traversing the entire
list of data items; and the inner one, which initializes the cursor
to start processing a single data item.

When a message cursor is first initialized, the outer initialization
routine sets total_resid to the length provided.  The data pointer
is initialized to the first data item on the list.  From there, the
inner initialization routine finishes by setting up to process the
data item the cursor points to.

Advancing the cursor consumes bytes in total_resid.  If the resid
field reaches zero, it means the current data item is fully
consumed.  If total_resid indicates there is more data, the cursor
is advanced to point to the next data item, and then the inner
initialization routine prepares for using that.  (A check is made at
this point to make sure we don't wrap around the front of the list.)

The type-specific init routines are modified so they can be given a
length that's larger than what the data item can support.  The resid
field is initialized to the smaller of the provided length and the
length of the entire data item.

When total_resid reaches zero, we're done.

This resolves:
    http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:33 -07:00
Alex Elder
5240d9f95d libceph: replace message data pointer with list
In place of the message data pointer, use a list head which links
through message data items.  For now we only support a single entry
on that list.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:32 -07:00
Alex Elder
8ae4f4f5c0 libceph: have cursor point to data
Rather than having a ceph message data item point to the cursor it's
associated with, have the cursor point to a data item.  This will
allow a message cursor to be used for more than one data item.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:30 -07:00
Alex Elder
36153ec9dd libceph: move cursor into message
A message will only be processing a single data item at a time, so
there's no need for each data item to have its own cursor.

Move the cursor embedded in the message data structure into the
message itself.  To minimize the impact, keep the data->cursor
field, but make it be a pointer to the cursor in the message.

Move the definition of ceph_msg_data above ceph_msg_data_cursor so
the cursor can point to the data without a forward definition rather
than vice-versa.

This and the upcoming patches are part of:
    http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:29 -07:00
Alex Elder
c851c49591 libceph: record bio length
The bio is the only data item type that doesn't record its full
length.  Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:28 -07:00
Alex Elder
f759ebb968 libceph: skip message if too big to receive
We know the length of our message buffers.  If we get a message
that's too long, just dump it and ignore it.  If skip was set
then con->in_msg won't be valid, so be careful not to dereference
a null pointer in the process.

This resolves:
    http://tracker.ceph.com/issues/4664

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:27 -07:00
Alex Elder
ea96571f7b libceph: fix possible CONFIG_BLOCK build problem
This patch:
    15a0d7b libceph: record message data length
did not enclose some bio-specific code inside CONFIG_BLOCK as
it should have.  Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:26 -07:00
Alex Elder
98fa5dd883 libceph: provide data length when preparing message
In prepare_message_data(), the length used to initialize the cursor
is taken from the header of the message provided.  I'm working
toward not using the header data length field to determine length in
outbound messages, and this is a step in that direction.  For
inbound messages this will be set to be the actual number of bytes
that are arriving (which may be less than the total size of the data
buffer available).

This resolves:
    http://tracker.ceph.com/issues/4589

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:03 -07:00
Alex Elder
a193080481 libceph: record message data length
Keep track of the length of the data portion for a message in a
separate field in the ceph_msg structure.  This information has
been maintained in wire byte order in the message header, but
that's going to change soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:57 -07:00
Alex Elder
56fc565916 libceph: account for alignment in pages cursor
When a cursor for a page array data message is initialized it needs
to determine the initial value for cursor->last_piece.  Currently it
just checks if length is less than a page, but that's not correct.
The data in the first page in the array will be offset by a page
offset based on the alignment recorded for the data.  (All pages
thereafter will be aligned at the base of the page, so there's
no need to account for this except for the first page.)

Because this was wrong, there was a case where the length of a piece
would be calculated as all of the residual bytes in the message and
that plus the page offset could exceed the length of a page.

So fix this case.  Make sure the sum won't wrap.

This resolves a third issue described in:
    http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:40 -07:00
Alex Elder
5df521b1ee libceph: page offset must be less than page size
Currently ceph_msg_data_pages_advance() allows the page offset value
to be PAGE_SIZE, apparently assuming ceph_msg_data_pages_next() will
treat it as 0.  But that doesn't happen, and the result led to a
helpful assertion failure.

Change ceph_msg_data_pages_advance() to truncate the offset to 0
before returning if it reaches PAGE_SIZE.

Make a few other minor adjustments in this area (comments and a
better assertion) while modifying it.

This resolves a second issue described in:
    http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:39 -07:00
Alex Elder
1190bf06a6 libceph: fix broken data length assertions
It's OK for the result of a read to come back with fewer bytes than
were requested.  So don't trigger a BUG() in that case when
initializing the data cursor.

This resolves the first problem described in:
    http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:38 -07:00
Alex Elder
6644ed7b7e libceph: make message data be a pointer
Begin the transition from a single message data item to a list of
them by replacing the "data" structure in a message with a pointer
to a ceph_msg_data structure.

A null pointer will indicate the message has no data; replace the
use of ceph_msg_has_data() with a simple check for a null pointer.

Create functions ceph_msg_data_create() and ceph_msg_data_destroy()
to dynamically allocate and free a data item structure of a given type.

When a message has its data item "set," allocate one of these to
hold the data description, and free it when the last reference to
the message is dropped.

This partially resolves:
    http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:37 -07:00
Alex Elder
8ea299bcbc libceph: use only ceph_msg_data_advance()
The *_msg_pos_next() functions do little more than call
ceph_msg_data_advance().  Replace those wrapper functions with
a simple call to ceph_msg_data_advance().

This cleanup is related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:36 -07:00
Alex Elder
143334ff44 libceph: don't add to crc unless data sent
In write_partial_message_data() we aggregate the crc for the data
portion of the message as each new piece of the data item is
encountered.  Because it was computed *before* sending the data, if
an attempt to send a new piece resulted in 0 bytes being sent, the
crc crc across that piece would erroneously get computed again and
added to the aggregate result.  This would occasionally happen in
the evnet of a connection failure.

The crc value isn't really needed until the complete value is known
after sending all data, so there's no need to compute it before
sending.

So don't calculate the crc for a piece until *after* we know at
least one byte of it has been sent.  That will avoid this problem.

This resolves:
    http://tracker.ceph.com/issues/4450

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:35 -07:00
Alex Elder
f5db90bcf2 libceph: kill last of ceph_msg_pos
The only remaining field in the ceph_msg_pos structure is
did_page_crc.  In the new cursor model of things that flag (or
something like it) belongs in the cursor.

Define a new field "need_crc" in the cursor (which applies to all
types of data) and initialize it to true whenever a cursor is
initialized.

In write_partial_message_data(), the data CRC still will be computed
as before, but it will check the cursor->need_crc field to determine
whether it's needed.  Any time the cursor is advanced to a new piece
of a data item, need_crc will be set, and this will cause the crc
for that entire piece to be accumulated into the data crc.

In write_partial_message_data() the intermediate crc value is now
held in a local variable so it doesn't have to be byte-swapped so
many times.  In read_partial_msg_data() we do something similar
(but mainly for consistency there).

With that, the ceph_msg_pos structure can go away,  and it no longer
needs to be passed as an argument to prepare_message_data().

This cleanup is related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:34 -07:00
Alex Elder
859a35d552 libceph: kill most of ceph_msg_pos
All but one of the fields in the ceph_msg_pos structure are now
never used (only assigned), so get rid of them.  This allows
several small blocks of code to go away.

This is cleanup of old code related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:33 -07:00
Alex Elder
643c68a4a9 libceph: use cursor resid for loop condition
Use the "resid" field of a cursor rather than finding when the
message data position has moved up to meet the data length to
determine when all data has been sent or received in
write_partial_message_data() and read_partial_msg_data().

This is cleanup of old code related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:32 -07:00
Alex Elder
4c59b4a278 libceph: collapse all data items into one
It turns out that only one of the data item types is ever used at
any one time in a single message (currently).
    - A page array is used by the osd client (on behalf of the file
      system) and by rbd.  Only one osd op (and therefore at most
      one data item) is ever used at a time by rbd.  And the only
      time the file system sends two, the second op contains no
      data.
    - A bio is only used by the rbd client (and again, only one
      data item per message)
    - A page list is used by the file system and by rbd for outgoing
      data, but only one op (and one data item) at a time.

We can therefore collapse all three of our data item fields into a
single field "data", and depend on the messenger code to properly
handle it based on its type.

This allows us to eliminate quite a bit of duplicated code.

This is related to:
    http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:30 -07:00
Alex Elder
686be20875 libceph: get rid of read helpers
Now that read_partial_message_pages() and read_partial_message_bio()
are literally identical functions we can factor them out.  They're
pretty simple as well, so just move their relevant content into
read_partial_msg_data().

This is and previous patches together resolve:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:29 -07:00
Alex Elder
61fcdc97c0 libceph: no outbound zero data
There is handling in write_partial_message_data() for the case where
only the length of--and no other information about--the data to be
sent has been specified.  It uses the zero page as the source of
data to send in this case.

This case doesn't occur.  All message senders set up a page array,
pagelist, or bio describing the data to be sent.  So eliminate the
block of code that handles this (but check and issue a warning for
now, just in case it happens for some reason).

This resolves:
    http://tracker.ceph.com/issues/4426

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:28 -07:00
Alex Elder
878efabd32 libceph: use cursor for inbound data pages
The cursor code for a page array selects the right page, page
offset, and length to use for a ceph_tcp_recvpage() call, so
we can use it to replace a block in read_partial_message_pages().

This partially resolves:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:27 -07:00
Alex Elder
6518be47f9 libceph: kill ceph message bio_iter, bio_seg
The bio_iter and bio_seg fields in a message are no longer used, we
use the cursor instead.  So get rid of them and the functions that
operate on them them.

This is related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:26 -07:00
Alex Elder
463207aa40 libceph: use cursor for bio reads
Replace the use of the information in con->in_msg_pos for incoming
bio data.  The old in_msg_pos and the new cursor mechanism do
basically the same thing, just slightly differently.

The main functional difference is that in_msg_pos keeps track of the
length of the complete bio list, and assumed it was fully consumed
when that many bytes had been transferred.  The cursor does not assume
a length, it simply consumes all bytes in the bio list.  Because the
only user of bio data is the rbd client, and because the length of a
bio list provided by rbd client always matches the number of bytes
in the list, both ways of tracking length are equivalent.

In addition, for in_msg_pos the initial bio vector is selected as
the initial value of the bio->bi_idx, while the cursor assumes this
is zero.  Again, the rbd client always passes 0 as the initial index
so the effect is the same.

Other than that, they basically match:
    in_msg_pos      cursor
    ----------      ------
    bio_iter        bio
    bio_seg         vec_index
    page_pos        page_offset

The in_msg_pos field is initialized by a call to init_bio_iter().
The bio cursor is initialized by ceph_msg_data_cursor_init().
Both now happen in the same spot, in prepare_message_data().

The in_msg_pos field is advanced by a call to in_msg_pos_next(),
which updates page_pos and calls iter_bio_next() to move to the next
bio vector, or to the next bio in the list.  The cursor is advanced
by ceph_msg_data_advance().  That isn't currently happening so
add a call to that in in_msg_pos_next().

Finally, the next piece of data to use for a read is determined
by a bunch of lines in read_partial_message_bio().  Those can be
replaced by an equivalent ceph_msg_data_bio_next() call.

This partially resolves:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:25 -07:00
Alex Elder
25aff7c559 libceph: record residual bytes for all message data types
All of the data types can use this, not just the page array.  Until
now, only the bio type doesn't have it available, and only the
initiator of the request (the rbd client) is able to supply the
length of the full request without re-scanning the bio list.  Change
the cursor init routines so the length is supplied based on the
message header "data_len" field, and use that length to intiialize
the "resid" field of the cursor.

In addition, change the way "last_piece" is defined so it is based
on the residual number of bytes in the original request.  This is
necessary (at least for bio messages) because it is possible for
a read request to succeed without consuming all of the space
available in the data buffer.

This resolves:
    http://tracker.ceph.com/issues/4427

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:24 -07:00
Alex Elder
28a89ddece libceph: drop pages parameter
The value passed for "pages" in read_partial_message_pages() is
always the pages pointer from the incoming message, which can be
derived inside that function.  So just get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:23 -07:00
Alex Elder
888334f966 libceph: initialize data fields on last msg put
When the last reference to a ceph message is dropped,
ceph_msg_last_put() is called to clean things up.

For "normal" messages (allocated via ceph_msg_new() rather than
being allocated from a memory pool) it's sufficient to just release
resources.  But for a mempool-allocated message we actually have to
re-initialize the data fields in the message back to initial state
so they're ready to go in the event the message gets reused.

Some of this was already done; this fleshes it out so it's done
more completely.

This resolves:
    http://tracker.ceph.com/issues/4540

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:22 -07:00
Sage Weil
20e55c4cc7 libceph: clear messenger auth_retry flag when we authenticate
We maintain a counter of failed auth attempts to allow us to retry once
before failing.  However, if the second attempt succeeds, the flag isn't
cleared, which makes us think auth failed again later when the connection
resets for other reasons (like a socket error).

This is one part of the sorry sequence of events in bug

	http://tracker.ceph.com/issues/4282

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:11 -07:00
Sage Weil
3a23083bda libceph: implement RECONNECT_SEQ feature
This is an old protocol extension that allows the client and server to
avoid resending old messages after a reconnect (following a socket error).
Instead, the exchange their sequence numbers during the handshake.  This
avoids sending a bunch of useless data over the socket.

It has been supported in the server code since v0.22 (Sep 2010).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:09 -07:00
Alex Elder
8a166d0536 libceph: more cleanup of write_partial_msg_pages()
Basically all cases in write_partial_msg_pages() use the cursor, and
as a result we can simplify that function quite a bit.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:06 -07:00
Alex Elder
9d2a06c275 libceph: kill message trail
The wart that is the ceph message trail can now be removed, because
its only user was the osd client, and the previous patch made that
no longer the case.

The result allows write_partial_msg_pages() to be simplified
considerably.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:05 -07:00
Alex Elder
e766d7b55e libceph: implement pages array cursor
Implement and use cursor routines for page array message data items
for outbound message data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:01 -07:00
Alex Elder
6aaa4511de libceph: implement bio message data item cursor
Implement and use cursor routines for bio message data items for
outbound message data.

(See the previous commit for reasoning in support of the changes
in out_msg_pos_next().)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:59 -07:00
Alex Elder
7fe1e5e57b libceph: use data cursor for message pagelist
Switch to using the message cursor for the (non-trail) outgoing
pagelist data item in a message if present.

Notes on the logic changes in out_msg_pos_next():
    - only the mds client uses a ceph pagelist for message data;
    - if the mds client ever uses a pagelist, it never uses a page
      array (or anything else, for that matter) for data in the same
      message;
    - only the osd client uses the trail portion of a message data,
      and when it does, it never uses any other data fields for
      outgoing data in the same message; and finally
    - only the rbd client uses bio message data (never pagelist).

Therefore out_msg_pos_next() can assume:
    - if we're in the trail portion of a message, the message data
      pagelist, data, and bio can be ignored; and
    - if there is a page list, there will never be any a bio or page
      array data, and vice-versa.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:58 -07:00
Alex Elder
dd236fcb65 libceph: prepare for other message data item types
This just inserts some infrastructure in preparation for handling
other types of ceph message data items.  No functional changes,
just trying to simplify review by separating out some noise.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:57 -07:00
Alex Elder
fe38a2b67b libceph: start defining message data cursor
This patch lays out the foundation for using generic routines to
manage processing items of message data.

For simplicity, we'll start with just the trail portion of a
message, because it stands alone and is only present for outgoing
data.

First some basic concepts.  We'll use the term "data item" to
represent one of the ceph_msg_data structures associated with a
message.  There are currently four of those, with single-letter
field names p, l, b, and t.  A data item is further broken into
"pieces" which always lie in a single page.  A data item will
include a "cursor" that will track state as the memory defined by
the item is consumed by sending data from or receiving data into it.

We define three routines to manipulate a data item's cursor: the
"init" routine; the "next" routine; and the "advance" routine.  The
"init" routine initializes the cursor so it points at the beginning
of the first piece in the item.  The "next" routine returns the
page, page offset, and length (limited by both the page and item
size) of the next unconsumed piece in the item.  It also indicates
to the caller whether the piece being returned is the last one in
the data item.

The "advance" routine consumes the requested number of bytes in the
item (advancing the cursor).  This is used to record the number of
bytes from the current piece that were actually sent or received by
the network code.  It returns an indication of whether the result
means the current piece has been fully consumed.  This is used by
the message send code to determine whether it should calculate the
CRC for the next piece processed.

The trail of a message is implemented as a ceph pagelist.  The
routines defined for it will be usable for non-trail pagelist data
as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:56 -07:00
Alex Elder
437945094f libceph: abstract message data
Group the types of message data into an abstract structure with a
type indicator and a union containing fields appropriate to the
type of data it represents.  Use this to represent the pages,
pagelist, bio, and trail in a ceph message.

Verify message data is of type NONE in ceph_msg_data_set_*()
routines.  Since information about message data of type NONE really
should not be interpreted, get rid of the other assertions in those
functions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:55 -07:00
Alex Elder
f9e15777af libceph: be explicit about message data representation
A ceph message has a data payload portion.  The memory for that data
(either the source of data to send or the location to place data
that is received) is specified in several ways.  The ceph_msg
structure includes fields for all of those ways, but this
mispresents the fact that not all of them are used at a time.

Specifically, the data in a message can be in:
    - an array of pages
    - a list of pages
    - a list of Linux bios
    - a second list of pages (the "trail")
(The two page lists are currently only ever used for outgoing data.)

Impose more structure on the ceph message, making the grouping of
some of these fields explicit.  Shorten the name of the
"page_alignment" field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:54 -07:00
Alex Elder
97fb1c7f66 libceph: define ceph_msg_has_*() data macros
Define and use macros ceph_msg_has_*() to determine whether to
operate on the pages, pagelist, bio, and trail fields of a message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:53 -07:00
Alex Elder
35b6280899 libceph: define and use ceph_crc32c_page()
Factor out a common block of code that updates a CRC calculation
over a range of data in a page.

This and the preceding patches are related to:
    http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:52 -07:00
Alex Elder
afb3d90e20 libceph: define and use ceph_tcp_recvpage()
Define a new function ceph_tcp_recvpage() that behaves in a way
comparable to ceph_tcp_sendpage().

Rearrange the code in both read_partial_message_pages() and
read_partial_message_bio() so they have matching structure,
(similar to what's in write_partial_msg_pages()), and use
this new function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:51 -07:00
Alex Elder
34d2d2006c libceph: encapsulate reading message data
Pull the code that reads the data portion into a message into
a separate function read_partial_msg_data().

Rename write_partial_msg_pages() to be write_partial_message_data()
to match its read counterpart, and to reflect its more generic
purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:50 -07:00
Alex Elder
e387d525b0 libceph: small write_partial_msg_pages() refactor
Define local variables page_offset and length to represent the range
of bytes within a page that will be sent by ceph_tcp_sendpage() in
write_partial_msg_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:48 -07:00
Alex Elder
78625051b5 libceph: consolidate message prep code
In prepare_write_message_data(), various fields are initialized in
preparation for writing message data out.  Meanwhile, in
read_partial_message(), there is essentially the same block of code,
operating on message variables associated with an incoming message.

Generalize prepare_write_message_data() so it works for both
incoming and outcoming messages, and use it in both spots.  The
did_page_crc is not used for input (so it's harmless to initialize
it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:47 -07:00
Alex Elder
bae6acd9c6 libceph: use local variables for message positions
There are several places where a message's out_msg_pos or in_msg_pos
field is used repeatedly within a function.  Use a local pointer
variable for this purpose to unclutter the code.

This and the upcoming cleanup patches are related to:
    http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:46 -07:00
Alex Elder
98a0370898 libceph: don't clear bio_iter in prepare_write_message()
At one time it was necessary to clear a message's bio_iter field to
avoid a bad pointer dereference in write_partial_msg_pages().

That no longer seems to be the case.  Here's why.

The message's bio fields represent (in this case) outgoing data.
Between where the bio_iter is made NULL in prepare_write_message()
and the call in that function to prepare_message_data(), the
bio fields are never used.

In prepare_message_data(), init-bio_iter() is called, and the result
of that overwrites the value in the message's bio_iter field.

Because it gets overwritten anyway, there is no need to set it to
NULL.  So don't do it.

This resolves:
    http://tracker.ceph.com/issues/4402

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:45 -07:00
Alex Elder
07aa155878 libceph: activate message data assignment checks
The mds client no longer tries to assign zero-length message data,
and the osd client no longer sets its data info more than once.
This allows us to activate assertions in the messenger to verify
these things never happen.

This resolves both of these:
    http://tracker.ceph.com/issues/4263
    http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:44 -07:00
Alex Elder
4a73ef27ad libceph: record message data byte length
Record the number of bytes of data in a page array rather than the
number of pages in the array.  It can be assumed that the page array
is of sufficient size to hold the number of bytes indicated (and
offset by the indicated alignment).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:42 -07:00
Alex Elder
ebf18f4709 ceph: only set message data pointers if non-empty
Change it so we only assign outgoing data information for messages
if there is outgoing data to send.

This then allows us to add a few more (currently commented-out)
assertions.

This is related to:
    http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:41 -07:00
Alex Elder
27fa83852b libceph: isolate other message data fields
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure.  Use the
new functions in the osd client and mds client.

This partially resolves:
    http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:40 -07:00
Alex Elder
f1baeb2b9f libceph: set page info with byte length
When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:39 -07:00
Alex Elder
02afca6ca0 libceph: isolate message page field manipulation
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure.  Use this new function in the osd client and mds
client.

Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that).  At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once.  (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)

Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.

This partially resolves:
    http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:38 -07:00
Alex Elder
9516e45b25 libceph: simplify new message initialization
Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages.  This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:35 -07:00
Alex Elder
35c7bfbcd4 libceph: advance pagelist with list_rotate_left()
While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list.  This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.

There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.

So replace these two list_move_tail() calls with list_rotate_left().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:34 -07:00
Alex Elder
e788182fa6 libceph: define and use in_msg_pos_next()
Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().

Note that the page number is incremented and offset reset under
slightly different conditions from before.  The result is
equivalent, however, as explained below.

Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive.  So the amount we'll use is the
lesser of:  all that's left of the entire request; and all that's
left in the current page.

If we received exactly how many were requested, we either reached
the end of the request or the end of the page.  In the first case,
we're done, in the second, we move onto the next page in the array.

In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE.  On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again.  The code previously skipped over that last case,
basically.  The new code handles that case the same as the others,
incrementing and resetting.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:33 -07:00
Alex Elder
b3d56fab33 libceph: kill args in read_partial_message_bio()
There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments.  Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.

So drop those two parameters and use their derived equivalents.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:32 -07:00
Alex Elder
e1dcb128f8 libceph: change type of ceph_tcp_sendpage() "more"
Change the type of the "more" parameter from int to bool.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:31 -07:00
Alex Elder
6ebc8b32b3 libceph: minor byte order problems in read_partial_message()
Some values printed are not (necessarily) in CPU order.  We already
have a copy of the converted versions, so use them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:30 -07:00
Alex Elder
7b11ba3758 libceph: define CEPH_MSG_MAX_MIDDLE_LEN
This is probably unnecessary but the code read as if it were wrong
in read_partial_message().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:29 -07:00
Alex Elder
4137577ae3 libceph: clean up skipped message logic
In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().

The osd client, mon client, and mds client each supply an alloc_msg
method.  The mds client always assigns skip to be 0.

The other two leave the skip value of as-is, or assigns it to zero,
except:
    - if no (osd or mon) request having the given tid is found, in
      which case skip is set to 1 and NULL is returned; or
    - in the osd client, if the data of the reply message is not
      adequate to hold the message to be read, it assigns skip
      value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.

Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious.  Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.

This resolves:
    http://tracker.ceph.com/issues/4324

Reported-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:28 -07:00
Alex Elder
53ded495c6 libceph: define mds_alloc_msg() method
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client.  Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.

This and the next patch resolve:
    http://tracker.ceph.com/issues/4322

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:19 -07:00
Alex Elder
1d866d1c31 libceph: drop mutex while allocating a message
In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a
connection a new message is allocated with ceph_msg_new().

Drop the mutex before making this call, and make sure we're still
connected when we get it back again.

This is preparing for the next patch, which ensures all connections
define an alloc_msg method, and then handles them all the same way.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:18 -07:00
Alex Elder
ec02a2f2ff libceph: kill ceph_msg->pagelist_count
The pagelist_count field is never actually used, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:16 -07:00
Alex Elder
d4b515fa10 libceph: distinguish page array and pagelist count
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list.  Currently only one or the
other is used at a time, but that will be changing soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:28 -07:00
Alex Elder
07c09b7255 libceph: make ceph_msg->bio_seg be unsigned
The bio_seg field is used by the ceph messenger in iterating through
a bio.  It should never have a negative value, so make it an
unsigned.  (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)

Change variables used to hold bio_seg values to all be unsigned as
well.  Change two variable names in init_bio_iter() to match the
convention used everywhere else.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:23 -07:00
Alex Elder
49659416ba libceph: use a do..while loop in con_work()
This just converts a manually-implemented loop into a do..while loop
in con_work().  It also moves handling of EAGAIN inside the blocks
where it's already been determined an error code was returned.

Also update a few dout() calls near the affected code for
consistency.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:37 -06:00
Alex Elder
b6e7b6a119 libceph: use a flag to indicate a fault has occurred
This just rearranges the logic in con_work() a little bit so that a
flag is used to indicate a fault has occurred.  This allows both the
fault and non-fault case to be handled the same way and avoids a
couple of nearly consecutive gotos.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:32 -06:00
Alex Elder
9320926420 libceph: separate non-locked fault handling
An error occurring on a ceph connection is treated as a fault,
causing the connection to be reset.  The initial part of this fault
handling has to be done while holding the connection mutex, but
it must then be dropped for the last part.

Separate the part of this fault handling that executes without the
lock into its own function, con_fault_finish().  Move the call to
this new function, as well as call that drops the connection mutex,
into ceph_fault().  Rename that function con_fault() to reflect that
it's only handling the connection part of the fault handling.

The motivation for this was a warning from sparse about the locking
being done here.  Rearranging things this way keeps all the mutex
manipulation within ceph_fault(), and this stops sparse from
complaining.

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:28 -06:00
Alex Elder
f20a39fd6e libceph: encapsulate connection backoff
Collect the code that tests for and implements a backoff delay for a
ceph connection into a new function, ceph_backoff().

Make the debug output messages in that part of the code report
things consistently by reporting a message in the socket closed
case, and by making the one for PREOPEN state report the connection
pointer like the rest.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:24 -06:00
Alex Elder
1541716789 libceph: eliminate sparse warnings
Eliminate most of the problems in the libceph code that cause sparse
to issue warnings.
    - Convert functions that are never referenced externally to have
      static scope.
    - Pass NULL rather than 0 for a pointer argument in one spot in
      ceph_monc_delete_snapid()

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:18 -06:00
Alex Elder
c9ffc77ade libceph: define connection flag helpers
Define and use functions that encapsulate operations performed on
a connection's flags.

This resolves:
    http://tracker.ceph.com/issues/4234

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:04 -06:00
Alex Elder
3ebc21f7bc libceph: fix messenger CONFIG_BLOCK dependencies
The ceph messenger has a few spots that are only used when
bio messages are supported, and that's only when CONFIG_BLOCK
is defined.  This surrounds a couple of spots with #ifdef's
that would cause a problem if CONFIG_BLOCK were not present
in the kernel configuration.

This resolves:
    http://tracker.ceph.com/issues/3976

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:27:40 -08:00
Sage Weil
0fa6ebc600 libceph: fix protocol feature mismatch failure path
We should not set con->state to CLOSED here; that happens in
ceph_fault() in the caller, where it first asserts that the state
is not yet CLOSED.  Avoids a BUG when the features don't match.

Since the fail_protocol() has become a trivial wrapper, replace
calls to it with direct calls to reset_connection().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-27 20:27:04 -06:00
Alex Elder
122070a2ff libceph: WARN, don't BUG on unexpected connection states
A number of assertions in the ceph messenger are implemented with
BUG_ON(), killing the system if connection's state doesn't match
what's expected.  At this point our state model is (evidently) not
well understood enough for these assertions to trigger a BUG().
Convert all BUG_ON(con->state...) calls to be WARN_ON(con->state...)
so we learn about these issues without killing the machine.

We now recognize that a connection fault can occur due to a socket
closure at any time, regardless of the state of the connection.  So
there is really nothing we can assert about the state of the
connection at that point so eliminate that assertion.

Reported-by: Ugis <ugis22@gmail.com>
Tested-by: Ugis <ugis22@gmail.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-27 20:27:04 -06:00
Alex Elder
28362986f8 libceph: report connection fault with warning
When a connection's socket disconnects, or if there's a protocol
error of some kind on the connection, a fault is signaled and
the connection is reset (closed and reopened, basically).  We
currently get an error message on the log whenever this occurs.

A ceph connection will attempt to reestablish a socket connection
repeatedly if a fault occurs.  This means that these error messages
will get repeatedly added to the log, which is undesirable.

Change the error message to be a warning, so they don't get
logged by default.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-20 10:56:13 -06:00
Alex Elder
7bb21d68c5 libceph: socket can close in any connection state
A connection's socket can close for any reason, independent of the
state of the connection (and without irrespective of the connection
mutex).  As a result, the connectino can be in pretty much any state
at the time its socket is closed.

Handle those other cases at the top of con_work().  Pull this whole
block of code into a separate function to reduce the clutter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 12:07:32 -06:00
Sage Weil
7246240c7c libceph: avoid NULL kref_put from NULL alloc_msg return
The ceph_on_in_msg_alloc() method calls the ->alloc_msg() helper which
may return NULL.  It also drops con->mutex while it allocates a message,
which means that the connection state may change (e.g., get closed).  If
that happens, we clean up and bail out.  Avoid calling ceph_msg_put() on
a NULL return value and triggering a crash.

This was observed when an ->alloc_msg() call races with a timeout that
resends a zillion messages and resets the connection, and ->alloc_msg()
returns NULL (because the request was resent to another target).

Fixes http://tracker.newdream.net/issues/3342

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-26 16:35:04 -05:00
Alex Elder
802c6d967f rbd: define common queue_con_delay()
This patch defines a single function, queue_con_delay() to call
queue_delayed_work() for a connection.  It basically generalizes
what was previously queue_con() by adding the delay argument.
queue_con() is now a simple helper that passes 0 for its delay.
queue_con_delay() returns 0 if it queued work or an errno if it
did not for some reason.

If con_work() finds the BACKOFF flag set for a connection, it now
calls queue_con_delay() to handle arranging to start again after a
delay.

Note about connection reference counts:  con_work() only ever gets
called as a work item function.  At the time that work is scheduled,
a reference to the connection is acquired, and the corresponding
con_work() call is then responsible for dropping that reference
before it returns.

Previously, the backoff handling inside con_work() silently handed
off its reference to delayed work it scheduled.  Now that
queue_con_delay() is used, a new reference is acquired for the
newly-scheduled work, and the original reference is dropped by the
con->ops->put() call at the end of the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-09 22:00:44 -07:00
Alex Elder
8618e30bc1 rbd: let con_work() handle backoff
Both ceph_fault() and con_work() include handling for imposing a
delay before doing further processing on a faulted connection.
The latter is used only if ceph_fault() is unable to.

Instead, just let con_work() always be responsible for implementing
the delay.  After setting up the delay value, set the BACKOFF flag
on the connection unconditionally and call queue_con() to ensure
con_work() will get called to handle it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-09 22:00:21 -07:00
Alex Elder
588377d619 rbd: reset BACKOFF if unable to re-queue
If ceph_fault() is unable to queue work after a delay, it sets the
BACKOFF connection flag so con_work() will attempt to do so.

In con_work(), when BACKOFF is set, if queue_delayed_work() doesn't
result in newly-queued work, it simply ignores this condition and
proceeds as if no backoff delay were desired.  There are two
problems with this--one of which is a bug.

The first problem is simply that the intended behavior is to back
off, and if we aren't able queue the work item to run after a delay
we're not doing that.

The only reason queue_delayed_work() won't queue work is if the
provided work item is already queued.  In the messenger, this
means that con_work() is already scheduled to be run again.  So
if we simply set the BACKOFF flag again when this occurs, we know
the next con_work() call will again attempt to hold off activity
on the connection until after the delay.

The second problem--the bug--is a leak of a reference count.  If
queue_delayed_work() returns 0 in con_work(), con->ops->put() drops
the connection reference held on entry to con_work().  However,
processing is (was) allowed to continue, and at the end of the
function a second con->ops->put() is called.

This patch fixes both problems.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-09 21:59:52 -07:00
Alex Elder
5ce765a540 libceph: only kunmap kmapped pages
In write_partial_msg_pages(), pages need to be kmapped in order to
perform a CRC-32c calculation on them.  As an artifact of the way
this code used to be structured, the kunmap() call was separated
from the kmap() call and both were done conditionally.  But the
conditions under which the kmap() and kunmap() calls were made
differed, so there was a chance a kunmap() call would be done on a
page that had not been mapped.

The symptom of this was tripping a BUG() in kunmap_high() when
pkmap_count[nr] became 0.

Reported-by: Bryan K. Wright <bryan@virginia.edu>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-09-21 20:49:26 -07:00
Jim Schutt
6d4221b537 libceph: avoid truncation due to racing banners
Because the Ceph client messenger uses a non-blocking connect, it is
possible for the sending of the client banner to race with the
arrival of the banner sent by the peer.

When ceph_sock_state_change() notices the connect has completed, it
schedules work to process the socket via con_work().  During this
time the peer is writing its banner, and arrival of the peer banner
races with con_work().

If con_work() calls try_read() before the peer banner arrives, there
is nothing for it to do, after which con_work() calls try_write() to
send the client's banner.  In this case Ceph's protocol negotiation
can complete succesfully.

The server-side messenger immediately sends its banner and addresses
after accepting a connect request, *before* actually attempting to
read or verify the banner from the client.  As a result, it is
possible for the banner from the server to arrive before con_work()
calls try_read().  If that happens, try_read() will read the banner
and prepare protocol negotiation info via prepare_write_connect().
prepare_write_connect() calls con_out_kvec_reset(), which discards
the as-yet-unsent client banner.  Next, con_work() calls
try_write(), which sends the protocol negotiation info rather than
the banner that the peer is expecting.

The result is that the peer sees an invalid banner, and the client
reports "negotiation failed".

Fix this by moving con_out_kvec_reset() out of
prepare_write_connect() to its callers at all locations except the
one where the banner might still need to be sent.

[elder@inktak.com: added note about server-side behavior]

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-08-21 15:55:27 -07:00
Linus Torvalds
cc8362b1f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...
2012-07-31 14:35:28 -07:00
Sage Weil
6139919133 libceph: recheck con state after allocating incoming message
We drop the lock when calling the ->alloc_msg() con op, which means
we need to (a) not clobber con->in_msg without the mutex held, and (b)
we need to verify that we are still in the OPEN state when we retake
it to avoid causing any mayhem.  If the state does change, -EAGAIN
will get us back to con_work() and loop.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:19:45 -07:00
Sage Weil
4740a623d2 libceph: change ceph_con_in_msg_alloc convention to be less weird
This function's calling convention is very limiting.  In particular,
we can't return any error other than ENOMEM (and only implicitly),
which is a problem (see next patch).

Instead, return an normal 0 or error code, and make the skip a pointer
output parameter.  Drop the useless in_hdr argument (we have the con
pointer).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:19:30 -07:00
Sage Weil
8636ea672f libceph: avoid dropping con mutex before fault
The ceph_fault() function takes the con mutex, so we should avoid
dropping it before calling it.  This fixes a potential race with
another thread calling ceph_con_close(), or _open(), or similar (we
don't reverify con->state after retaking the lock).

Add annotation so that lockdep realizes we will drop the mutex before
returning.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:17:13 -07:00
Sage Weil
7b862e07b1 libceph: verify state after retaking con lock after dispatch
We drop the con mutex when delivering a message.  When we retake the
lock, we need to verify we are still in the OPEN state before
preparing to read the next tag, or else we risk stepping on a
connection that has been closed.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:16:56 -07:00
Sage Weil
8007b8d626 libceph: fix handling of immediate socket connect failure
If the connect() call immediately fails such that sock == NULL, we
still need con_close_socket() to reset our socket state to CLOSED.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:16:16 -07:00
Sage Weil
43c7427d10 libceph: clear all flags on con_close
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:16:02 -07:00
Sage Weil
4a86169208 libceph: clean up con flags
Rename flags with CON_FLAG prefix, move the definitions into the c file,
and (better) document their meaning.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:16:01 -07:00
Sage Weil
8dacc7da69 libceph: replace connection state bits with states
Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
and use those instead of the state bits.  All of the con->state checks are
now under the protection of the con mutex, so this is safe.  It also
simplifies many of the state checks because we can check for anything other
than the expected state instead of various bits for races we can think of.

This appears to hold up well to stress testing both with and without socket
failure injection on the server side.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:16:00 -07:00
Sage Weil
d7353dd5aa libceph: drop unnecessary CLOSED check in socket state change callback
If we are CLOSED, the socket is closed and we won't get these.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:59 -07:00
Sage Weil
ee76e0736d libceph: close socket directly from ceph_con_close()
It is simpler to do this immediately, since we already hold the con mutex.
It also avoids the need to deal with a not-quite-CLOSED socket in con_work.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:58 -07:00
Sage Weil
2e8cb10063 libceph: drop gratuitous socket close calls in con_work
If the state is CLOSED or OPENING, we shouldn't have a socket.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:58 -07:00
Sage Weil
a59b55a602 libceph: move ceph_con_send() closed check under the con mutex
Take the con mutex before checking whether the connection is closed to
avoid racing with someone else closing it.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:57 -07:00
Sage Weil
00650931e5 libceph: move msgr clear_standby under con mutex protection
Avoid dropping and retaking con->mutex in the ceph_con_send() case by
leaving locking up to the caller.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:56 -07:00
Sage Weil
3b5ede07b5 libceph: fix fault locking; close socket on lossy fault
If we fault on a lossy connection, we should still close the socket
immediately, and do so under the con mutex.

We should also take the con mutex before printing out the state bits in
the debug output.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:55 -07:00
Sage Weil
85effe183d libceph: reset connection retry on successfully negotiation
We exponentially back off when we encounter connection errors.  If several
errors accumulate, we will eventually wait ages before even trying to
reconnect.

Fix this by resetting the backoff counter after a successful negotiation/
connection with the remote node.  Fixes ceph issue #2802.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:34 -07:00
Sage Weil
5469155f2b libceph: protect ceph_con_open() with mutex
Take the con mutex while we are initiating a ceph open.  This is necessary
because the may have previously been in use and then closed, which could
result in a racing workqueue running con_work().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:33 -07:00
Sage Weil
a410702697 libceph: (re)initialize bio_iter on start of message receive
Previously, we were opportunistically initializing the bio_iter if it
appeared to be uninitialized in the middle of the read path.  The problem
is that a sequence like:

 - start reading message
 - initialize bio_iter
 - read half a message
 - messenger fault, reconnect
 - restart reading message
 - ** bio_iter now non-NULL, not reinitialized **
 - read past end of bio, crash

Instead, initialize the bio_iter unconditionally when we allocate/claim
the message for read.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:31 -07:00
Sage Weil
8c50c81756 libceph: fix mutex coverage for ceph_con_close
Hold the mutex while twiddling all of the state bits to avoid possible
races.  While we're here, make not of why we cannot close the socket
directly.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:30 -07:00
Sage Weil
3a140a0d5c libceph: report socket read/write error message
We need to set error_msg to something useful before calling ceph_fault();
do so here for try_{read,write}().  This is more informative than

libceph: osd0 192.168.106.220:6801 (null)

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:29 -07:00
Guanjun He
a2a3258417 libceph: prevent the race of incoming work during teardown
Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.

Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:53 -07:00
Sage Weil
a16cb1f707 libceph: fix messenger retry
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time.  This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:52 -07:00
Sage Weil
5bdca4e076 libceph: fix messenger retry
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time.  This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 19:35:59 -07:00
Sage Weil
fbb85a478f libceph: allow sock transition from CONNECTING to CLOSED
It is possible to close a socket that is in the OPENING state.  For
example, it can happen if ceph_con_close() is called on the con before
the TCP connection is established.  con_work() will come around and shut
down the socket.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:39 -07:00
Sage Weil
b7a9e5dd40 libceph: set peer name on con_open, not init
The peer name may change on each open attempt, even when the connection is
reused.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:35 -07:00
Alex Elder
bc18f4b1c8 libceph: add some fine ASCII art
Sage liked the state diagram I put in my commit description so
I'm putting it in with the code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:32 -07:00
Alex Elder
5821bd8ccd libceph: small changes to messenger.c
This patch gathers a few small changes in "net/ceph/messenger.c":
  out_msg_pos_next()
    - small logic change that mostly affects indentation
  write_partial_msg_pages().
    - use a local variable trail_off to represent the offset into
      a message of the trail portion of the data (if present)
    - once we are in the trail portion we will always be there, so we
      don't always need to check against our data position
    - avoid computing len twice after we've reached the trail
    - get rid of the variable tmpcrc, which is not needed
    - trail_off and trail_len never change so mark them const
    - update some comments
  read_partial_message_bio()
    - bio_iovec_idx() will never return an error, so don't bother
      checking for it

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:31 -07:00
Alex Elder
7593af920b libceph: distinguish two phases of connect sequence
Currently a ceph connection enters a "CONNECTING" state when it
begins the process of (re-)connecting with its peer.  Once the two
ends have successfully exchanged their banner and addresses, an
additional NEGOTIATING bit is set in the ceph connection's state to
indicate the connection information exhange has begun.  The
CONNECTING bit/state continues to be set during this phase.

Rather than have the CONNECTING state continue while the NEGOTIATING
bit is set, interpret these two phases as distinct states.  In other
words, when NEGOTIATING is set, clear CONNECTING.  That way only
one of them will be active at a time.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:30 -07:00
Alex Elder
ab166d5aa3 libceph: separate banner and connect writes
There are two phases in the process of linking together the two ends
of a ceph connection.  The first involves exchanging a banner and
IP addresses, and if that is successful a second phase exchanges
some detail about each side's connection capabilities.

When initiating a connection, the client side now queues to send
its information for both phases of this process at the same time.
This is probably a bit more efficient, but it is slightly messier
from a layering perspective in the code.

So rearrange things so that the client doesn't send the connection
information until it has received and processed the response in the
initial banner phase (in process_banner()).

Move the code (in the (con->sock == NULL) case in try_write()) that
prepares for writing the connection information, delaying doing that
until the banner exchange has completed.  Move the code that begins
the transition to this second "NEGOTIATING" phase out of
process_banner() and into its caller, so preparing to write the
connection information and preparing to read the response are
adjacent to each other.

Finally, preparing to write the connection information now requires
the output kvec to be reset in all cases, so move that into the
prepare_write_connect() and delete it from all callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:29 -07:00
Alex Elder
e27947c767 libceph: define and use an explicit CONNECTED state
There is no state explicitly defined when a ceph connection is fully
operational.  So define one.

It's set when the connection sequence completes successfully, and is
cleared when the connection gets closed.

Be a little more careful when examining the old state when a socket
disconnect event is reported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:27 -07:00
Alex Elder
3ec50d1868 libceph: clear NEGOTIATING when done
A connection state's NEGOTIATING bit gets set while in CONNECTING
state after we have successfully exchanged a ceph banner and IP
addresses with the connection's peer (the server).  But that bit
is not cleared again--at least not until another connection attempt
is initiated.

Instead, clear it as soon as the connection is fully established.
Also, clear it when a socket connection gets prematurely closed
in the midst of establishing a ceph connection (in case we had
reached the point where it was set).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:26 -07:00
Alex Elder
bb9e6bba5d libceph: clear CONNECTING in ceph_con_close()
A connection that is closed will no longer be connecting.  So
clear the CONNECTING state bit in ceph_con_close().  Similarly,
if the socket has been closed we no longer are in connecting
state (a new connect sequence will need to be initiated).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:24 -07:00
Alex Elder
456ea46865 libceph: don't touch con state in con_close_socket()
In con_close_socket(), a connection's SOCK_CLOSED flag gets set and
then cleared while its shutdown method is called and its reference
gets dropped.

Previously, that flag got set only if it had not already been set,
so setting it in con_close_socket() might have prevented additional
processing being done on a socket being shut down.  We no longer set
SOCK_CLOSED in the socket event routine conditionally, so setting
that bit here no longer provides whatever benefit it might have
provided before.

A race condition could still leave the SOCK_CLOSED bit set even
after we've issued the call to con_close_socket(), so we still clear
that bit after shutting the socket down.  Add a comment explaining
the reason for this.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:23 -07:00
Alex Elder
d65c9e0b9e libceph: just set SOCK_CLOSED when state changes
When a TCP_CLOSE or TCP_CLOSE_WAIT event occurs, the SOCK_CLOSED
connection flag bit is set, and if it had not been previously set
queue_con() is called to ensure con_work() will get a chance to
handle the changed state.

con_work() atomically checks--and if set, clears--the SOCK_CLOSED
bit if it was set.  This means that even if the bit were set
repeatedly, the related processing in con_work() only gets called
once per transition of the bit from 0 to 1.

What's important then is that we ensure con_work() gets called *at
least* once when a socket close event occurs, not that it gets
called *exactly* once.

The work queue mechanism already takes care of queueing work
only if it is not already queued, so there's no need for us
to call queue_con() conditionally.

So this patch just makes it so the SOCK_CLOSED flag gets set
unconditionally in ceph_sock_state_change().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:22 -07:00
Alex Elder
188048bce3 libceph: don't change socket state on sock event
Currently the socket state change event handler records an error
message on a connection to distinguish a close while connecting from
a close while a connection was already established.

Changing connection information during handling of a socket event is
not very clean, so instead move this assignment inside con_work(),
where it can be done during normal connection-level processing (and
under protection of the connection mutex as well).

Move the handling of a socket closed event up to the top of the
processing loop in con_work(); there's no point in handling backoff
etc. if we have a newly-closed socket to take care of.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:20 -07:00
Alex Elder
a8d00e3cde libceph: SOCK_CLOSED is a flag, not a state
The following commit changed it so SOCK_CLOSED bit was stored in
a connection's new "flags" field rather than its "state" field.

    libceph: start separating connection flags from state
    commit 928443cd

That bit is used in con_close_socket() to protect against setting an
error message more than once in the socket event handler function.

Unfortunately, the field being operated on in that function was not
updated to be "flags" as it should have been.  This fixes that
error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:19 -07:00
Alex Elder
abdaa6a849 libceph: don't use bio_iter as a flag
Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
    commit 43643528cc
    Author: Yan, Zheng <zheng.z.yan@intel.com>
    rbd: Clear ceph_msg->bio_iter for retransmitted message

We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()).  This means we no
longer need to use the msg->bio_iter field as a flag.

So just don't do that any more.  Trust prepare_write_message_data()
to ensure msg->bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:18 -07:00
Alex Elder
572c588eda libceph: move init of bio_iter
If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already.  This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:16 -07:00
Alex Elder
df6ad1f973 libceph: move init_bio_*() functions up
Move init_bio_iter() and iter_bio_next() up in their source file so
the'll be defined before they're needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:15 -07:00
Alex Elder
fd154f3c75 libceph: don't mark footer complete before it is
This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
flag before the CRC for the data portion (recorded in the footer)
has been completely computed.  Hold off setting the complete flag
until we've decided it's ready to send.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:13 -07:00
Alex Elder
84ca8fc87f libceph: encapsulate advancing msg page
In write_partial_msg_pages(), once all the data from a page has been
sent we advance to the next one.  Put the code that takes care of
this into its own function.

While modifying write_partial_msg_pages(), make its local variable
"in_trail" be Boolean, and use the local variable "msg" (which is
just the connection's current out_msg pointer) consistently.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:12 -07:00
Alex Elder
739c905baa libceph: encapsulate out message data setup
Move the code that prepares to write the data portion of a message
into its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:10 -07:00
Sage Weil
d59315ca8c libceph: drop ceph_con_get/put helpers and nref member
These are no longer used.  Every ceph_connection instance is embedded in
another structure, and refcounts manipulated via the get/put ops.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-06-22 08:13:45 -05:00
Sage Weil
36eb71aa57 libceph: use con get/put methods
The ceph_con_get/put() helpers manipulate the embedded con ref
count, which isn't used now that ceph_connections are embedded in
other structures.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-22 07:30:27 -05:00
Yan, Zheng
b132cf4c73 rbd: Clear ceph_msg->bio_iter for retransmitted message
The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 43643528cc)
2012-06-20 07:43:50 -05:00
Dan Carpenter
26ce171915 libceph: fix NULL dereference in reset_connection()
We dereference "con->in_msg" on the line after it was set to NULL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-19 08:52:33 -05:00
Sage Weil
9a64e8e0ac Linux 3.5-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJPyr4LAAoJEHm+PkMAQRiGhvMH/1uXaDmJiiyAtMhC9kQbLclK
 5RpUOV+ukRrPXBJhwWGEZvC9G/DiWAfZ/19Ee6qTGZbA46yxkgZklqO+bw7fuOLH
 dPf4MNXdhgOgbs0KkVAk6aXIYzIU836pcYg+LcapG8E8SZp3SWbJzrVbUPFwPM+m
 Sv11ZcpJfM2HH9wFRdKErUOiZHsMY+LZHcw0nx+BObytjgzBbzHNkpF57F714TUO
 QplYpIToO3XtGhIM1yRDxww+2zFlVNsCZ8IC57EDbLb8BMZWuyZoFgWZqLAnrU0u
 vy7CHLledMSvs855juJ9JxGo/EDnfwJpCnjmcp8BY+h4b5T/k5mGK6d9aeXYRf4=
 =CcWn
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc1'

Linux 3.5-rc1

Conflicts:
	net/ceph/messenger.c
2012-06-15 12:32:04 -07:00
Sage Weil
89a86be0ce libceph: transition socket state prior to actual connect
Once we call ->connect(), we are racing against the actual
connection, and a subsequent transition from CONNECTING ->
CONNECTED.  Set the state to CONNECTING before that, under the
protection of the mutex, to avoid the race.

This was introduced in 928443cd96,
with the original socket state code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-15 11:26:37 -07:00
Yan, Zheng
43643528cc rbd: Clear ceph_msg->bio_iter for retransmitted message
The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-07 08:27:33 -05:00
Alex Elder
8921d114f5 libceph: make ceph_con_revoke_message() a msg op
ceph_con_revoke_message() is passed both a message and a ceph
connection.  A ceph_msg allocated for incoming messages on a
connection always has a pointer to that connection, so there's no
need to provide the connection when revoking such a message.

Note that the existing logic does not preclude the message supplied
being a null/bogus message pointer.  The only user of this interface
is the OSD client, and the only value an osd client passes is a
request's r_reply field.  That is always non-null (except briefly in
an error path in ceph_osdc_alloc_request(), and that drops the
only reference so the request won't ever have a reply to revoke).
So we can safely assume the passed-in message is non-null, but add a
BUG_ON() to make it very obvious we are imposing this restriction.

Rename the function ceph_msg_revoke_incoming() to reflect that it is
really an operation on an incoming message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:55 -05:00
Alex Elder
6740a845b2 libceph: make ceph_con_revoke() a msg operation
ceph_con_revoke() is passed both a message and a ceph connection.
Now that any message associated with a connection holds a pointer
to that connection, there's no need to provide the connection when
revoking a message.

This has the added benefit of precluding the possibility of the
providing the wrong connection pointer.  If the message's connection
pointer is null, it is not being tracked by any connection, so
revoking it is a no-op.  This is supported as a convenience for
upper layers, so they can revoke a message that is not actually
"in flight."

Rename the function ceph_msg_revoke() to reflect that it is really
an operation on a message, not a connection.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
92ce034b5a libceph: have messages take a connection reference
There are essentially two types of ceph messages: incoming and
outgoing.  Outgoing messages are always allocated via ceph_msg_new(),
and at the time of their allocation they are not associated with any
particular connection.  Incoming messages are always allocated via
ceph_con_in_msg_alloc(), and they are initially associated with the
connection from which incoming data will be placed into the message.

When an outgoing message gets sent, it becomes associated with a
connection and remains that way until the message is successfully
sent.  The association of an incoming message goes away at the point
it is sent to an upper layer via a con->ops->dispatch method.

This patch implements reference counting for all ceph messages, such
that every message holds a reference (and a pointer) to a connection
if and only if it is associated with that connection (as described
above).


For background, here is an explanation of the ceph message
lifecycle, emphasizing when an association exists between a message
and a connection.

Outgoing Messages
An outgoing message is "owned" by its allocator, from the time it is
allocated in ceph_msg_new() up to the point it gets queued for
sending in ceph_con_send().  Prior to that point the message's
msg->con pointer is null; at the point it is queued for sending its
message pointer is assigned to refer to the connection.  At that
time the message is inserted into a connection's out_queue list.

When a message on the out_queue list has been sent to the socket
layer to be put on the wire, it is transferred out of that list and
into the connection's out_sent list.  At that point it is still owned
by the connection, and will remain so until an acknowledgement is
received from the recipient that indicates the message was
successfully transferred.  When such an acknowledgement is received
(in process_ack()), the message is removed from its list (in
ceph_msg_remove()), at which point it is no longer associated with
the connection.

So basically, any time a message is on one of a connection's lists,
it is associated with that connection.  Reference counting outgoing
messages can thus be done at the points a message is added to the
out_queue (in ceph_con_send()) and the point it is removed from
either its two lists (in ceph_msg_remove())--at which point its
connection pointer becomes null.

Incoming Messages
When an incoming message on a connection is getting read (in
read_partial_message()) and there is no message in con->in_msg,
a new one is allocated using ceph_con_in_msg_alloc().  At that
point the message is associated with the connection.  Once that
message has been completely and successfully read, it is passed to
upper layer code using the connection's con->ops->dispatch method.
At that point the association between the message and the connection
no longer exists.

Reference counting of connections for incoming messages can be done
by taking a reference to the connection when the message gets
allocated, and releasing that reference when it gets handed off
using the dispatch method.

We should never fail to get a connection reference for a
message--the since the caller should already hold one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
38941f8031 libceph: have messages point to their connection
When a ceph message is queued for sending it is placed on a list of
pending messages (ceph_connection->out_queue).  When they are
actually sent over the wire, they are moved from that list to
another (ceph_connection->out_sent).  When acknowledgement for the
message is received, it is removed from the sent messages list.

During that entire time the message is "in the possession" of a
single ceph connection.  Keep track of that connection in the
message.  This will be used in the next patch (and is a helpful
bit of information for debugging anyway).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
1c20f2d267 libceph: tweak ceph_alloc_msg()
The function ceph_alloc_msg() is only used to allocate a message
that will be assigned to a connection's in_msg pointer.  Rename the
function so this implied usage is more clear.

In addition, make that assignment inside the function (again, since
that's precisely what it's intended to be used for).  This allows us
to return what is now provided via the passed-in address of a "skip"
variable.  The return type is now Boolean to be explicit that there
are only two possible outcomes.

Make sure the result of an ->alloc_msg method call always sets the
value of *skip properly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
1bfd89f4e6 libceph: fully initialize connection in con_init()
Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init().  Rearrange the arguments so the connection pointer
is first.  Hide the byte-swapping of the peer entity number inside
ceph_con_init()

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
a5988c490e libceph: set CLOSED state bit in con_init
Once a connection is fully initialized, it is really in a CLOSED
state, so make that explicit by setting the bit in its state field.

It is possible for a connection in NEGOTIATING state to get a
failure, leading to ceph_fault() and ultimately ceph_con_close().
Clear that bits if it is set in that case, to reflect that the
connection truly is closed and is no longer participating in a
connect sequence.

Issue a warning if ceph_con_open() is called on a connection that
is not in CLOSED state.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
ce2c8903e7 libceph: start tracking connection socket state
Start explicitly keeping track of the state of a ceph connection's
socket, separate from the state of the connection itself.  Create
placeholder functions to encapsulate the state transitions.

    --------
    | NEW* |  transient initial state
    --------
        | con_sock_state_init()
        v
    ----------
    | CLOSED |  initialized, but no socket (and no
    ----------  TCP connection)
     ^      \
     |       \ con_sock_state_connecting()
     |        ----------------------
     |                              \
     + con_sock_state_closed()       \
     |\                               \
     | \                               \
     |  -----------                     \
     |  | CLOSING |  socket event;       \
     |  -----------  await close          \
     |       ^                            |
     |       |                            |
     |       + con_sock_state_closing()   |
     |      / \                           |
     |     /   ---------------            |
     |    /                   \           v
     |   /                    --------------
     |  /    -----------------| CONNECTING |  socket created, TCP
     |  |   /                 --------------  connect initiated
     |  |   | con_sock_state_connected()
     |  |   v
    -------------
    | CONNECTED |  TCP connection established
    -------------

Make the socket state an atomic variable, reinforcing that it's a
distinct transtion with no possible "intermediate/both" states.
This is almost certainly overkill at this point, though the
transitions into CONNECTED and CLOSING state do get called via
socket callback (the rest of the transitions occur with the
connection mutex held).  We can back out the atomicity later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
928443cd96 libceph: start separating connection flags from state
A ceph_connection holds a mixture of connection state (as in "state
machine" state) and connection flags in a single "state" field.  To
make the distinction more clear, define a new "flags" field and use
it rather than the "state" field to hold Boolean flag values.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
15d9882c33 libceph: embed ceph messenger structure in ceph_client
A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.

Switch the ceph_client structure to embed its ceph_messenger
structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
e22004235a libceph: rename kvec_reset and kvec_add functions
The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
are entirely private functions, so drop the "ceph_" prefix in their
name to make them slightly more wieldy.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:55 -05:00