Commit Graph

845 Commits

Author SHA1 Message Date
Alex Elder
f759ebb968 libceph: skip message if too big to receive
We know the length of our message buffers.  If we get a message
that's too long, just dump it and ignore it.  If skip was set
then con->in_msg won't be valid, so be careful not to dereference
a null pointer in the process.

This resolves:
    http://tracker.ceph.com/issues/4664

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:27 -07:00
Alex Elder
ea96571f7b libceph: fix possible CONFIG_BLOCK build problem
This patch:
    15a0d7b libceph: record message data length
did not enclose some bio-specific code inside CONFIG_BLOCK as
it should have.  Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:26 -07:00
Alex Elder
5476492fba libceph: kill off osd request r_data_in and r_data_out
Finally!  Convert the osd op data pointers into real structures, and
make the switch over to using them instead of having all ops share
the in and/or out data structures in the osd request.

Set up a new function to traverse the set of ops and release any
data associated with them (pages).

This and the patches leading up to it resolve:
    http://tracker.ceph.com/issues/4657

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:25 -07:00
Alex Elder
ec9123c567 libceph: set the data pointers when encoding ops
Still using the osd request r_data_in and r_data_out pointer, but
we're basically only referring to it via the data pointers in the
osd ops.  And we're transferring that information to the request
or reply message only when the op indicates it's needed, in
osd_req_encode_op().

To avoid a forward reference, ceph_osdc_msg_data_set() was moved up
in the file.

Don't bother calling ceph_osd_data_init(), in ceph_osd_alloc(),
because the ops array will already be zeroed anyway.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:24 -07:00
Alex Elder
a4ce40a9a7 libceph: combine initializing and setting osd data
This ends up being a rather large patch but what it's doing is
somewhat straightforward.

Basically, this is replacing two calls with one.  The first of the
two calls is initializing a struct ceph_osd_data with data (either a
page array, a page list, or a bio list); the second is setting an
osd request op so it associates that data with one of the op's
parameters.  In place of those two will be a single function that
initializes the op directly.

That means we sort of fan out a set of the needed functions:
    - extent ops with pages data
    - extent ops with pagelist data
    - extent ops with bio list data
and
    - class ops with page data for receiving a response

We also have define another one, but it's only used internally:
    - class ops with pagelist data for request parameters

Note that we *still* haven't gotten rid of the osd request's
r_data_in and r_data_out fields.  All the osd ops refer to them for
their data.  For now, these data fields are pointers assigned to the
appropriate r_data_* field when these new functions are called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:23 -07:00
Alex Elder
39b44cbe86 libceph: set message data when building osd request
All calls of ceph_osdc_start_request() are preceded (in the case of
rbd, almost) immediately by a call to ceph_osdc_build_request().

Move the build calls at the top of ceph_osdc_start_request() out of
there and into the ceph_osdc_build_request().  Nothing prevents
moving these calls to the top of ceph_osdc_build_request(), either
(and we're going to want them there in the next patch) so put them
at the top.

This and the next patch are related to:
    http://tracker.ceph.com/issues/4657

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:22 -07:00
Alex Elder
e65550fd94 libceph: move ceph_osdc_build_request()
This simply moves ceph_osdc_build_request() later in its source
file without any change.  Done as a separate patch to facilitate
review of the change in the next patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:21 -07:00
Alex Elder
5f562df5f5 libceph: format class info at init time
An object class method is formatted using a pagelist which contains
the class name, the method name, and the data concatenated into an
osd request's outbound data.

Currently when a class op is initialized in osd_req_op_cls_init(),
the lengths of and pointers to these three items are recorded.
Later, when the op is getting formatted into the request message, a
new pagelist is created and that is when these items get copied into
the pagelist.

This patch makes it so the pagelist to hold these items is created
when the op is initialized instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:19 -07:00
Alex Elder
c99d2d4abb libceph: specify osd op by index in request
An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the the osd request's array.

So rather than supplying the address of the op to initialize, have
caller specify the osd request and an indication of which op it
would like to initialize.  This better hides the details the
op structure (and faciltates moving the data pointers they use).

Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope.  Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:15 -07:00
Alex Elder
8c042b0df9 libceph: add data pointers in osd op structures
An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message.  Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.

Add a ceph_osd_data pointer to each of those structures, and assign
it to point to eithre the incoming or the outgoing data structure in
the osd message.  The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.

Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient.  Add some assertions to verify
pointers are always set the way they're expected to be.

This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.

This is the first in a series of patches that resolve:
    http://tracker.ceph.com/issues/4657

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:14 -07:00
Alex Elder
54d5064912 libceph: rename data out field in osd request op
There are fields "indata" and "indata_len" defined the ceph osd
request op structure.  The "in" part is with from the point of view
of the osd server, but is a little confusing here on the client
side.  Change their names to use "request" instead of "in" to
indicate that it defines data provided with the request (as opposed
the data returned in the response).

Rename the local variable in osd_req_encode_op() to match.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:13 -07:00
Alex Elder
79528734f3 libceph: keep source rather than message osd op array
An osd request keeps a pointer to the osd operations (ops) array
that it builds in its request message.

In order to allow each op in the array to have its own distinct
data, we will need to keep track of each op's data, and that
information does not go over the wire.

As long as we're tracking the data we might as well just track the
entire (source) op definition for each of the ops.  And if we're
doing that, we'll have no more need to keep a pointer to the
wire-encoded version.

This patch makes the array of source ops be kept with the osd
request structure, and uses that instead of the version encoded in
the message in places where that was previously used.  The array
will be embedded in the request structure, and the maximum number of
ops we ever actually use is currently 2.  So reduce CEPH_OSD_MAX_OP
to 2 to reduce the size of the structure.

The result of doing this sort of ripples back up, and as a result
various function parameters and local variables become unnecessary.

Make r_num_ops be unsigned, and move the definition of struct
ceph_osd_req_op earlier to ensure it's defined where needed.

It does not yet add per-op data, that's coming soon.

This resolves:
    http://tracker.ceph.com/issues/4656

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:12 -07:00
Alex Elder
23c08a9cb2 libceph: define ceph_osd_data_length()
One more osd data helper, which returns the length of the
data item, regardless of its type.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:09 -07:00
Alex Elder
c54d47bfad libceph: define a few more helpers
Define ceph_osd_data_init() and ceph_osd_data_release() to clean up
a little code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:08 -07:00
Alex Elder
43bfe5de9f libceph: define osd data initialization helpers
Define and use functions that encapsulate the initializion of a
ceph_osd_data structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:06 -07:00
Alex Elder
9fc6e06471 libceph: compute incoming bytes once
This is a simple change, extracting the number of incoming data
bytes just once in handle_reply().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:05 -07:00
Alex Elder
98fa5dd883 libceph: provide data length when preparing message
In prepare_message_data(), the length used to initialize the cursor
is taken from the header of the message provided.  I'm working
toward not using the header data length field to determine length in
outbound messages, and this is a step in that direction.  For
inbound messages this will be set to be the actual number of bytes
that are arriving (which may be less than the total size of the data
buffer available).

This resolves:
    http://tracker.ceph.com/issues/4589

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:03 -07:00
Alex Elder
e5975c7c8e ceph: build osd request message later for writepages
Hold off building the osd request message in ceph_writepages_start()
until just before it will be submitted to the osd client for
execution.

We'll still create the request and allocate the page pointer array
after we learn we have at least one page to write.  A local variable
will be used to keep track of the allocated array of pages.  Wait
until just before submitting the request for assigning that page
array pointer to the request message.

Create ands use a new function osd_req_op_extent_update() whose
purpose is to serve this one spot where the length value supplied
when an osd request's op was initially formatted might need to get
changed (reduced, never increased) before submitting the request.

Previously, ceph_writepages_start() assigned the message header's
data length because of this update.  That's no longer necessary,
because ceph_osdc_build_request() will recalculate the right
value to use based on the content of the ops in the request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:02 -07:00
Alex Elder
02ee07d300 libceph: hold off building osd request
Defer building the osd request until just before submitting it in
all callers except ceph_writepages_start().  (That caller will be
handed in the next patch.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:01 -07:00
Alex Elder
acead002b2 libceph: don't build request in ceph_osdc_new_request()
This patch moves the call to ceph_osdc_build_request() out of
ceph_osdc_new_request() and into its caller.

This is in order to defer formatting osd operation information into
the request message until just before request is started.

The only unusual (ab)user of ceph_osdc_build_request() is
ceph_writepages_start(), where the final length of write request may
change (downward) based on the current inode size or the oldest
snapshot context with dirty data for the inode.

The remaining callers don't change anything in the request after has
been built.

This means the ops array is now supplied by the caller.  It also
means there is no need to pass the mtime to ceph_osdc_new_request()
(it gets provided to ceph_osdc_build_request()).  And rather than
passing a do_sync flag, have the number of ops in the ops array
supplied imply adding a second STARTSYNC operation after the READ or
WRITE requested.

This and some of the patches that follow are related to having the
messenger (only) be responsible for filling the content of the
message header, as described here:
    http://tracker.ceph.com/issues/4589

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:58 -07:00
Alex Elder
a193080481 libceph: record message data length
Keep track of the length of the data portion for a message in a
separate field in the ceph_msg structure.  This information has
been maintained in wire byte order in the message header, but
that's going to change soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:57 -07:00
Alex Elder
ace6d3a96f libceph: drop ceph_osd_request->r_con_filling_msg
A field in an osd request keeps track of whether a connection is
currently filling the request's reply message.  This patch gets rid
of that field.

An osd request includes two messages--a request and a reply--and
they're both associated with the connection that existed to its
the target osd at the time the request was created.

An osd request can be dropped early, even when it's in flight.
And at that time both messages are released.  It's possible the
reply message has been supplied to its connection to receive
an incoming response message at the time the osd request gets
dropped.  So ceph_osdc_release_request() revokes that message
from the connection before releasing it so things get cleaned up
properly.

Previously this may have caused a problem, because the connection
that a message was associated with might have gone away before the
revoke request.  And to avoid any problems using that connection,
the osd client held a reference to it when it supplies its response
message.

However since this commit:
    38941f80 libceph: have messages point to their connection
all messages hold a reference to the connection they are associated
with whenever the connection is actively operating on the message
(i.e. while the message is queued to send or sending, and when it
data is being received into it).  And if a message has no connection
associated with it, ceph_msg_revoke_incoming() won't do anything
when asked to revoke it.

As a result, there is no need to keep an additional reference to the
connection associated with a message when we hand the message to the
messenger when it calls our alloc_msg() method to receive something.
If the connection *were* operating on it, it would have its own
reference, and if not, there's no work to be done when we need to
revoke it.

So get rid of the osd request's r_con_filling_msg field.

This resolves:
    http://tracker.ceph.com/issues/4647

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:54 -07:00
Alex Elder
ef4859d647 libceph: define ceph_decode_pgid() only once
There are two basically identical definitions of __decode_pgid()
in libceph, one in "net/ceph/osdmap.c" and the other in
"net/ceph/osd_client.c".  Get rid of both, and instead define
a single inline version in "include/linux/ceph/osdmap.h".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:52 -07:00
Alex Elder
8058fd4503 libceph: drop mutex on error in handle_reply()
The osd client mutex is acquired just before getting a reference to
a request in handle_reply().  However the error paths after that
don't drop the mutex before returning as they should.

Drop the mutex after dropping the request reference.  Also add a
bad_mutex label at that point and use it so the failed request
lookup case can be handled with the rest.

This resolves:
    http://tracker.ceph.com/issues/4615

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:51 -07:00
Alex Elder
b0270324c5 libceph: use osd_req_op_extent_init()
Use osd_req_op_extent_init() in ceph_osdc_new_request() to
initialize the one or two ops built in that function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:49 -07:00
Alex Elder
d18d1e2807 libceph: clean up ceph_osd_new_request()
All callers of ceph_osd_new_request() pass either CEPH_OSD_OP_READ
or CEPH_OSD_OP_WRITE as the opcode value.  The function assumes it
by filling in the extent fields in the ops array it builds.  So just
assert that is the case, and don't bother calling op_has_extent()
before filling in the first osd operation in the array.

Define some local variables to gather the information to fill into
the first op, and then fill in the op array all in one place.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:48 -07:00
Alex Elder
a19dadfba9 libceph: don't update op in calc_layout()
The ceph_osdc_new_request() an array of osd operations is built up
and filled in partially within that function and partially in the
called function calc_layout().  Move the latter part back out to
ceph_osdc_new_request() so it's all done in one place.  This makes
it unnecessary to pass the op pointer to calc_layout(), so get rid
of that parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:47 -07:00
Alex Elder
75d1c941e5 libceph: pass offset and length out of calc_layout()
The purpose of calc_layout() is to determine, given a file offset
and length and a layout describing the placement of file data across
objects, where in "object space" that data resides.

Specifically, it determines which object should hold the first part
of the specified range of file data, and the offset and length of
data within that object.  The length will not exceed the bounds
of the object, and the caller is informed of that maximum length.

Add two parameters to calc_layout() to allow the object-relative
offset and length to be passed back to the caller.

This is the first steps toward having ceph_osdc_new_request() build
its osd op structure using osd_req_op_extent_init().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:46 -07:00
Alex Elder
33803f3300 libceph: define source request op functions
The rbd code has a function that allocates and populates a
ceph_osd_req_op structure (the in-core version of an osd request
operation).  When reviewed, Josh suggested two things: that the
big varargs function might be better split into type-specific
functions; and that this functionality really belongs in the osd
client rather than rbd.

This patch implements both of Josh's suggestions.  It breaks
up the rbd function into separate functions and defines them
in the osd client module as exported interfaces.  Unlike the
rbd version, however, the functions don't allocate an osd_req_op
structure; they are provided the address of one and that is
initialized instead.

The rbd function has been eliminated and calls to it have been
replaced by calls to the new routines.  The rbd code now now use a
stack (struct) variable to hold the op rather than allocating and
freeing it each time.

For now only the capabilities used by rbd are implemented.
Implementing all the other osd op types, and making the rest of the
code use it will be done separately, in the next few patches.

Note that only the extent, cls, and watch portions of the
ceph_osd_req_op structure are currently used.  Delete the others
(xattr, pgls, and snap) from its definition so nobody thinks it's
actually implemented or needed.  We can add it back again later
if needed, when we know it's been tested.

This (and a few follow-on patches) resolves:
    http://tracker.ceph.com/issues/3861

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:45 -07:00
Alex Elder
a8dd0a37bc libceph: define osd_req_opcode_valid()
Define a separate function to determine the validity of an opcode,
and use it inside osd_req_encode_op() in order to unclutter that
function.

Don't update the destination op at all--and return zero--if an
unsupported or unrecognized opcode is seen in osd_req_encode_op().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:44 -07:00
Alex Elder
0baa1bd9b6 libceph: be explicit in masking bottom 16 bits
In ceph_osdc_build_request() there is a call to cpu_to_le16() which
provides a 64-bit value as its argument.  Because of the implied
byte swapping going on it looked pretty suspect to me.

At the moment it turns out the behavior is well defined, but masking
off those bottom bits explicitly eliminates this distraction, and is
in fact more directly related to the purpose of the message header's
data_off field.

This resolves:
    http://tracker.ceph.com/issues/4125

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:41 -07:00
Alex Elder
56fc565916 libceph: account for alignment in pages cursor
When a cursor for a page array data message is initialized it needs
to determine the initial value for cursor->last_piece.  Currently it
just checks if length is less than a page, but that's not correct.
The data in the first page in the array will be offset by a page
offset based on the alignment recorded for the data.  (All pages
thereafter will be aligned at the base of the page, so there's
no need to account for this except for the first page.)

Because this was wrong, there was a case where the length of a piece
would be calculated as all of the residual bytes in the message and
that plus the page offset could exceed the length of a page.

So fix this case.  Make sure the sum won't wrap.

This resolves a third issue described in:
    http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:40 -07:00
Alex Elder
5df521b1ee libceph: page offset must be less than page size
Currently ceph_msg_data_pages_advance() allows the page offset value
to be PAGE_SIZE, apparently assuming ceph_msg_data_pages_next() will
treat it as 0.  But that doesn't happen, and the result led to a
helpful assertion failure.

Change ceph_msg_data_pages_advance() to truncate the offset to 0
before returning if it reaches PAGE_SIZE.

Make a few other minor adjustments in this area (comments and a
better assertion) while modifying it.

This resolves a second issue described in:
    http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:39 -07:00
Alex Elder
1190bf06a6 libceph: fix broken data length assertions
It's OK for the result of a read to come back with fewer bytes than
were requested.  So don't trigger a BUG() in that case when
initializing the data cursor.

This resolves the first problem described in:
    http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:38 -07:00
Alex Elder
6644ed7b7e libceph: make message data be a pointer
Begin the transition from a single message data item to a list of
them by replacing the "data" structure in a message with a pointer
to a ceph_msg_data structure.

A null pointer will indicate the message has no data; replace the
use of ceph_msg_has_data() with a simple check for a null pointer.

Create functions ceph_msg_data_create() and ceph_msg_data_destroy()
to dynamically allocate and free a data item structure of a given type.

When a message has its data item "set," allocate one of these to
hold the data description, and free it when the last reference to
the message is dropped.

This partially resolves:
    http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:37 -07:00
Alex Elder
8ea299bcbc libceph: use only ceph_msg_data_advance()
The *_msg_pos_next() functions do little more than call
ceph_msg_data_advance().  Replace those wrapper functions with
a simple call to ceph_msg_data_advance().

This cleanup is related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:36 -07:00
Alex Elder
143334ff44 libceph: don't add to crc unless data sent
In write_partial_message_data() we aggregate the crc for the data
portion of the message as each new piece of the data item is
encountered.  Because it was computed *before* sending the data, if
an attempt to send a new piece resulted in 0 bytes being sent, the
crc crc across that piece would erroneously get computed again and
added to the aggregate result.  This would occasionally happen in
the evnet of a connection failure.

The crc value isn't really needed until the complete value is known
after sending all data, so there's no need to compute it before
sending.

So don't calculate the crc for a piece until *after* we know at
least one byte of it has been sent.  That will avoid this problem.

This resolves:
    http://tracker.ceph.com/issues/4450

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:35 -07:00
Alex Elder
f5db90bcf2 libceph: kill last of ceph_msg_pos
The only remaining field in the ceph_msg_pos structure is
did_page_crc.  In the new cursor model of things that flag (or
something like it) belongs in the cursor.

Define a new field "need_crc" in the cursor (which applies to all
types of data) and initialize it to true whenever a cursor is
initialized.

In write_partial_message_data(), the data CRC still will be computed
as before, but it will check the cursor->need_crc field to determine
whether it's needed.  Any time the cursor is advanced to a new piece
of a data item, need_crc will be set, and this will cause the crc
for that entire piece to be accumulated into the data crc.

In write_partial_message_data() the intermediate crc value is now
held in a local variable so it doesn't have to be byte-swapped so
many times.  In read_partial_msg_data() we do something similar
(but mainly for consistency there).

With that, the ceph_msg_pos structure can go away,  and it no longer
needs to be passed as an argument to prepare_message_data().

This cleanup is related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:34 -07:00
Alex Elder
859a35d552 libceph: kill most of ceph_msg_pos
All but one of the fields in the ceph_msg_pos structure are now
never used (only assigned), so get rid of them.  This allows
several small blocks of code to go away.

This is cleanup of old code related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:33 -07:00
Alex Elder
643c68a4a9 libceph: use cursor resid for loop condition
Use the "resid" field of a cursor rather than finding when the
message data position has moved up to meet the data length to
determine when all data has been sent or received in
write_partial_message_data() and read_partial_msg_data().

This is cleanup of old code related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:32 -07:00
Alex Elder
4c59b4a278 libceph: collapse all data items into one
It turns out that only one of the data item types is ever used at
any one time in a single message (currently).
    - A page array is used by the osd client (on behalf of the file
      system) and by rbd.  Only one osd op (and therefore at most
      one data item) is ever used at a time by rbd.  And the only
      time the file system sends two, the second op contains no
      data.
    - A bio is only used by the rbd client (and again, only one
      data item per message)
    - A page list is used by the file system and by rbd for outgoing
      data, but only one op (and one data item) at a time.

We can therefore collapse all three of our data item fields into a
single field "data", and depend on the messenger code to properly
handle it based on its type.

This allows us to eliminate quite a bit of duplicated code.

This is related to:
    http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:30 -07:00
Alex Elder
686be20875 libceph: get rid of read helpers
Now that read_partial_message_pages() and read_partial_message_bio()
are literally identical functions we can factor them out.  They're
pretty simple as well, so just move their relevant content into
read_partial_msg_data().

This is and previous patches together resolve:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:29 -07:00
Alex Elder
61fcdc97c0 libceph: no outbound zero data
There is handling in write_partial_message_data() for the case where
only the length of--and no other information about--the data to be
sent has been specified.  It uses the zero page as the source of
data to send in this case.

This case doesn't occur.  All message senders set up a page array,
pagelist, or bio describing the data to be sent.  So eliminate the
block of code that handles this (but check and issue a warning for
now, just in case it happens for some reason).

This resolves:
    http://tracker.ceph.com/issues/4426

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:28 -07:00
Alex Elder
878efabd32 libceph: use cursor for inbound data pages
The cursor code for a page array selects the right page, page
offset, and length to use for a ceph_tcp_recvpage() call, so
we can use it to replace a block in read_partial_message_pages().

This partially resolves:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:27 -07:00
Alex Elder
6518be47f9 libceph: kill ceph message bio_iter, bio_seg
The bio_iter and bio_seg fields in a message are no longer used, we
use the cursor instead.  So get rid of them and the functions that
operate on them them.

This is related to:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:26 -07:00
Alex Elder
463207aa40 libceph: use cursor for bio reads
Replace the use of the information in con->in_msg_pos for incoming
bio data.  The old in_msg_pos and the new cursor mechanism do
basically the same thing, just slightly differently.

The main functional difference is that in_msg_pos keeps track of the
length of the complete bio list, and assumed it was fully consumed
when that many bytes had been transferred.  The cursor does not assume
a length, it simply consumes all bytes in the bio list.  Because the
only user of bio data is the rbd client, and because the length of a
bio list provided by rbd client always matches the number of bytes
in the list, both ways of tracking length are equivalent.

In addition, for in_msg_pos the initial bio vector is selected as
the initial value of the bio->bi_idx, while the cursor assumes this
is zero.  Again, the rbd client always passes 0 as the initial index
so the effect is the same.

Other than that, they basically match:
    in_msg_pos      cursor
    ----------      ------
    bio_iter        bio
    bio_seg         vec_index
    page_pos        page_offset

The in_msg_pos field is initialized by a call to init_bio_iter().
The bio cursor is initialized by ceph_msg_data_cursor_init().
Both now happen in the same spot, in prepare_message_data().

The in_msg_pos field is advanced by a call to in_msg_pos_next(),
which updates page_pos and calls iter_bio_next() to move to the next
bio vector, or to the next bio in the list.  The cursor is advanced
by ceph_msg_data_advance().  That isn't currently happening so
add a call to that in in_msg_pos_next().

Finally, the next piece of data to use for a read is determined
by a bunch of lines in read_partial_message_bio().  Those can be
replaced by an equivalent ceph_msg_data_bio_next() call.

This partially resolves:
    http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:25 -07:00
Alex Elder
25aff7c559 libceph: record residual bytes for all message data types
All of the data types can use this, not just the page array.  Until
now, only the bio type doesn't have it available, and only the
initiator of the request (the rbd client) is able to supply the
length of the full request without re-scanning the bio list.  Change
the cursor init routines so the length is supplied based on the
message header "data_len" field, and use that length to intiialize
the "resid" field of the cursor.

In addition, change the way "last_piece" is defined so it is based
on the residual number of bytes in the original request.  This is
necessary (at least for bio messages) because it is possible for
a read request to succeed without consuming all of the space
available in the data buffer.

This resolves:
    http://tracker.ceph.com/issues/4427

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:24 -07:00
Alex Elder
28a89ddece libceph: drop pages parameter
The value passed for "pages" in read_partial_message_pages() is
always the pages pointer from the incoming message, which can be
derived inside that function.  So just get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:23 -07:00
Alex Elder
888334f966 libceph: initialize data fields on last msg put
When the last reference to a ceph message is dropped,
ceph_msg_last_put() is called to clean things up.

For "normal" messages (allocated via ceph_msg_new() rather than
being allocated from a memory pool) it's sufficient to just release
resources.  But for a mempool-allocated message we actually have to
re-initialize the data fields in the message back to initial state
so they're ready to go in the event the message gets reused.

Some of this was already done; this fleshes it out so it's done
more completely.

This resolves:
    http://tracker.ceph.com/issues/4540

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:22 -07:00
Alex Elder
7e2766a113 libceph: send queued requests when starting new one
An osd expects the transaction ids of arriving request messages from
a given client to a given osd to increase monotonically.  So the osd
client needs to send its requests in ascending tid order.

The transaction id for a request is set at the time it is
registered, in __register_request().  This is also where the request
gets placed at the end of the osd client's unsent messages list.

At the end of ceph_osdc_start_request(), the request message for a
newly-mapped osd request is supplied to the messenger to be sent
(via __send_request()).  If any other messages were present in the
osd client's unsent list at that point they would be sent *after*
this new request message.

Because those unsent messages have already been registered, their
tids would be lower than the newly-mapped request message, and
sending that message first can violate the tid ordering rule.

Rather than sending the new request only, send all queued requests
(including the new one) at that point in ceph_osdc_start_request().
This ensures the tid ordering property is preserved.

With this in place, all messages should now be sent in tid order
regardless of whether they're being sent for the first time or
re-sent as a result of a call to osd_reset().

This resolves:
    http://tracker.ceph.com/issues/4392

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:21 -07:00
Alex Elder
ad885927de libceph: keep request lists in tid order
In __map_request(), when adding a request to an osd client's unsent
list, add it to the tail rather than the head.  That way the newest
entries (with the highest tid value) will be last.

Maintain an osd's request list in order of increasing tid also.

Finally--to be consistent--maintain an osd client's "notarget" list
in that order as well.

This partially resolves:
    http://tracker.ceph.com/issues/4392

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:19 -07:00
Alex Elder
e02493c07c libceph: requeue only sent requests when kicking
The osd expects incoming requests for a given object from a given
client to arrive in order, with the tid for each request being
greater than the tid for requests that have already arrived.  This
patch fixes two places the osd client might not maintain that
ordering.

For the osd client, the connection fault method is osd_reset().
That function calls __reset_osd() to close and re-open the
connection, then calls __kick_osd_requests() to cause all
outstanding requests for the affected osd to be re-sent after
the connection has been re-established.

When an osd is reset, any in-flight messages will need to be
re-sent.  An osd client maintains distinct lists for unsent and
in-flight messages.  Meanwhile, an osd maintains a single list of
all its requests (both sent and un-sent).  (Each message is linked
into two lists--one for the osd client and one list for the osd.)

To process an osd "kick" operation, the request list for the *osd*
is traversed, and each request is moved off whichever osd *client*
list it was on (unsent or sent) and placed onto the osd client's
unsent list.  (It remains where it is on the osd's request list.)

When that is done, osd_reset() calls __send_queued() to cause each
of the osd client's unsent messages to be sent.

OK, with that background...

As the osd request list is traversed each request is prepended to
the osd client's unsent list in the order they're seen.  The effect
of this is to reverse the order of these requests as they are put
(back) onto the unsent list.

Instead, build up a list of only the requests for an osd that have
already been sent (by checking their r_sent flag values).  Once an
unsent request is found, stop examining requests and prepend the
requests that need re-sending to the osd client's unsent list.

Preserve the original order of requests in the process (previously
re-queued requests were reversed in this process).  Because they
have already been sent, they will have lower tids than any request
already present on the unsent list.

Just below that, traverse the linger list in forward order as
before, but add them to the *tail* of the list rather than the head.
These requests get re-registered, and in the process are give a new
(higher) tid, so the should go at the end.

This partially resolves:
    http://tracker.ceph.com/issues/4392

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:18 -07:00
Alex Elder
92451b4910 libceph: no more kick_requests() race
Since we no longer drop the request mutex between registering and
mapping an osd request in ceph_osdc_start_request(), there is no
chance of a race with kick_requests().

We can now therefore map and send the new request unconditionally
(but we'll issue a warning should it ever occur).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:17 -07:00
Alex Elder
dc4b870c97 libceph: slightly defer registering osd request
One of the first things ceph_osdc_start_request() does is register
the request.  It then acquires the osd client's map semaphore and
request mutex and proceeds to map and send the request.

There is no reason the request has to be registered before acquiring
the map semaphore.  So hold off doing so until after the map
semaphore is held.

Since register_request() is nothing more than a wrapper around
__register_request(), call the latter function instead, after
acquiring the request mutex.

That leaves register_request() unused, so get rid of it.

This partially resolves:
    http://tracker.ceph.com/issues/4392

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:16 -07:00
Sage Weil
e9966076cd libceph: wrap auth methods in a mutex
The auth code is called from a variety of contexts, include the mon_client
(protected by the monc's mutex) and the messenger callbacks (currently
protected by nothing).  Avoid chaos by protecting all auth state with a
mutex.  Nothing is blocking, so this should be simple and lightweight.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:15 -07:00
Sage Weil
27859f9773 libceph: wrap auth ops in wrapper functions
Use wrapper functions that check whether the auth op exists so that callers
do not need a bunch of conditional checks.  Simplifies the external
interface.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:14 -07:00
Sage Weil
0bed9b5c52 libceph: add update_authorizer auth method
Currently the messenger calls out to a get_authorizer con op, which will
create a new authorizer if it doesn't yet have one.  In the meantime, when
we rotate our service keys, the authorizer doesn't get updated.  Eventually
it will be rejected by the server on a new connection attempt and get
invalidated, and we will then rebuild a new authorizer, but this is not
ideal.

Instead, if we do have an authorizer, call a new update_authorizer op that
will verify that the current authorizer is using the latest secret.  If it
is not, we will build a new one that does.  This avoids the transient
failure.

This fixes one of the sorry sequence of events for bug

	http://tracker.ceph.com/issues/4282

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:13 -07:00
Sage Weil
4b8e8b5d78 libceph: fix authorizer invalidation
We were invalidating the authorizer by removing the ticket handler
entirely.  This was effective in inducing us to request a new authorizer,
but in the meantime it mean that any authorizer we generated would get a
new and initialized handler with secret_id=0, which would always be
rejected by the server side with a confusing error message:

 auth: could not find secret_id=0
 cephx: verify_authorizer could not get service secret for service osd secret_id=0

Instead, simply clear the validity field.  This will still induce the auth
code to request a new secret, but will let us continue to use the old
ticket in the meantime.  The messenger code will probably continue to fail,
but the exponential backoff will kick in, and eventually the we will get a
new (hopefully more valid) ticket from the mon and be able to continue.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:12 -07:00
Sage Weil
20e55c4cc7 libceph: clear messenger auth_retry flag when we authenticate
We maintain a counter of failed auth attempts to allow us to retry once
before failing.  However, if the second attempt succeeds, the flag isn't
cleared, which makes us think auth failed again later when the connection
resets for other reasons (like a socket error).

This is one part of the sorry sequence of events in bug

	http://tracker.ceph.com/issues/4282

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:11 -07:00
Sage Weil
3a23083bda libceph: implement RECONNECT_SEQ feature
This is an old protocol extension that allows the client and server to
avoid resending old messages after a reconnect (following a socket error).
Instead, the exchange their sequence numbers during the handshake.  This
avoids sending a bunch of useless data over the socket.

It has been supported in the server code since v0.22 (Sep 2010).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:09 -07:00
Alex Elder
8a166d0536 libceph: more cleanup of write_partial_msg_pages()
Basically all cases in write_partial_msg_pages() use the cursor, and
as a result we can simplify that function quite a bit.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:06 -07:00
Alex Elder
9d2a06c275 libceph: kill message trail
The wart that is the ceph message trail can now be removed, because
its only user was the osd client, and the previous patch made that
no longer the case.

The result allows write_partial_msg_pages() to be simplified
considerably.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:05 -07:00
Alex Elder
95e072eb38 libceph: kill osd request r_trail
The osd trail is a pagelist, used only for a CALL osd operation
to hold the class and method names, along with any input data for
the call.

It is only currently used by the rbd client, and when it's used it
is the only bit of outbound data in the osd request.  Since we
already support (non-trail) pagelist data in a message, we can
just save this outbound CALL data in the "normal" pagelist rather
than the trail, and get rid of the trail entirely.

The existing pagelist support depends on the pagelist being
dynamically allocated, and ownership of it is passed to the
messenger once it's been attached to a message.  (That is to say,
the messenger releases and frees the pagelist when it's done with
it).  That means we need to dynamically allocate the pagelist also.

Note that we simply assert that the allocation of a pagelist
structure succeeds.  Appending to a pagelist might require a dynamic
allocation, so we're already assuming we won't run into trouble
doing so (we're just ignore any failures--and that should be fixed
at some point).

This resolves:
    http://tracker.ceph.com/issues/4407

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:04 -07:00
Alex Elder
9a5e6d09dd libceph: have osd requests support pagelist data
Add support for recording a ceph pagelist as data associated with an
osd request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:03 -07:00
Alex Elder
175face2ba libceph: let osd ops determine request data length
The length of outgoing data in an osd request is dependent on the
osd ops that are embedded in that request.  Each op is encoded into
a request message using osd_req_encode_op(), so that should be used
to determine the amount of outgoing data implied by the op as it
is encoded.

Have osd_req_encode_op() return the number of bytes of outgoing data
implied by the op being encoded, and accumulate and use that in
ceph_osdc_build_request().

As a result, ceph_osdc_build_request() no longer requires its "len"
parameter, so get rid of it.

Using the sum of the op lengths rather than the length provided is
a valid change because:
    - The only callers of osd ceph_osdc_build_request() are
      rbd and the osd client (in ceph_osdc_new_request() on
      behalf of the file system).
    - When rbd calls it, the length provided is only non-zero for
      write requests, and in that case the single op has the
      same length value as what was passed here.
    - When called from ceph_osdc_new_request(), (it's not all that
      easy to see, but) the length passed is also always the same
      as the extent length encoded in its (single) write op if
      present.

This resolves:
    http://tracker.ceph.com/issues/4406

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:02 -07:00
Alex Elder
e766d7b55e libceph: implement pages array cursor
Implement and use cursor routines for page array message data items
for outbound message data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:01 -07:00
Alex Elder
6aaa4511de libceph: implement bio message data item cursor
Implement and use cursor routines for bio message data items for
outbound message data.

(See the previous commit for reasoning in support of the changes
in out_msg_pos_next().)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:59 -07:00
Alex Elder
7fe1e5e57b libceph: use data cursor for message pagelist
Switch to using the message cursor for the (non-trail) outgoing
pagelist data item in a message if present.

Notes on the logic changes in out_msg_pos_next():
    - only the mds client uses a ceph pagelist for message data;
    - if the mds client ever uses a pagelist, it never uses a page
      array (or anything else, for that matter) for data in the same
      message;
    - only the osd client uses the trail portion of a message data,
      and when it does, it never uses any other data fields for
      outgoing data in the same message; and finally
    - only the rbd client uses bio message data (never pagelist).

Therefore out_msg_pos_next() can assume:
    - if we're in the trail portion of a message, the message data
      pagelist, data, and bio can be ignored; and
    - if there is a page list, there will never be any a bio or page
      array data, and vice-versa.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:58 -07:00
Alex Elder
dd236fcb65 libceph: prepare for other message data item types
This just inserts some infrastructure in preparation for handling
other types of ceph message data items.  No functional changes,
just trying to simplify review by separating out some noise.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:57 -07:00
Alex Elder
fe38a2b67b libceph: start defining message data cursor
This patch lays out the foundation for using generic routines to
manage processing items of message data.

For simplicity, we'll start with just the trail portion of a
message, because it stands alone and is only present for outgoing
data.

First some basic concepts.  We'll use the term "data item" to
represent one of the ceph_msg_data structures associated with a
message.  There are currently four of those, with single-letter
field names p, l, b, and t.  A data item is further broken into
"pieces" which always lie in a single page.  A data item will
include a "cursor" that will track state as the memory defined by
the item is consumed by sending data from or receiving data into it.

We define three routines to manipulate a data item's cursor: the
"init" routine; the "next" routine; and the "advance" routine.  The
"init" routine initializes the cursor so it points at the beginning
of the first piece in the item.  The "next" routine returns the
page, page offset, and length (limited by both the page and item
size) of the next unconsumed piece in the item.  It also indicates
to the caller whether the piece being returned is the last one in
the data item.

The "advance" routine consumes the requested number of bytes in the
item (advancing the cursor).  This is used to record the number of
bytes from the current piece that were actually sent or received by
the network code.  It returns an indication of whether the result
means the current piece has been fully consumed.  This is used by
the message send code to determine whether it should calculate the
CRC for the next piece processed.

The trail of a message is implemented as a ceph pagelist.  The
routines defined for it will be usable for non-trail pagelist data
as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:56 -07:00
Alex Elder
437945094f libceph: abstract message data
Group the types of message data into an abstract structure with a
type indicator and a union containing fields appropriate to the
type of data it represents.  Use this to represent the pages,
pagelist, bio, and trail in a ceph message.

Verify message data is of type NONE in ceph_msg_data_set_*()
routines.  Since information about message data of type NONE really
should not be interpreted, get rid of the other assertions in those
functions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:55 -07:00
Alex Elder
f9e15777af libceph: be explicit about message data representation
A ceph message has a data payload portion.  The memory for that data
(either the source of data to send or the location to place data
that is received) is specified in several ways.  The ceph_msg
structure includes fields for all of those ways, but this
mispresents the fact that not all of them are used at a time.

Specifically, the data in a message can be in:
    - an array of pages
    - a list of pages
    - a list of Linux bios
    - a second list of pages (the "trail")
(The two page lists are currently only ever used for outgoing data.)

Impose more structure on the ceph message, making the grouping of
some of these fields explicit.  Shorten the name of the
"page_alignment" field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:54 -07:00
Alex Elder
97fb1c7f66 libceph: define ceph_msg_has_*() data macros
Define and use macros ceph_msg_has_*() to determine whether to
operate on the pages, pagelist, bio, and trail fields of a message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:53 -07:00
Alex Elder
35b6280899 libceph: define and use ceph_crc32c_page()
Factor out a common block of code that updates a CRC calculation
over a range of data in a page.

This and the preceding patches are related to:
    http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:52 -07:00
Alex Elder
afb3d90e20 libceph: define and use ceph_tcp_recvpage()
Define a new function ceph_tcp_recvpage() that behaves in a way
comparable to ceph_tcp_sendpage().

Rearrange the code in both read_partial_message_pages() and
read_partial_message_bio() so they have matching structure,
(similar to what's in write_partial_msg_pages()), and use
this new function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:51 -07:00
Alex Elder
34d2d2006c libceph: encapsulate reading message data
Pull the code that reads the data portion into a message into
a separate function read_partial_msg_data().

Rename write_partial_msg_pages() to be write_partial_message_data()
to match its read counterpart, and to reflect its more generic
purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:50 -07:00
Alex Elder
e387d525b0 libceph: small write_partial_msg_pages() refactor
Define local variables page_offset and length to represent the range
of bytes within a page that will be sent by ceph_tcp_sendpage() in
write_partial_msg_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:48 -07:00
Alex Elder
78625051b5 libceph: consolidate message prep code
In prepare_write_message_data(), various fields are initialized in
preparation for writing message data out.  Meanwhile, in
read_partial_message(), there is essentially the same block of code,
operating on message variables associated with an incoming message.

Generalize prepare_write_message_data() so it works for both
incoming and outcoming messages, and use it in both spots.  The
did_page_crc is not used for input (so it's harmless to initialize
it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:47 -07:00
Alex Elder
bae6acd9c6 libceph: use local variables for message positions
There are several places where a message's out_msg_pos or in_msg_pos
field is used repeatedly within a function.  Use a local pointer
variable for this purpose to unclutter the code.

This and the upcoming cleanup patches are related to:
    http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:46 -07:00
Alex Elder
98a0370898 libceph: don't clear bio_iter in prepare_write_message()
At one time it was necessary to clear a message's bio_iter field to
avoid a bad pointer dereference in write_partial_msg_pages().

That no longer seems to be the case.  Here's why.

The message's bio fields represent (in this case) outgoing data.
Between where the bio_iter is made NULL in prepare_write_message()
and the call in that function to prepare_message_data(), the
bio fields are never used.

In prepare_message_data(), init-bio_iter() is called, and the result
of that overwrites the value in the message's bio_iter field.

Because it gets overwritten anyway, there is no need to set it to
NULL.  So don't do it.

This resolves:
    http://tracker.ceph.com/issues/4402

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:45 -07:00
Alex Elder
07aa155878 libceph: activate message data assignment checks
The mds client no longer tries to assign zero-length message data,
and the osd client no longer sets its data info more than once.
This allows us to activate assertions in the messenger to verify
these things never happen.

This resolves both of these:
    http://tracker.ceph.com/issues/4263
    http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:44 -07:00
Alex Elder
70636773b7 libceph: set response data fields earlier
When an incoming message is destined for the osd client, the
messenger calls the osd client's alloc_msg method.  That function
looks up which request has the tid matching the incoming message,
and returns the request message that was preallocated to receive the
response.  The response message is therefore known before the
request is even started.

Between the start of the request and the receipt of the response,
the request and its data fields will not change, so there's no
reason we need to hold off setting them.  In fact it's preferable
to set them just once because it's more obvious that they're
unchanging.

So set up the fields describing where incoming data is to land in a
response message at the beginning of ceph_osdc_start_request().
Define a helper function that sets these fields, and use it to
set the fields for both outgoing data in the request message and
incoming data in the response.

This resolves:
    http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:43 -07:00
Alex Elder
4a73ef27ad libceph: record message data byte length
Record the number of bytes of data in a page array rather than the
number of pages in the array.  It can be assumed that the page array
is of sufficient size to hold the number of bytes indicated (and
offset by the indicated alignment).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:42 -07:00
Alex Elder
ebf18f4709 ceph: only set message data pointers if non-empty
Change it so we only assign outgoing data information for messages
if there is outgoing data to send.

This then allows us to add a few more (currently commented-out)
assertions.

This is related to:
    http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:41 -07:00
Alex Elder
27fa83852b libceph: isolate other message data fields
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure.  Use the
new functions in the osd client and mds client.

This partially resolves:
    http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:40 -07:00
Alex Elder
f1baeb2b9f libceph: set page info with byte length
When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:39 -07:00
Alex Elder
02afca6ca0 libceph: isolate message page field manipulation
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure.  Use this new function in the osd client and mds
client.

Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that).  At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once.  (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)

Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.

This partially resolves:
    http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:38 -07:00
Alex Elder
e0c594878e libceph: record byte count not page count
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:36 -07:00
Alex Elder
9516e45b25 libceph: simplify new message initialization
Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages.  This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:35 -07:00
Alex Elder
35c7bfbcd4 libceph: advance pagelist with list_rotate_left()
While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list.  This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.

There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.

So replace these two list_move_tail() calls with list_rotate_left().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:34 -07:00
Alex Elder
e788182fa6 libceph: define and use in_msg_pos_next()
Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().

Note that the page number is incremented and offset reset under
slightly different conditions from before.  The result is
equivalent, however, as explained below.

Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive.  So the amount we'll use is the
lesser of:  all that's left of the entire request; and all that's
left in the current page.

If we received exactly how many were requested, we either reached
the end of the request or the end of the page.  In the first case,
we're done, in the second, we move onto the next page in the array.

In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE.  On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again.  The code previously skipped over that last case,
basically.  The new code handles that case the same as the others,
incrementing and resetting.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:33 -07:00
Alex Elder
b3d56fab33 libceph: kill args in read_partial_message_bio()
There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments.  Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.

So drop those two parameters and use their derived equivalents.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:32 -07:00
Alex Elder
e1dcb128f8 libceph: change type of ceph_tcp_sendpage() "more"
Change the type of the "more" parameter from int to bool.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:31 -07:00
Alex Elder
6ebc8b32b3 libceph: minor byte order problems in read_partial_message()
Some values printed are not (necessarily) in CPU order.  We already
have a copy of the converted versions, so use them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:30 -07:00
Alex Elder
7b11ba3758 libceph: define CEPH_MSG_MAX_MIDDLE_LEN
This is probably unnecessary but the code read as if it were wrong
in read_partial_message().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:29 -07:00
Alex Elder
4137577ae3 libceph: clean up skipped message logic
In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().

The osd client, mon client, and mds client each supply an alloc_msg
method.  The mds client always assigns skip to be 0.

The other two leave the skip value of as-is, or assigns it to zero,
except:
    - if no (osd or mon) request having the given tid is found, in
      which case skip is set to 1 and NULL is returned; or
    - in the osd client, if the data of the reply message is not
      adequate to hold the message to be read, it assigns skip
      value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.

Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious.  Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.

This resolves:
    http://tracker.ceph.com/issues/4324

Reported-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:28 -07:00
Alex Elder
0fff87ec79 libceph: separate read and write data
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.

Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.

This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data.  It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.

This resolves:
    http://tracker.ceph.com/issues/4127

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:27 -07:00
Alex Elder
2ac2b7a6d4 libceph: distinguish page and bio requests
An osd request uses either pages or a bio list for its data.  Use a
union to record information about the two, and add a data type
tag to select between them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:25 -07:00
Alex Elder
2794a82a11 libceph: separate osd request data info
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:24 -07:00
Alex Elder
153e5167e0 libceph: don't assign page info in ceph_osdc_new_request()
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields.  The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.

Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller.  As a result, the page_align parameter is no
longer used, so get rid of it.

Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way).  So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request().  Hold off making the assignment to
r_alignment, doing it instead r_pages and r_num_pages are
getting set.

Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it.  Move the assignment of the page alignment
down with the others there as well.

This and the next few patches are preparation work for:
    http://tracker.ceph.com/issues/4127

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:23 -07:00
Alex Elder
53ded495c6 libceph: define mds_alloc_msg() method
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client.  Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.

This and the next patch resolve:
    http://tracker.ceph.com/issues/4322

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:19 -07:00
Alex Elder
1d866d1c31 libceph: drop mutex while allocating a message
In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a
connection a new message is allocated with ceph_msg_new().

Drop the mutex before making this call, and make sure we're still
connected when we get it back again.

This is preparing for the next patch, which ensures all connections
define an alloc_msg method, and then handles them all the same way.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:18 -07:00
Alex Elder
41766f87f5 libceph: rename ceph_calc_object_layout()
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.

Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.

Change the function so it takes a pool number rather than the full
file layout structure.  Only update the ceph_pg if the pool is found
in the osd map.  Get rid of few useless lines of code from the
function while there.

Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:17 -07:00
Alex Elder
ec02a2f2ff libceph: kill ceph_msg->pagelist_count
The pagelist_count field is never actually used, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:16 -07:00
Alex Elder
8f63ca2d23 libceph: fix wrong opcode use in osd_req_encode_op()
The new cases added to osd_req_encode_op() caused a new sparse
error, which highlighted an existing problem that had been
overlooked since it was originally checked in.  When an unsupported
opcode is found the destination rather than the source opcode was
being used in the error message.  The two differ in their byte
order, and we want to be using the one in the source.

Fix the problem in both spots.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:13 -07:00
Alex Elder
0d5af16435 libceph: complete lingering requests only once
An osd request marked to linger will be re-submitted in the event
a connection to the target osd gets dropped.  Currently, if there
is a callback function associated with a request it will be called
each time a request is submitted--which for lingering requests can
be more than once.

Change it so a request--including lingering ones--will get completed
(from the perspective of the user of the osd client) exactly once.

This resolves:
    http://tracker.ceph.com/issues/3967

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:12 -07:00
Alex Elder
f51a822c31 libceph: set page alignment in start_request()
The page alignment field for a request is currently set in
ceph_osdc_build_request().  It's not needed at that point
nor do either of its callers need that value assigned at
any point before they call ceph_osdc_start_request().

So move that assignment into ceph_osdc_start_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:29 -07:00
Alex Elder
d4b515fa10 libceph: distinguish page array and pagelist count
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list.  Currently only one or the
other is used at a time, but that will be changing soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:28 -07:00
Alex Elder
60cf5992d9 libceph: don't pass request to calc_layout()
The only remaining reason to pass the osd request to calc_layout()
is to fill in its r_num_pages and r_page_alignment fields.  Once it
fills those in, it doesn't do anything more with them.

We can therefore move those assignments into the caller, and get rid
of the "req" parameter entirely.

Note, however, that the only caller is ceph_osdc_new_request(),
and that immediately overwrites those fields with values based on
its passed-in page offset.  So the assignment inside calc_layout()
was redundant anyway.

This resolves:
    http://tracker.ceph.com/issues/4262

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:27 -07:00
Alex Elder
dbe0fc4188 libceph: format target object name in caller
Move the formatting of the object name (oid) to use for an object
request into the caller of calc_layout().  This makes the "vino"
parameter no longer necessary, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:26 -07:00
Alex Elder
47a05811b6 libceph: pass object number back to calc_layout() caller
Have calc_layout() pass the computed object number back to its
caller.  (This is a small step to simplify review.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:25 -07:00
Alex Elder
07c09b7255 libceph: make ceph_msg->bio_seg be unsigned
The bio_seg field is used by the ceph messenger in iterating through
a bio.  It should never have a negative value, so make it an
unsigned.  (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)

Change variables used to hold bio_seg values to all be unsigned as
well.  Change two variable names in init_bio_iter() to match the
convention used everywhere else.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:23 -07:00
Alex Elder
3ff5f385b1 libceph: fix a osd request memory leak
If an invalid layout is provided to ceph_osdc_new_request(), its
call to calc_layout() might return an error.  At that point in the
function we've already allocated an osd request structure, so we
need to free it (drop a reference) in the event such an error
occurs.

The only other value calc_layout() will return is 0, so make that
explicit in the successful case.

This resolves:
    http://tracker.ceph.com/issues/4240

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:22 -07:00
Linus Torvalds
368edaadc0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This fixes a bug in the new message decoding that just went in during
  the last window."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: fix decoding of pgids
2013-03-12 09:22:42 -07:00
Sage Weil
d6c0dd6b0c libceph: fix decoding of pgids
In 4f6a7e5ee1 we effectively dropped support
for the legacy encoding for the OSDMap and incremental.  However, we didn't
fix the decoding for the pgid.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2013-03-11 14:31:00 -07:00
Linus Torvalds
1cf0209c43 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "A few groups of patches here.  Alex has been hard at work improving
  the RBD code, layout groundwork for understanding the new formats and
  doing layering.  Most of the infrastructure is now in place for the
  final bits that will come with the next window.

  There are a few changes to the data layout.  Jim Schutt's patch fixes
  some non-ideal CRUSH behavior, and a set of patches from me updates
  the client to speak a newer version of the protocol and implement an
  improved hashing strategy across storage nodes (when the server side
  supports it too).

  A pair of patches from Sam Lang fix the atomicity of open+create
  operations.  Several patches from Yan, Zheng fix various mds/client
  issues that turned up during multi-mds torture tests.

  A final set of patches expose file layouts via virtual xattrs, and
  allow the policies to be set on directories via xattrs as well
  (avoiding the awkward ioctl interface and providing a consistent
  interface for both kernel mount and ceph-fuse users)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
  libceph: add support for HASHPSPOOL pool flag
  libceph: update osd request/reply encoding
  libceph: calculate placement based on the internal data types
  ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
  ceph: update "ceph_features.h"
  libceph: decode into cpu-native ceph_pg type
  libceph: rename ceph_pg -> ceph_pg_v1
  rbd: pass length, not op for osd completions
  rbd: move rbd_osd_trivial_callback()
  libceph: use a do..while loop in con_work()
  libceph: use a flag to indicate a fault has occurred
  libceph: separate non-locked fault handling
  libceph: encapsulate connection backoff
  libceph: eliminate sparse warnings
  ceph: eliminate sparse warnings in fs code
  rbd: eliminate sparse warnings
  libceph: define connection flag helpers
  rbd: normalize dout() calls
  rbd: barriers are hard
  rbd: ignore zero-length requests
  ...
2013-02-28 17:43:09 -08:00
Sage Weil
83ca14fdd3 libceph: add support for HASHPSPOOL pool flag
The legacy behavior adds the pgid seed and pool together as the input for
CRUSH.  That is problematic because each pool's PGs end up mapping to the
same OSDs: 1.5 == 2.4 == 3.3 == ...

Instead, if the HASHPSPOOL flag is set, we has the ps and pool together and
feed that into CRUSH.  This ensures that two adjacent pools will map to
an independent pseudorandom set of OSDs.

Advertise our support for this via a protocol feature flag.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:03:06 -08:00
Sage Weil
1b83bef24c libceph: update osd request/reply encoding
Use the new version of the encoding for osd requests and replies.  In the
process, update the way we are tracking request ops and reply lengths and
results in the struct ceph_osd_request.  Update the rbd and fs/ceph users
appropriately.

The main changes are:
 - we keep pointers into the request memory for fields we need to update
   each time the request is sent out over the wire
 - we keep information about the result in an array in the request struct
   where the users can easily get at it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:02:50 -08:00
Sage Weil
2169aea649 libceph: calculate placement based on the internal data types
Instead of using the old ceph_object_layout struct, update our internal
ceph_calc_object_layout method to use the ceph_pg type.  This allows us to
pass the full 32-bit precision of the pgid.seed to the callers.  It also
allows some callers to avoid reaching into the request structures for the
struct ceph_object_layout fields.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:02:37 -08:00
Sage Weil
4f6a7e5ee1 ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
Support (and require) the PGID64, PGPOOL3, and OSDENC protocol features.
These have been present in ceph.git since v0.42, Feb 2012.  Require these
features to simplify support; nobody is running older userspace.

Note that the new request and reply encoding is still not in place, so the new
code is not yet functional.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:02:25 -08:00
Sage Weil
5b191d9914 libceph: decode into cpu-native ceph_pg type
Always decode data into our cpu-native ceph_pg type that has the correct
field widths.  Limit any remaining uses of ceph_pg_v1 to dealing with the
legacy protocol.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:01:57 -08:00
Sage Weil
12979354a1 libceph: rename ceph_pg -> ceph_pg_v1
Rename the old version this type to distinguish it from the new version.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:01:41 -08:00
Linus Torvalds
94f2f14234 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
 "This set of changes starts with a few small enhnacements to the user
  namespace.  reboot support, allowing more arbitrary mappings, and
  support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
  user namespace root.

  I do my best to document that if you care about limiting your
  unprivileged users that when you have the user namespace support
  enabled you will need to enable memory control groups.

  There is a minor bug fix to prevent overflowing the stack if someone
  creates way too many user namespaces.

  The bulk of the changes are a continuation of the kuid/kgid push down
  work through the filesystems.  These changes make using uids and gids
  typesafe which ensures that these filesystems are safe to use when
  multiple user namespaces are in use.  The filesystems converted for
  3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs.  The
  changes for these filesystems were a little more involved so I split
  the changes into smaller hopefully obviously correct changes.

  XFS is the only filesystem that remains.  I was hoping I could get
  that in this release so that user namespace support would be enabled
  with an allyesconfig or an allmodconfig but it looks like the xfs
  changes need another couple of days before it they are ready."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
  cifs: Enable building with user namespaces enabled.
  cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
  cifs: Convert struct cifs_sb_info to use kuids and kgids
  cifs: Modify struct smb_vol to use kuids and kgids
  cifs: Convert struct cifsFileInfo to use a kuid
  cifs: Convert struct cifs_fattr to use kuid and kgids
  cifs: Convert struct tcon_link to use a kuid.
  cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
  cifs: Convert from a kuid before printing current_fsuid
  cifs: Use kuids and kgids SID to uid/gid mapping
  cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
  cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
  cifs: Override unmappable incoming uids and gids
  nfsd: Enable building with user namespaces enabled.
  nfsd: Properly compare and initialize kuids and kgids
  nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
  nfsd: Modify nfsd4_cb_sec to use kuids and kgids
  nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
  nfsd: Convert nfsxdr to use kuids and kgids
  nfsd: Convert nfs3xdr to use kuids and kgids
  ...
2013-02-25 16:00:49 -08:00
Alex Elder
49659416ba libceph: use a do..while loop in con_work()
This just converts a manually-implemented loop into a do..while loop
in con_work().  It also moves handling of EAGAIN inside the blocks
where it's already been determined an error code was returned.

Also update a few dout() calls near the affected code for
consistency.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:37 -06:00
Alex Elder
b6e7b6a119 libceph: use a flag to indicate a fault has occurred
This just rearranges the logic in con_work() a little bit so that a
flag is used to indicate a fault has occurred.  This allows both the
fault and non-fault case to be handled the same way and avoids a
couple of nearly consecutive gotos.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:32 -06:00
Alex Elder
9320926420 libceph: separate non-locked fault handling
An error occurring on a ceph connection is treated as a fault,
causing the connection to be reset.  The initial part of this fault
handling has to be done while holding the connection mutex, but
it must then be dropped for the last part.

Separate the part of this fault handling that executes without the
lock into its own function, con_fault_finish().  Move the call to
this new function, as well as call that drops the connection mutex,
into ceph_fault().  Rename that function con_fault() to reflect that
it's only handling the connection part of the fault handling.

The motivation for this was a warning from sparse about the locking
being done here.  Rearranging things this way keeps all the mutex
manipulation within ceph_fault(), and this stops sparse from
complaining.

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:28 -06:00
Alex Elder
f20a39fd6e libceph: encapsulate connection backoff
Collect the code that tests for and implements a backoff delay for a
ceph connection into a new function, ceph_backoff().

Make the debug output messages in that part of the code report
things consistently by reporting a message in the socket closed
case, and by making the one for PREOPEN state report the connection
pointer like the rest.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:24 -06:00
Alex Elder
1541716789 libceph: eliminate sparse warnings
Eliminate most of the problems in the libceph code that cause sparse
to issue warnings.
    - Convert functions that are never referenced externally to have
      static scope.
    - Pass NULL rather than 0 for a pointer argument in one spot in
      ceph_monc_delete_snapid()

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:18 -06:00
Alex Elder
c9ffc77ade libceph: define connection flag helpers
Define and use functions that encapsulate operations performed on
a connection's flags.

This resolves:
    http://tracker.ceph.com/issues/4234

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:04 -06:00
Alex Elder
903bb32e89 libceph: drop return value from page vector copy routines
The return values provided for ceph_copy_to_page_vector() and
ceph_copy_from_page_vector() serve no purpose, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:05 -06:00
Alex Elder
b324814e84 libceph: use void pointers in page vector functions
The functions used for working with ceph page vectors are defined
with char pointers, but they're really intended to operate on
untyped data.  Change the types of these function parameters
to (void *) to reflect this.

(Note that the functions now assume void pointer arithmetic works
like arithmetic on char pointers.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:04 -06:00
Alex Elder
fbfab53966 libceph: allow STAT osd operations
Add support for CEPH_OSD_OP_STAT operations in the osd client
and in rbd.

This operation sends no data to the osd; everything required is
encoded in identity of the target object.

The result will be ENOENT if the object doesn't exist.  If it does
exist and no other error occurs the server returns the size and last
modification time of the target object as output data (in little
endian format).  The size is a 64 bit unsigned and the time is
ceph_timespec structure (two unsigned 32-bit integers, representing
a seconds and nanoseconds value).

This resolves:
    http://tracker.ceph.com/issues/4007

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:03 -06:00
Alex Elder
f44246e394 libceph: simplify data length calculation
Simplify the way the data length recorded in a message header is
calculated in ceph_osdc_build_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:02 -06:00
Alex Elder
a9f36c3ed4 libceph: remove dead code in osd_req_encode_op()
In osd_req_encode_op() there are a few cases that handle osd
opcodes that are never used in the kernel.  The presence of
this code gives the impression it's correct (which really can't
be assumed), and may impose some unnecessary restrictions on
some upcoming refactoring of this code.

So delete this effectively dead code, and report uses of the
previously handled cases as unsupported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:26 -06:00
Alex Elder
4c46459cae libceph: report defined but unsupported osd ops
If osd_req_encode_op() is given any opcode it doesn't recognize
it reports an error.

This patch fleshes out that routine to distinguish between
well-defined but unsupported values and values that are simply
bogus.

This and the next commit are related to:
    http://tracker.ceph.com/issues/4126

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:22 -06:00
Alex Elder
2979ddb11b libceph: update ceph_osd_op_name()
Update ceph_osd_op_name() to include the newly-added definitions in
"rados.h", and to match its counterpart in the user space code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:18 -06:00
Alex Elder
4b568b1aaf libceph: add ceph_osd_state_name()
Add the definition of ceph_osd_state_name(), to match its
counterpart in user space.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:13 -06:00
Alex Elder
2d2f522699 libceph: kill ceph_osdc_wait_event()
There are no actual users of ceph_osdc_wait_event().  This would
have been one-shot events, but we no longer support those so just
get rid of this function.

Since this leaves nothing else that waits for the completion of an
event, we can get rid of the completion in a struct ceph_osd_event.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:05 -06:00
Alex Elder
3c663bbdcd libceph: kill ceph_osdc_create_event() "one_shot" parameter
There is only one caller of ceph_osdc_create_event(), and it
provides 0 as its "one_shot" argument.  Get rid of that argument and
just use 0 in its place.

Replace the code in handle_watch_notify() that executes if one_shot
is nonzero in the event with a BUG_ON() call.

While modifying "osd_client.c", give handle_watch_notify() static
scope.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:00 -06:00
Alex Elder
60e56f1381 libceph: kill ceph_calc_raw_layout()
There is no caller of ceph_calc_raw_layout() outside of libceph, so
there's no need to export from the module.

Furthermore, there is only one caller, in calc_layout(), and it
is not much more than a simple wrapper for that function.

So get rid of ceph_calc_raw_layout() and embed it instead within
calc_layout().

While touching "osd_client.c", get rid of the unnecessary forward
declaration of __send_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:54 -06:00
Alex Elder
60789380ae libdeph: don't export ceph_osdc_init() or ceph_osdc_stop()
The only callers of ceph_osdc_init() and ceph_osdc_stop()
ceph_create_client() and ceph_destroy_client() (respectively)
and they are in the same kernel module as those two functions.
There's therefore no need to export those interfaces, so don't.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:48 -06:00
Alex Elder
f9d2519945 libceph: lock outside send_queued()
Two of the three callers of the osd client's send_queued() function
already hold the osd client mutex and drop it before the call.

Change send_queued() so it assumes the caller holds the mutex, and
update all callers accordingly.  Rename it __send_queued() to match
the convention used elsewhere in the file with respect to the lock.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:44 -06:00
Alex Elder
a3bea47e8b ceph: kill ceph_osdc_new_request() "num_reply" parameter
The "num_reply" parameter to ceph_osdc_new_request() is never
used inside that function, so get rid of it.

Note that ceph_sync_write() passes 2 for that argument, while all
other callers pass 1.  It doesn't matter, but perhaps someone should
verify this doesn't indicate a problem.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:39 -06:00
Alex Elder
2480882611 ceph: kill ceph_osdc_writepages() "flags" parameter
There is only one caller of ceph_osdc_writepages(), and it always
passes 0 as its "flags" argument.  Get rid of that argument and
replace its use in ceph_osdc_writepages() with 0.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:35 -06:00
Alex Elder
fbf8685fb1 ceph: kill ceph_osdc_writepages() "dosync" parameter
There is only one caller of ceph_osdc_writepages(), and it always
passes 0 as its "dosync" argument.  Get rid of that argument and
replace its use in ceph_osdc_writepages() with 0.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:28 -06:00
Alex Elder
87f979d390 ceph: kill ceph_osdc_writepages() "nofail" parameter
There is only one caller of ceph_osdc_writepages(), and it always
passes the value true as its "nofail" argument.  Get rid of that
argument and replace its use in ceph_osdc_writepages() with the
constant value true.

This and a number of cleanup patches that follow resolve:
    http://tracker.ceph.com/issues/4126

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:22 -06:00
Alex Elder
9cbb1d7268 libceph: don't require r_num_pages for bio requests
There is a check in the completion path for osd requests that
ensures the number of pages allocated is enough to hold the amount
of incoming data expected.

For bio requests coming from rbd the "number of pages" is not really
meaningful (although total length would be).  So stop requiring that
nr_pages be supplied for bio requests.  This is done by checking
whether the pages pointer is null before checking the value of
nr_pages.

Note that this value is passed on to the messenger, but there it's
only used for debugging--it's never used for validation.

While here, change another spot that used r_pages in a debug message
inappropriately, and also invalidate the r_con_filling_msg pointer
after dropping a reference to it.

This resolves:
    http://tracker.ceph.com/issues/3875

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:11 -08:00
Alex Elder
1e32d34cfa rbd: don't take extra bio reference for osd client
Currently, if the OSD client finds an osd request has had a bio list
attached to it, it drops a reference to it (or rather, to the first
entry on that list) when the request is released.

The code that added that reference (i.e., the rbd client) is
therefore required to take an extra reference to that first bio
structure.

The osd client doesn't really do anything with the bio pointer other
than transfer it from the osd request structure to outgoing (for
writes) and ingoing (for reads) messages.  So it really isn't the
right place to be taking or dropping references.

Furthermore, the rbd client already holds references to all bio
structures it passes to the osd client, and holds them until the
request is completed.  So there's no need for this extra reference
whatsoever.

So remove the bio_put() call in ceph_osdc_release_request(), as
well as its matching bio_get() call in rbd_osd_req_create().

This change could lead to a crash if old libceph.ko was used with
new rbd.ko.  Add a compatibility check at rbd initialization time to
avoid this possibilty.

This resolves:
    http://tracker.ceph.com/issues/3798    and
    http://tracker.ceph.com/issues/3799

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:11 -08:00
Alex Elder
72fe25e346 libceph: add a compatibility check interface
An upcoming change implements semantic change that could lead to
a crash if an old version of the libceph kernel module is used with
a new version of the rbd kernel module.

In order to preclude that possibility, this adds a compatibilty
check interface.  If this interface doesn't exist, the modules are
obviously not compatible.  But if it does exist, this provides a way
of letting the caller know whether it will operate properly with
this libceph module.

Perhaps confusingly, it returns false right now.  The semantic
change mentioned above will make it return true.

This resolves:
    http://tracker.ceph.com/issues/3800

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:11 -08:00
Alex Elder
3ebc21f7bc libceph: fix messenger CONFIG_BLOCK dependencies
The ceph messenger has a few spots that are only used when
bio messages are supported, and that's only when CONFIG_BLOCK
is defined.  This surrounds a couple of spots with #ifdef's
that would cause a problem if CONFIG_BLOCK were not present
in the kernel configuration.

This resolves:
    http://tracker.ceph.com/issues/3976

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:27:40 -08:00
Eric W. Biederman
eea553c21f ceph: Only allow mounts in the initial network namespace
Today ceph opens tcp sockets from a delayed work callback.  Delayed
work happens from kernel threads which are always in the initial
network namespace.   Therefore fail early if someone attempts
to mount a ceph filesystem from something other than the initial
network namespace.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-12 03:19:23 -08:00
Alex Elder
969e5aa3b0 Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing 2013-01-30 07:54:34 -06:00
Cong Ding
1ec3911dbd libceph: fix undefined behavior when using snprintf()
The variable "str" is used as both the source and destination in
function snprintf(), which is undefined behavior based on C11. The
original description in C11 is:
	"If copying takes place between objects that
	overlap, the behavior is undefined."

And, the function of ceph_osdmap_state_str() is to return the osdmap
state, so it should return "doesn't exist" when all the conditions
are not satisfied. I fix it in this patch.

[elder@inktank.com: shortened the commit message]

Signed-off-by: Cong Ding <dinggnu@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-01-25 17:48:59 -06:00
Alex Elder
ae7ca4a35b libceph: pass num_op with ops
Both ceph_osdc_alloc_request() and ceph_osdc_build_request() are
provided an array of ceph osd request operations.  Rather than just
passing the number of operations in the array, the caller is
required append an additional zeroed operation structure to signal
the end of the array.

All callers know the number of operations at the time these
functions are called, so drop the silly zero entry and supply that
number directly.  As a result, get_num_ops() is no longer needed.
This also means that ceph_osdc_alloc_request() never uses its ops
argument, so that can be dropped.

Also rbd_create_rw_ops() no longer needs to add one to reserve room
for the additional op.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:57 -06:00
Alex Elder
54a5400721 libceph: don't set pages or bio in ceph_osdc_alloc_request()
Only one of the two callers of ceph_osdc_alloc_request() provides
page or bio data for its payload.  And essentially all that function
was doing with those arguments was assigning them to fields in the
osd request structure.

Simplify ceph_osdc_alloc_request() by having the caller take care of
making those assignments

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:57 -06:00
Alex Elder
d178a9e740 libceph: don't set flags in ceph_osdc_alloc_request()
The only thing ceph_osdc_alloc_request() really does with the
flags value it is passed is assign it to the newly-created
osd request structure.  Do that in the caller instead.

Both callers subsequently call ceph_osdc_build_request(), so have
that function (instead of ceph_osdc_alloc_request()) issue a warning
if a request comes through with neither the read nor write flags set.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:05 -06:00
Alex Elder
e75b45cf36 libceph: drop osdc from ceph_calc_raw_layout()
The osdc parameter to ceph_calc_raw_layout() is not used, so get rid
of it.  Consequently, the corresponding parameter in calc_layout()
becomes unused, so get rid of that as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:05 -06:00
Alex Elder
4d6b250bf1 libceph: drop snapid in ceph_calc_raw_layout()
A snapshot id must be provided to ceph_calc_raw_layout() even though
it is not needed at all for calculating the layout.

Where the snapshot id *is* needed is when building the request
message for an osd operation.

Drop the snapid parameter from ceph_calc_raw_layout() and pass
that value instead in ceph_osdc_build_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:05 -06:00
Alex Elder
e8afad656c libceph: pass length to ceph_calc_file_object_mapping()
ceph_calc_file_object_mapping() takes (among other things) a "file"
offset and length, and based on the layout, determines the object
number ("bno") backing the affected portion of the file's data and
the offset into that object where the desired range begins.  It also
computes the size that should be used for the request--either the
amount requested or something less if that would exceed the end of
the object.

This patch changes the input length parameter in this function so it
is used only for input.  That is, the argument will be passed by
value rather than by address, so the value provided won't get
updated by the function.

The value would only get updated if the length would surpass the
current object, and in that case the value it got updated to would
be exactly that returned in *oxlen.

Only one of the two callers is affected by this change.  Update
ceph_calc_raw_layout() so it records any updated value.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:04 -06:00
Alex Elder
0120be3c60 libceph: pass length to ceph_osdc_build_request()
The len argument to ceph_osdc_build_request() is set up to be
passed by address, but that function never updates its value
so there's no need to do this.  Tighten up the interface by
passing the length directly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:04 -06:00
Alex Elder
5b9d1b1cd4 libceph: kill op_needs_trail()
Since every osd message is now prepared to include trailing data,
there's no need to check ahead of time whether any operations will
make use of the trail portion of the message.

We can drop the second argument to get_num_ops(), and as a result we
can also get rid of op_needs_trail() which is no longer used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:03 -06:00
Alex Elder
c885837f7d libceph: always allow trail in osd request
An osd request structure contains an optional trail portion, which
if present will contain data to be passed in the payload portion of
the message containing the request.  The trail field is a
ceph_pagelist pointer, and if null it indicates there is no trail.

A ceph_pagelist structure contains a length field, and it can
legitimately hold value 0.  Make use of this to change the
interpretation of the "trail" of an osd request so that every osd
request has trailing data, it just might have length 0.

This means we change the r_trail field in a ceph_osd_request
structure from a pointer to a structure that is always initialized.

Note that in ceph_osdc_start_request(), the trail pointer (or now
address of that structure) is assigned to a ceph message's trail
field.  Here's why that's still OK (looking at net/ceph/messenger.c):
    - What would have resulted in a null pointer previously will now
      refer to a 0-length page list.  That message trail pointer
      is used in two functions, write_partial_msg_pages() and
      out_msg_pos_next().
    - In write_partial_msg_pages(), a null page list pointer is
      handled the same as a message with 0-length trail, and both
      result in a "in_trail" variable set to false.  The trail
      pointer is only used if in_trail is true.
    - The only other place the message trail pointer is used is
      out_msg_pos_next().  That function is only called by
      write_partial_msg_pages() and only touches the trail pointer
      if the in_trail value it is passed is true.
Therefore a null ceph_msg->trail pointer is equivalent to a non-null
pointer referring to a 0-length page list structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:03 -06:00
Alex Elder
af77f26caa rbd: drop oid parameters from ceph_osdc_build_request()
The last two parameters to ceph_osd_build_request() describe the
object id, but the values passed always come from the osd request
structure whose address is also provided.  Get rid of those last
two parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:01 -06:00
Alex Elder
c3acb18196 libceph: reformat __reset_osd()
Reformat __reset_osd() into three distinct blocks of code
handling the three return cases.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:07:44 -06:00
Sage Weil
7d7c1f6136 crush: avoid recursion if we have already collided
This saves us some cycles, but does not affect the placement result at
all.

This corresponds to ceph.git commit 4abb53d4f.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:39 -06:00
Jim Schutt
1604f488ac libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Add libceph support for a new CRUSH tunable recently added to Ceph servers.

Consider the CRUSH rule
  step chooseleaf firstn 0 type <node_type>

This rule means that <n> replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of <node_type>.

When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the <node_type> bucket
that held the failed leaf.  This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular <node_type> bucket, that remaining leaf holds all the data from
its failed peers.

This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.

If the tree descent down to <node_type> is the outer descent, and the descent
from <node_type> down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.

In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf.  Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary.  This
requires that OSD's data for that PG to need moving as well.  This
seems unavoidable but should be relatively rare.

This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:39 -06:00
Yan, Zheng
a41bad1a9b ceph: re-calculate truncate_size for strip object
Otherwise osd may truncate the object to larger size.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:37 -06:00
Kees Cook
11e9ec6e61 net/ceph: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Sage Weil <sage@inktank.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Sage Weil <sage@inktank.com>
2013-01-11 11:39:33 -08:00
Linus Torvalds
58890c0669 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "Two of Alex's patches deal with a race when reseting server
  connections for open RBD images, one demotes some non-fatal BUGs to
  WARNs, and my patch fixes a protocol feature bit failure path."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: fix protocol feature mismatch failure path
  libceph: WARN, don't BUG on unexpected connection states
  libceph: always reset osds when kicking
  libceph: move linger requests sooner in kick_requests()
2013-01-02 17:32:49 -08:00
Sage Weil
0fa6ebc600 libceph: fix protocol feature mismatch failure path
We should not set con->state to CLOSED here; that happens in
ceph_fault() in the caller, where it first asserts that the state
is not yet CLOSED.  Avoids a BUG when the features don't match.

Since the fail_protocol() has become a trivial wrapper, replace
calls to it with direct calls to reset_connection().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-27 20:27:04 -06:00
Alex Elder
122070a2ff libceph: WARN, don't BUG on unexpected connection states
A number of assertions in the ceph messenger are implemented with
BUG_ON(), killing the system if connection's state doesn't match
what's expected.  At this point our state model is (evidently) not
well understood enough for these assertions to trigger a BUG().
Convert all BUG_ON(con->state...) calls to be WARN_ON(con->state...)
so we learn about these issues without killing the machine.

We now recognize that a connection fault can occur due to a socket
closure at any time, regardless of the state of the connection.  So
there is really nothing we can assert about the state of the
connection at that point so eliminate that assertion.

Reported-by: Ugis <ugis22@gmail.com>
Tested-by: Ugis <ugis22@gmail.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-27 20:27:04 -06:00
Alex Elder
e6d50f67a6 libceph: always reset osds when kicking
When ceph_osdc_handle_map() is called to process a new osd map,
kick_requests() is called to ensure all affected requests are
updated if necessary to reflect changes in the osd map.  This
happens in two cases:  whenever an incremental map update is
processed; and when a full map update (or the last one if there is
more than one) gets processed.

In the former case, the kick_requests() call is followed immediately
by a call to reset_changed_osds() to ensure any connections to osds
affected by the map change are reset.  But for full map updates
this isn't done.

Both cases should be doing this osd reset.

Rather than duplicating the reset_changed_osds() call, move it into
the end of kick_requests().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-27 20:27:04 -06:00
Alex Elder
ab60b16d3c libceph: move linger requests sooner in kick_requests()
The kick_requests() function is called by ceph_osdc_handle_map()
when an osd map change has been indicated.  Its purpose is to
re-queue any request whose target osd is different from what it
was when it was originally sent.

It is structured as two loops, one for incomplete but registered
requests, and a second for handling completed linger requests.
As a special case, in the first loop if a request marked to linger
has not yet completed, it is moved from the request list to the
linger list.  This is as a quick and dirty way to have the second
loop handle sending the request along with all the other linger
requests.

Because of the way it's done now, however, this quick and dirty
solution can result in these incomplete linger requests never
getting re-sent as desired.  The problem lies in the fact that
the second loop only arranges for a linger request to be sent
if it appears its target osd has changed.  This is the proper
handling for *completed* linger requests (it avoids issuing
the same linger request twice to the same osd).

But although the linger requests added to the list in the first loop
may have been sent, they have not yet completed, so they need to be
re-sent regardless of whether their target osd has changed.

The first required fix is we need to avoid calling __map_request()
on any incomplete linger request.  Otherwise the subsequent
__map_request() call in the second loop will find the target osd
has not changed and will therefore not re-send the request.

Second, we need to be sure that a sent but incomplete linger request
gets re-sent.  If the target osd is the same with the new osd map as
it was when the request was originally sent, this won't happen.
This can be fixed through careful handling when we move these
requests from the request list to the linger list, by unregistering
the request *before* it is registered as a linger request.  This
works because a side-effect of unregistering the request is to make
the request's r_osd pointer be NULL, and *that* will ensure the
second loop actually re-sends the linger request.

Processing of such a request is done at that point, so continue with
the next one once it's been moved.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-27 20:27:04 -06:00
Linus Torvalds
40889e8d9f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few different groups of commits here.  The largest is
  Alex's ongoing work to enable the coming RBD features (cloning,
  striping).  There is some cleanup in libceph that goes along with it.

  Cyril and David have fixed some problems with NFS reexport (leaking
  dentries and page locks), and there is a batch of patches from Yan
  fixing problems with the fs client when running against a clustered
  MDS.  There are a few bug fixes mixed in for good measure, many of
  which will be going to the stable trees once they're upstream.

  My apologies for the late pull.  There is still a gremlin in the rbd
  map/unmap code and I was hoping to include the fix for that as well,
  but we haven't been able to confirm the fix is correct yet; I'll send
  that in a separate pull once it's nailed down."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
  rbd: get rid of rbd_{get,put}_dev()
  libceph: register request before unregister linger
  libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
  libceph: init event->node in ceph_osdc_create_event()
  libceph: init osd->o_node in create_osd()
  libceph: report connection fault with warning
  libceph: socket can close in any connection state
  rbd: don't use ENOTSUPP
  rbd: remove linger unconditionally
  rbd: get rid of RBD_MAX_SEG_NAME_LEN
  libceph: avoid using freed osd in __kick_osd_requests()
  ceph: don't reference req after put
  rbd: do not allow remove of mounted-on image
  libceph: Unlock unprocessed pages in start_read() error path
  ceph: call handle_cap_grant() for cap import message
  ceph: Fix __ceph_do_pending_vmtruncate
  ceph: Don't add dirty inode to dirty list if caps is in migration
  ceph: Fix infinite loop in __wake_requests
  ceph: Don't update i_max_size when handling non-auth cap
  bdi_register: add __printf verification, fix arg mismatch
  ...
2012-12-20 14:00:13 -08:00
Alex Elder
c89ce05e0c libceph: register request before unregister linger
In kick_requests(), we need to register the request before we
unregister the linger request.  Otherwise the unregister will
reset the request's osd pointer to NULL.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-20 10:56:39 -06:00
Alex Elder
a978fa20fb libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
The red-black node in the ceph osd request structure is initialized
in ceph_osdc_alloc_request() using rbd_init_node().  We do need to
initialize this, because in __unregister_request() we call
RB_EMPTY_NODE(), which expects the node it's checking to have
been initialized.  But rb_init_node() is apparently overkill, and
may in fact be on its way out.  So use RB_CLEAR_NODE() instead.

For a little more background, see this commit:
    4c199a93 rbtree: empty nodes have no color"

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-20 10:56:33 -06:00
Alex Elder
3ee5234df6 libceph: init event->node in ceph_osdc_create_event()
The red-black node node in the ceph osd event structure is not
initialized in create_osdc_create_event().  Because this node can
be the subject of a RB_EMPTY_NODE() call later on, we should ensure
the node is initialized properly for that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-20 10:56:28 -06:00
Alex Elder
f407731d12 libceph: init osd->o_node in create_osd()
The red-black node node in the ceph osd structure is not initialized
in create_osd().  Because this node can be the subject of a
RB_EMPTY_NODE() call later on, we should ensure the node is
initialized properly for that.  Add a call to RB_CLEAR_NODE()
initialize it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-20 10:56:21 -06:00
Alex Elder
28362986f8 libceph: report connection fault with warning
When a connection's socket disconnects, or if there's a protocol
error of some kind on the connection, a fault is signaled and
the connection is reset (closed and reopened, basically).  We
currently get an error message on the log whenever this occurs.

A ceph connection will attempt to reestablish a socket connection
repeatedly if a fault occurs.  This means that these error messages
will get repeatedly added to the log, which is undesirable.

Change the error message to be a warning, so they don't get
logged by default.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-20 10:56:13 -06:00
Alex Elder
7bb21d68c5 libceph: socket can close in any connection state
A connection's socket can close for any reason, independent of the
state of the connection (and without irrespective of the connection
mutex).  As a result, the connectino can be in pretty much any state
at the time its socket is closed.

Handle those other cases at the top of con_work().  Pull this whole
block of code into a separate function to reduce the clutter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 12:07:32 -06:00
Alex Elder
61c7403562 rbd: remove linger unconditionally
In __unregister_linger_request(), the request is being removed
from the osd client's req_linger list only when the request
has a non-null osd pointer.  It should be done whether or not
the request currently has an osd.

This is most likely a non-issue because I believe the request
will always have an osd when this function is called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 12:07:31 -06:00
Alex Elder
685a7555ca libceph: avoid using freed osd in __kick_osd_requests()
If an osd has no requests and no linger requests, __reset_osd()
will just remove it with a call to __remove_osd().  That drops
a reference to the osd, and therefore the osd may have been free
by the time __reset_osd() returns.  That function offers no
indication this may have occurred, and as a result the osd will
continue to be used even when it's no longer valid.

Change__reset_osd() so it returns an error (ENODEV) when it
deletes the osd being reset.  And change __kick_osd_requests() so it
returns immediately (before referencing osd again) if __reset_osd()
returns *any* error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 08:37:23 -06:00
Alex Elder
7d5f24812b ceph: don't reference req after put
In __unregister_request(), there is a call to list_del_init()
referencing a request that was the subject of a call to
ceph_osdc_put_request() on the previous line.  This is not
safe, because the request structure could have been freed
by the time we reach the list_del_init().

Fix this by reversing the order of these lines.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
2012-12-17 08:37:19 -06:00
Sage Weil
83aff95eb9 libceph: remove 'osdtimeout' option
This would reset a connection with any OSD that had an outstanding
request that was taking more than N seconds.  The idea was that if the
OSD was buggy, the client could compensate by resending the request.

In reality, this only served to hide server bugs, and we haven't
actually seen such a bug in quite a while.  Moreover, the userspace
client code never did this.

More importantly, often the request is taking a long time because the
OSD is trying to recover, or overloaded, and killing the connection
and retrying would only make the situation worse by giving the OSD
more work to do.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-13 08:13:06 -06:00
Alex Elder
72afc71ffc libceph: define ceph_pg_pool_name_by_id()
Define and export function ceph_pg_pool_name_by_id() to supply
the name of a pg pool whose id is given.  This will be used by
the next patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Sage Weil
0ed7285e00 libceph: fix osdmap decode error paths
Ensure that we set the err value correctly so that we do not pass a 0
value to ERR_PTR and confuse the calling code.  (In particular,
osd_client.c handle_map() will BUG(!newmap)).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-30 08:21:05 -05:00
Linus Torvalds
35fd3dc58d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes form Sage Weil:
 "There are two fixes in the messenger code, one that can trigger a NULL
  dereference, and one that error in refcounting (extra put).  There is
  also a trivial fix that in the fs client code that is triggered by NFS
  reexport."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix dentry reference leak in encode_fh()
  libceph: avoid NULL kref_put when osd reset races with alloc_msg
  rbd: reset BACKOFF if unable to re-queue
2012-10-29 08:49:25 -07:00
Sage Weil
7246240c7c libceph: avoid NULL kref_put from NULL alloc_msg return
The ceph_on_in_msg_alloc() method calls the ->alloc_msg() helper which
may return NULL.  It also drops con->mutex while it allocates a message,
which means that the connection state may change (e.g., get closed).  If
that happens, we clean up and bail out.  Avoid calling ceph_msg_put() on
a NULL return value and triggering a crash.

This was observed when an ->alloc_msg() call races with a timeout that
resends a zillion messages and resets the connection, and ->alloc_msg()
returns NULL (because the request was resent to another target).

Fixes http://tracker.newdream.net/issues/3342

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-26 16:35:04 -05:00
Sage Weil
9bd952615a libceph: avoid NULL kref_put when osd reset races with alloc_msg
The ceph_on_in_msg_alloc() method drops con->mutex while it allocates a
message.  If that races with a timeout that resends a zillion messages and
resets the connection, and the ->alloc_msg() method returns a NULL message,
it will call ceph_msg_put(NULL) and BUG.

Fix by only calling put if msg is non-NULL.

Fixes http://tracker.newdream.net/issues/3142

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-24 16:19:19 -07:00
Linus Torvalds
d25282d1c9 Merge branch 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module signing support from Rusty Russell:
 "module signing is the highlight, but it's an all-over David Howells frenzy..."

Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.

* 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
  X.509: Fix indefinite length element skip error handling
  X.509: Convert some printk calls to pr_devel
  asymmetric keys: fix printk format warning
  MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
  MODSIGN: Make mrproper should remove generated files.
  MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
  MODSIGN: Use the same digest for the autogen key sig as for the module sig
  MODSIGN: Sign modules during the build process
  MODSIGN: Provide a script for generating a key ID from an X.509 cert
  MODSIGN: Implement module signature checking
  MODSIGN: Provide module signing public keys to the kernel
  MODSIGN: Automatically generate module signing keys if missing
  MODSIGN: Provide Kconfig options
  MODSIGN: Provide gitignore and make clean rules for extra files
  MODSIGN: Add FIPS policy
  module: signature checking hook
  X.509: Add a crypto key parser for binary (DER) X.509 certificates
  MPILIB: Provide a function to read raw data into an MPI
  X.509: Add an ASN.1 decoder
  X.509: Add simple ASN.1 grammar compiler
  ...
2012-10-14 13:39:34 -07:00
Alex Elder
802c6d967f rbd: define common queue_con_delay()
This patch defines a single function, queue_con_delay() to call
queue_delayed_work() for a connection.  It basically generalizes
what was previously queue_con() by adding the delay argument.
queue_con() is now a simple helper that passes 0 for its delay.
queue_con_delay() returns 0 if it queued work or an errno if it
did not for some reason.

If con_work() finds the BACKOFF flag set for a connection, it now
calls queue_con_delay() to handle arranging to start again after a
delay.

Note about connection reference counts:  con_work() only ever gets
called as a work item function.  At the time that work is scheduled,
a reference to the connection is acquired, and the corresponding
con_work() call is then responsible for dropping that reference
before it returns.

Previously, the backoff handling inside con_work() silently handed
off its reference to delayed work it scheduled.  Now that
queue_con_delay() is used, a new reference is acquired for the
newly-scheduled work, and the original reference is dropped by the
con->ops->put() call at the end of the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-09 22:00:44 -07:00
Alex Elder
8618e30bc1 rbd: let con_work() handle backoff
Both ceph_fault() and con_work() include handling for imposing a
delay before doing further processing on a faulted connection.
The latter is used only if ceph_fault() is unable to.

Instead, just let con_work() always be responsible for implementing
the delay.  After setting up the delay value, set the BACKOFF flag
on the connection unconditionally and call queue_con() to ensure
con_work() will get called to handle it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-09 22:00:21 -07:00
Alex Elder
588377d619 rbd: reset BACKOFF if unable to re-queue
If ceph_fault() is unable to queue work after a delay, it sets the
BACKOFF connection flag so con_work() will attempt to do so.

In con_work(), when BACKOFF is set, if queue_delayed_work() doesn't
result in newly-queued work, it simply ignores this condition and
proceeds as if no backoff delay were desired.  There are two
problems with this--one of which is a bug.

The first problem is simply that the intended behavior is to back
off, and if we aren't able queue the work item to run after a delay
we're not doing that.

The only reason queue_delayed_work() won't queue work is if the
provided work item is already queued.  In the messenger, this
means that con_work() is already scheduled to be run again.  So
if we simply set the BACKOFF flag again when this occurs, we know
the next con_work() call will again attempt to hold off activity
on the connection until after the delay.

The second problem--the bug--is a leak of a reference count.  If
queue_delayed_work() returns 0 in con_work(), con->ops->put() drops
the connection reference held on entry to con_work().  However,
processing is (was) allowed to continue, and at the end of the
function a second con->ops->put() is called.

This patch fixes both problems.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-09 21:59:52 -07:00
Michel Lespinasse
4c199a93a2 rbtree: empty nodes have no color
Empty nodes have no color.  We can make use of this property to simplify
the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros.  Also,
we can get rid of the rb_init_node function which had been introduced by
commit 88d19cf379 ("timers: Add rb_init_node() to allow for stack
allocated rb nodes") to avoid some issue with the empty node's color not
being initialized.

I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
doing there, though.  axboe introduced them in commit 10fd48f237
("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev").  The way I
see it, the 'empty node' abstraction is only used by rbtree users to
flag nodes that they haven't inserted in any rbtree, so asking the
predecessor or successor of such nodes doesn't make any sense.

One final rb_init_node() caller was recently added in sysctl code to
implement faster sysctl name lookups.  This code doesn't make use of
RB_EMPTY_NODE at all, and from what I could see it only called
rb_init_node() under the mistaken assumption that such initialization was
required before node insertion.

[sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:32 +09:00
David Howells
cf7f601c06 KEYS: Add payload preparsing opportunity prior to key instantiate or update
Give the key type the opportunity to preparse the payload prior to the
instantiation and update routines being called.  This is done with the
provision of two new key type operations:

	int (*preparse)(struct key_preparsed_payload *prep);
	void (*free_preparse)(struct key_preparsed_payload *prep);

If the first operation is present, then it is called before key creation (in
the add/update case) or before the key semaphore is taken (in the update and
instantiate cases).  The second operation is called to clean up if the first
was called.

preparse() is given the opportunity to fill in the following structure:

	struct key_preparsed_payload {
		char		*description;
		void		*type_data[2];
		void		*payload;
		const void	*data;
		size_t		datalen;
		size_t		quotalen;
	};

Before the preparser is called, the first three fields will have been cleared,
the payload pointer and size will be stored in data and datalen and the default
quota size from the key_type struct will be stored into quotalen.

The preparser may parse the payload in any way it likes and may store data in
the type_data[] and payload fields for use by the instantiate() and update()
ops.

The preparser may also propose a description for the key by attaching it as a
string to the description field.  This can be used by passing a NULL or ""
description to the add_key() system call or the key_create_or_update()
function.  This cannot work with request_key() as that required the description
to tell the upcall about the key to be created.

This, for example permits keys that store PGP public keys to generate their own
name from the user ID and public key fingerprint in the key.

The instantiate() and update() operations are then modified to look like this:

	int (*instantiate)(struct key *key, struct key_preparsed_payload *prep);
	int (*update)(struct key *key, struct key_preparsed_payload *prep);

and the new payload data is passed in *prep, whether or not it was preparsed.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-08 13:49:48 +10:30
Sage Weil
6816282dab ceph: propagate layout error on osd request creation
If we are creating an osd request and get an invalid layout, return
an EINVAL to the caller.  We switch up the return to have an error
code instead of NULL implying -ENOMEM.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-01 17:20:00 -05:00
Sage Weil
d63b77f4c5 libceph: check for invalid mapping
If we encounter an invalid (e.g., zeroed) mapping, return an error
and avoid a divide by zero.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-01 17:20:00 -05:00
Wei Yongjun
cc4829e596 ceph: use list_move_tail instead of list_del/list_add_tail
Using list_move_tail() instead of list_del() + list_add_tail().

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:49 -05:00
Iulius Curt
7698f2f5e0 libceph: Fix sparse warning
Make ceph_monc_do_poolop() static to remove the following sparse warning:
 * net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was not
   declared. Should it be static?
Also drops the 'ceph_monc_' prefix, now being a private function.

Signed-off-by: Iulius Curt <icurt@ixiacom.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:49 -05:00
Sage Weil
290e33593d libceph: remove unused monc->have_fsid
This is unused; use monc->client->have_fsid.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder
5ce765a540 libceph: only kunmap kmapped pages
In write_partial_msg_pages(), pages need to be kmapped in order to
perform a CRC-32c calculation on them.  As an artifact of the way
this code used to be structured, the kunmap() call was separated
from the kmap() call and both were done conditionally.  But the
conditions under which the kmap() and kunmap() calls were made
differed, so there was a chance a kunmap() call would be done on a
page that had not been mapped.

The symptom of this was tripping a BUG() in kunmap_high() when
pkmap_count[nr] became 0.

Reported-by: Bryan K. Wright <bryan@virginia.edu>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-09-21 20:49:26 -07:00
Jim Schutt
6d4221b537 libceph: avoid truncation due to racing banners
Because the Ceph client messenger uses a non-blocking connect, it is
possible for the sending of the client banner to race with the
arrival of the banner sent by the peer.

When ceph_sock_state_change() notices the connect has completed, it
schedules work to process the socket via con_work().  During this
time the peer is writing its banner, and arrival of the peer banner
races with con_work().

If con_work() calls try_read() before the peer banner arrives, there
is nothing for it to do, after which con_work() calls try_write() to
send the client's banner.  In this case Ceph's protocol negotiation
can complete succesfully.

The server-side messenger immediately sends its banner and addresses
after accepting a connect request, *before* actually attempting to
read or verify the banner from the client.  As a result, it is
possible for the banner from the server to arrive before con_work()
calls try_read().  If that happens, try_read() will read the banner
and prepare protocol negotiation info via prepare_write_connect().
prepare_write_connect() calls con_out_kvec_reset(), which discards
the as-yet-unsent client banner.  Next, con_work() calls
try_write(), which sends the protocol negotiation info rather than
the banner that the peer is expecting.

The result is that the peer sees an invalid banner, and the client
reports "negotiation failed".

Fix this by moving con_out_kvec_reset() out of
prepare_write_connect() to its callers at all locations except the
one where the banner might still need to be sent.

[elder@inktak.com: added note about server-side behavior]

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-08-21 15:55:27 -07:00
Sage Weil
d1c338a509 libceph: delay debugfs initialization until we learn global_id
The debugfs directory includes the cluster fsid and our unique global_id.
We need to delay the initialization of the debug entry until we have
learned both the fsid and our global_id from the monitor or else the
second client can't create its debugfs entry and will fail (and multiple
client instances aren't properly reflected in debugfs).

Reported by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-08-20 10:03:15 -07:00
Sylvain Munaut
f0666b1ac8 libceph: fix crypto key null deref, memory leak
Avoid crashing if the crypto key payload was NULL, as when it was not correctly
allocated and initialized.  Also, avoid leaking it.

Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-08-02 09:19:20 -07:00
Linus Torvalds
cc8362b1f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...
2012-07-31 14:35:28 -07:00
Sage Weil
6139919133 libceph: recheck con state after allocating incoming message
We drop the lock when calling the ->alloc_msg() con op, which means
we need to (a) not clobber con->in_msg without the mutex held, and (b)
we need to verify that we are still in the OPEN state when we retake
it to avoid causing any mayhem.  If the state does change, -EAGAIN
will get us back to con_work() and loop.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:19:45 -07:00
Sage Weil
4740a623d2 libceph: change ceph_con_in_msg_alloc convention to be less weird
This function's calling convention is very limiting.  In particular,
we can't return any error other than ENOMEM (and only implicitly),
which is a problem (see next patch).

Instead, return an normal 0 or error code, and make the skip a pointer
output parameter.  Drop the useless in_hdr argument (we have the con
pointer).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:19:30 -07:00
Sage Weil
8636ea672f libceph: avoid dropping con mutex before fault
The ceph_fault() function takes the con mutex, so we should avoid
dropping it before calling it.  This fixes a potential race with
another thread calling ceph_con_close(), or _open(), or similar (we
don't reverify con->state after retaking the lock).

Add annotation so that lockdep realizes we will drop the mutex before
returning.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:17:13 -07:00
Sage Weil
7b862e07b1 libceph: verify state after retaking con lock after dispatch
We drop the con mutex when delivering a message.  When we retake the
lock, we need to verify we are still in the OPEN state before
preparing to read the next tag, or else we risk stepping on a
connection that has been closed.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:16:56 -07:00
Sage Weil
4f471e4a9c libceph: revoke mon_client messages on session restart
Revoke all mon_client messages when we shut down the old connection.
This is mostly moot since we are re-using the same ceph_connection,
but it is cleaner.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:16:40 -07:00
Sage Weil
8007b8d626 libceph: fix handling of immediate socket connect failure
If the connect() call immediately fails such that sock == NULL, we
still need con_close_socket() to reset our socket state to CLOSED.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:16:16 -07:00
Sage Weil
756a16a5d5 libceph: be less chatty about stray replies
There are many (normal) conditions that can lead to us getting
unexpected replies, include cluster topology changes, osd failures,
and timeouts.  There's no need to spam the console about it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:16:03 -07:00
Sage Weil
43c7427d10 libceph: clear all flags on con_close
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:16:02 -07:00
Sage Weil
4a86169208 libceph: clean up con flags
Rename flags with CON_FLAG prefix, move the definitions into the c file,
and (better) document their meaning.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:16:01 -07:00
Sage Weil
8dacc7da69 libceph: replace connection state bits with states
Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
and use those instead of the state bits.  All of the con->state checks are
now under the protection of the con mutex, so this is safe.  It also
simplifies many of the state checks because we can check for anything other
than the expected state instead of various bits for races we can think of.

This appears to hold up well to stress testing both with and without socket
failure injection on the server side.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:16:00 -07:00
Sage Weil
d7353dd5aa libceph: drop unnecessary CLOSED check in socket state change callback
If we are CLOSED, the socket is closed and we won't get these.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:59 -07:00
Sage Weil
ee76e0736d libceph: close socket directly from ceph_con_close()
It is simpler to do this immediately, since we already hold the con mutex.
It also avoids the need to deal with a not-quite-CLOSED socket in con_work.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:58 -07:00
Sage Weil
2e8cb10063 libceph: drop gratuitous socket close calls in con_work
If the state is CLOSED or OPENING, we shouldn't have a socket.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:58 -07:00
Sage Weil
a59b55a602 libceph: move ceph_con_send() closed check under the con mutex
Take the con mutex before checking whether the connection is closed to
avoid racing with someone else closing it.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:57 -07:00
Sage Weil
00650931e5 libceph: move msgr clear_standby under con mutex protection
Avoid dropping and retaking con->mutex in the ceph_con_send() case by
leaving locking up to the caller.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:56 -07:00
Sage Weil
3b5ede07b5 libceph: fix fault locking; close socket on lossy fault
If we fault on a lossy connection, we should still close the socket
immediately, and do so under the con mutex.

We should also take the con mutex before printing out the state bits in
the debug output.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 18:15:55 -07:00
Jiaju Zhang
048a9d2d06 libceph: trivial fix for the incorrect debug output
This is a trivial fix for the debug output, as it is inconsistent
with the function name so may confuse people when debugging.

[elder@inktank.com: switched to use __func__]

Signed-off-by: Jiaju Zhang <jjzhang@suse.de>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:36 -07:00
Sage Weil
85effe183d libceph: reset connection retry on successfully negotiation
We exponentially back off when we encounter connection errors.  If several
errors accumulate, we will eventually wait ages before even trying to
reconnect.

Fix this by resetting the backoff counter after a successful negotiation/
connection with the remote node.  Fixes ceph issue #2802.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:34 -07:00
Sage Weil
5469155f2b libceph: protect ceph_con_open() with mutex
Take the con mutex while we are initiating a ceph open.  This is necessary
because the may have previously been in use and then closed, which could
result in a racing workqueue running con_work().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:33 -07:00
Sage Weil
a410702697 libceph: (re)initialize bio_iter on start of message receive
Previously, we were opportunistically initializing the bio_iter if it
appeared to be uninitialized in the middle of the read path.  The problem
is that a sequence like:

 - start reading message
 - initialize bio_iter
 - read half a message
 - messenger fault, reconnect
 - restart reading message
 - ** bio_iter now non-NULL, not reinitialized **
 - read past end of bio, crash

Instead, initialize the bio_iter unconditionally when we allocate/claim
the message for read.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:31 -07:00
Sage Weil
6194ea895e libceph: resubmit linger ops when pg mapping changes
The linger op registration (i.e., watch) modifies the object state.  As
such, the OSD will reply with success if it has already applied without
doing the associated side-effects (setting up the watch session state).
If we lose the ACK and resubmit, we will see success but the watch will not
be correctly registered and we won't get notifies.

To fix this, always resubmit the linger op with a new tid.  We accomplish
this by re-registering as a linger (i.e., 'registered') if we are not yet
registered.  Then the second loop will treat this just like a normal
case of re-registering.

This mirrors a similar fix on the userland ceph.git, commit 5dd68b95, and
ceph bug #2796.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:31 -07:00
Sage Weil
8c50c81756 libceph: fix mutex coverage for ceph_con_close
Hold the mutex while twiddling all of the state bits to avoid possible
races.  While we're here, make not of why we cannot close the socket
directly.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:30 -07:00
Sage Weil
3a140a0d5c libceph: report socket read/write error message
We need to set error_msg to something useful before calling ceph_fault();
do so here for try_{read,write}().  This is more informative than

libceph: osd0 192.168.106.220:6801 (null)

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:29 -07:00
Sage Weil
546f04ef71 libceph: support crush tunables
The server side recently added support for tuning some magic
crush variables. Decode these variables if they are present, or use the
default values if they are not present.

Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:23 -07:00
Sage Weil
1fe60e51a3 libceph: move feature bits to separate header
This is simply cleanup that will keep things more closely synced with the
userland code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 16:23:22 -07:00
Guanjun He
a2a3258417 libceph: prevent the race of incoming work during teardown
Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.

Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:53 -07:00
Sage Weil
a16cb1f707 libceph: fix messenger retry
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time.  This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:52 -07:00
Sage Weil
cd43045c2d libceph: initialize rb, list nodes in ceph_osd_request
These don't strictly need to be initialized based on how they are used, but
it is good practice to do so.

Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:51 -07:00
Sage Weil
d50b409fb8 libceph: initialize msgpool message types
Initialize the type field for messages in a msgpool.  The caller was doing
this for osd ops, but not for the reply messages.

Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:50 -07:00
Linus Torvalds
3c4cfadef6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David S Miller:

 1) Remove the ipv4 routing cache.  Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache.  Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making.  Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought the be simple to fix at this
    point.  Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

 2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

 3) TCP SYN/ACK performance tweaks from Eric Dumazet.

 4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

 5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

 6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

 7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

 8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

 9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer.  Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data.

14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket.  The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too.  From Eric
    Dumazet.

15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
  genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
  r8169: revert "add byte queue limit support".
  ipv4: Change rt->rt_iif encoding.
  net: Make skb->skb_iif always track skb->dev
  ipv4: Prepare for change of rt->rt_iif encoding.
  ipv4: Remove all RTCF_DIRECTSRC handliing.
  ipv4: Really ignore ICMP address requests/replies.
  decnet: Don't set RTCF_DIRECTSRC.
  net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
  ipv4: Remove redundant assignment
  rds: set correct msg_namelen
  openvswitch: potential NULL deref in sample()
  tcp: dont drop MTU reduction indications
  bnx2x: Add new 57840 device IDs
  tcp: avoid oops in tcp_metrics and reset tcpm_stamp
  niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
  niu: Fix to check for dma mapping errors.
  net: Fix references to out-of-scope variables in put_cmsg_compat()
  net: ethernet: davinci_emac: add pm_runtime support
  net: ethernet: davinci_emac: Remove unnecessary #include
  ...
2012-07-24 10:01:50 -07:00
Sage Weil
5bdca4e076 libceph: fix messenger retry
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time.  This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 19:35:59 -07:00
Ben Hutchings
ae86b9e384 net: Fix non-kernel-doc comments with kernel-doc start marker
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 23:13:45 -07:00
Sage Weil
fbb85a478f libceph: allow sock transition from CONNECTING to CLOSED
It is possible to close a socket that is in the OPENING state.  For
example, it can happen if ceph_con_close() is called on the con before
the TCP connection is established.  con_work() will come around and shut
down the socket.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:39 -07:00
Sage Weil
735a72ef95 libceph: initialize mon_client con only once
Do not re-initialize the con on every connection attempt.  When we
ceph_con_close, there may still be work queued on the socket (e.g., to
close it), and re-initializing will clobber the work_struct state.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:38 -07:00
Sage Weil
b7a9e5dd40 libceph: set peer name on con_open, not init
The peer name may change on each open attempt, even when the connection is
reused.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:35 -07:00
Alex Elder
bc18f4b1c8 libceph: add some fine ASCII art
Sage liked the state diagram I put in my commit description so
I'm putting it in with the code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:32 -07:00
Alex Elder
5821bd8ccd libceph: small changes to messenger.c
This patch gathers a few small changes in "net/ceph/messenger.c":
  out_msg_pos_next()
    - small logic change that mostly affects indentation
  write_partial_msg_pages().
    - use a local variable trail_off to represent the offset into
      a message of the trail portion of the data (if present)
    - once we are in the trail portion we will always be there, so we
      don't always need to check against our data position
    - avoid computing len twice after we've reached the trail
    - get rid of the variable tmpcrc, which is not needed
    - trail_off and trail_len never change so mark them const
    - update some comments
  read_partial_message_bio()
    - bio_iovec_idx() will never return an error, so don't bother
      checking for it

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:31 -07:00
Alex Elder
7593af920b libceph: distinguish two phases of connect sequence
Currently a ceph connection enters a "CONNECTING" state when it
begins the process of (re-)connecting with its peer.  Once the two
ends have successfully exchanged their banner and addresses, an
additional NEGOTIATING bit is set in the ceph connection's state to
indicate the connection information exhange has begun.  The
CONNECTING bit/state continues to be set during this phase.

Rather than have the CONNECTING state continue while the NEGOTIATING
bit is set, interpret these two phases as distinct states.  In other
words, when NEGOTIATING is set, clear CONNECTING.  That way only
one of them will be active at a time.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:30 -07:00
Alex Elder
ab166d5aa3 libceph: separate banner and connect writes
There are two phases in the process of linking together the two ends
of a ceph connection.  The first involves exchanging a banner and
IP addresses, and if that is successful a second phase exchanges
some detail about each side's connection capabilities.

When initiating a connection, the client side now queues to send
its information for both phases of this process at the same time.
This is probably a bit more efficient, but it is slightly messier
from a layering perspective in the code.

So rearrange things so that the client doesn't send the connection
information until it has received and processed the response in the
initial banner phase (in process_banner()).

Move the code (in the (con->sock == NULL) case in try_write()) that
prepares for writing the connection information, delaying doing that
until the banner exchange has completed.  Move the code that begins
the transition to this second "NEGOTIATING" phase out of
process_banner() and into its caller, so preparing to write the
connection information and preparing to read the response are
adjacent to each other.

Finally, preparing to write the connection information now requires
the output kvec to be reset in all cases, so move that into the
prepare_write_connect() and delete it from all callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:29 -07:00
Alex Elder
e27947c767 libceph: define and use an explicit CONNECTED state
There is no state explicitly defined when a ceph connection is fully
operational.  So define one.

It's set when the connection sequence completes successfully, and is
cleared when the connection gets closed.

Be a little more careful when examining the old state when a socket
disconnect event is reported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:27 -07:00
Alex Elder
3ec50d1868 libceph: clear NEGOTIATING when done
A connection state's NEGOTIATING bit gets set while in CONNECTING
state after we have successfully exchanged a ceph banner and IP
addresses with the connection's peer (the server).  But that bit
is not cleared again--at least not until another connection attempt
is initiated.

Instead, clear it as soon as the connection is fully established.
Also, clear it when a socket connection gets prematurely closed
in the midst of establishing a ceph connection (in case we had
reached the point where it was set).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:26 -07:00
Alex Elder
bb9e6bba5d libceph: clear CONNECTING in ceph_con_close()
A connection that is closed will no longer be connecting.  So
clear the CONNECTING state bit in ceph_con_close().  Similarly,
if the socket has been closed we no longer are in connecting
state (a new connect sequence will need to be initiated).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:24 -07:00
Alex Elder
456ea46865 libceph: don't touch con state in con_close_socket()
In con_close_socket(), a connection's SOCK_CLOSED flag gets set and
then cleared while its shutdown method is called and its reference
gets dropped.

Previously, that flag got set only if it had not already been set,
so setting it in con_close_socket() might have prevented additional
processing being done on a socket being shut down.  We no longer set
SOCK_CLOSED in the socket event routine conditionally, so setting
that bit here no longer provides whatever benefit it might have
provided before.

A race condition could still leave the SOCK_CLOSED bit set even
after we've issued the call to con_close_socket(), so we still clear
that bit after shutting the socket down.  Add a comment explaining
the reason for this.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:23 -07:00
Alex Elder
d65c9e0b9e libceph: just set SOCK_CLOSED when state changes
When a TCP_CLOSE or TCP_CLOSE_WAIT event occurs, the SOCK_CLOSED
connection flag bit is set, and if it had not been previously set
queue_con() is called to ensure con_work() will get a chance to
handle the changed state.

con_work() atomically checks--and if set, clears--the SOCK_CLOSED
bit if it was set.  This means that even if the bit were set
repeatedly, the related processing in con_work() only gets called
once per transition of the bit from 0 to 1.

What's important then is that we ensure con_work() gets called *at
least* once when a socket close event occurs, not that it gets
called *exactly* once.

The work queue mechanism already takes care of queueing work
only if it is not already queued, so there's no need for us
to call queue_con() conditionally.

So this patch just makes it so the SOCK_CLOSED flag gets set
unconditionally in ceph_sock_state_change().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:22 -07:00
Alex Elder
188048bce3 libceph: don't change socket state on sock event
Currently the socket state change event handler records an error
message on a connection to distinguish a close while connecting from
a close while a connection was already established.

Changing connection information during handling of a socket event is
not very clean, so instead move this assignment inside con_work(),
where it can be done during normal connection-level processing (and
under protection of the connection mutex as well).

Move the handling of a socket closed event up to the top of the
processing loop in con_work(); there's no point in handling backoff
etc. if we have a newly-closed socket to take care of.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:20 -07:00
Alex Elder
a8d00e3cde libceph: SOCK_CLOSED is a flag, not a state
The following commit changed it so SOCK_CLOSED bit was stored in
a connection's new "flags" field rather than its "state" field.

    libceph: start separating connection flags from state
    commit 928443cd

That bit is used in con_close_socket() to protect against setting an
error message more than once in the socket event handler function.

Unfortunately, the field being operated on in that function was not
updated to be "flags" as it should have been.  This fixes that
error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:19 -07:00
Alex Elder
abdaa6a849 libceph: don't use bio_iter as a flag
Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
    commit 43643528cc
    Author: Yan, Zheng <zheng.z.yan@intel.com>
    rbd: Clear ceph_msg->bio_iter for retransmitted message

We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()).  This means we no
longer need to use the msg->bio_iter field as a flag.

So just don't do that any more.  Trust prepare_write_message_data()
to ensure msg->bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:18 -07:00
Alex Elder
572c588eda libceph: move init of bio_iter
If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already.  This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:16 -07:00
Alex Elder
df6ad1f973 libceph: move init_bio_*() functions up
Move init_bio_iter() and iter_bio_next() up in their source file so
the'll be defined before they're needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:15 -07:00
Alex Elder
fd154f3c75 libceph: don't mark footer complete before it is
This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
flag before the CRC for the data portion (recorded in the footer)
has been completely computed.  Hold off setting the complete flag
until we've decided it's ready to send.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:13 -07:00
Alex Elder
84ca8fc87f libceph: encapsulate advancing msg page
In write_partial_msg_pages(), once all the data from a page has been
sent we advance to the next one.  Put the code that takes care of
this into its own function.

While modifying write_partial_msg_pages(), make its local variable
"in_trail" be Boolean, and use the local variable "msg" (which is
just the connection's current out_msg pointer) consistently.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:12 -07:00
Alex Elder
739c905baa libceph: encapsulate out message data setup
Move the code that prepares to write the data portion of a message
into its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:10 -07:00
Sage Weil
d59315ca8c libceph: drop ceph_con_get/put helpers and nref member
These are no longer used.  Every ceph_connection instance is embedded in
another structure, and refcounts manipulated via the get/put ops.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-06-22 08:13:45 -05:00
Sage Weil
36eb71aa57 libceph: use con get/put methods
The ceph_con_get/put() helpers manipulate the embedded con ref
count, which isn't used now that ceph_connections are embedded in
other structures.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-22 07:30:27 -05:00
Sage Weil
642c0dbde3 libceph: flush msgr queue during mon_client shutdown
We need to flush the msgr workqueue during mon_client shutdown to
ensure that any work affecting our embedded ceph_connection is
finished so that we can be safely destroyed.

Previously, we were flushing the work queue after osd_client
shutdown and before mon_client shutdown to ensure that any osd
connection refs to authorizers are flushed.  Remove the redundant
flush, and document in the comment that the mon_client flush is
needed to cover that case as well.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit f3dea7edd3)
2012-06-20 07:43:50 -05:00
Yan, Zheng
b132cf4c73 rbd: Clear ceph_msg->bio_iter for retransmitted message
The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 43643528cc)
2012-06-20 07:43:50 -05:00
Sage Weil
88ed6ea0b2 libceph: use con get/put ops from osd_client
There were a few direct calls to ceph_con_{get,put}() instead of the con
ops from osd_client.c.  This is a bug since those ops aren't defined to
be ceph_con_get/put.

This breaks refcounting on the ceph_osd structs that contain the
ceph_connections, and could lead to all manner of strangeness.

The purpose of the ->get and ->put methods in a ceph connection are
to allow the connection to indicate it has a reference to something
external to the messaging system, *not* to indicate something
external has a reference to the connection.

[elder@inktank.com: added that last sentence]

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 0d47766f14)
2012-06-20 07:43:49 -05:00
Alex Elder
680584fab0 libceph: osd_client: don't drop reply reference too early
In ceph_osdc_release_request(), a reference to the r_reply message
is dropped.  But just after that, that same message is revoked if it
was in use to receive an incoming reply.  Reorder these so we are
sure we hold a reference until we're actually done with the message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ab8cb34a4b)
2012-06-20 07:43:48 -05:00
Dan Carpenter
26ce171915 libceph: fix NULL dereference in reset_connection()
We dereference "con->in_msg" on the line after it was set to NULL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-19 08:52:33 -05:00
Sage Weil
9a64e8e0ac Linux 3.5-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJPyr4LAAoJEHm+PkMAQRiGhvMH/1uXaDmJiiyAtMhC9kQbLclK
 5RpUOV+ukRrPXBJhwWGEZvC9G/DiWAfZ/19Ee6qTGZbA46yxkgZklqO+bw7fuOLH
 dPf4MNXdhgOgbs0KkVAk6aXIYzIU836pcYg+LcapG8E8SZp3SWbJzrVbUPFwPM+m
 Sv11ZcpJfM2HH9wFRdKErUOiZHsMY+LZHcw0nx+BObytjgzBbzHNkpF57F714TUO
 QplYpIToO3XtGhIM1yRDxww+2zFlVNsCZ8IC57EDbLb8BMZWuyZoFgWZqLAnrU0u
 vy7CHLledMSvs855juJ9JxGo/EDnfwJpCnjmcp8BY+h4b5T/k5mGK6d9aeXYRf4=
 =CcWn
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc1'

Linux 3.5-rc1

Conflicts:
	net/ceph/messenger.c
2012-06-15 12:32:04 -07:00
Sage Weil
f3dea7edd3 libceph: flush msgr queue during mon_client shutdown
We need to flush the msgr workqueue during mon_client shutdown to
ensure that any work affecting our embedded ceph_connection is
finished so that we can be safely destroyed.

Previously, we were flushing the work queue after osd_client
shutdown and before mon_client shutdown to ensure that any osd
connection refs to authorizers are flushed.  Remove the redundant
flush, and document in the comment that the mon_client flush is
needed to cover that case as well.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-15 11:26:40 -07:00
Sage Weil
89a86be0ce libceph: transition socket state prior to actual connect
Once we call ->connect(), we are racing against the actual
connection, and a subsequent transition from CONNECTING ->
CONNECTED.  Set the state to CONNECTING before that, under the
protection of the mutex, to avoid the race.

This was introduced in 928443cd96,
with the original socket state code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-15 11:26:37 -07:00
Xi Wang
a550604950 libceph: fix overflow in osdmap_apply_incremental()
On 32-bit systems, a large `pglen' would overflow `pglen*sizeof(u32)'
and bypass the check ceph_decode_need(p, end, pglen*sizeof(u32), bad).
It would also overflow the subsequent kmalloc() size, leading to
out-of-bounds write.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-07 08:28:16 -05:00
Xi Wang
e91a9b639a libceph: fix overflow in osdmap_decode()
On 32-bit systems, a large `n' would overflow `n * sizeof(u32)' and bypass
the check ceph_decode_need(p, end, n * sizeof(u32), bad).  It would also
overflow the subsequent kmalloc() size, leading to out-of-bounds write.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-07 08:28:10 -05:00
Xi Wang
ad3b904c07 libceph: fix overflow in __decode_pool_names()
`len' is read from network and thus needs validation.  Otherwise a
large `len' would cause out-of-bounds access via the memcpy() call.
In addition, len = 0xffffffff would overflow the kmalloc() size,
leading to out-of-bounds write.

This patch adds a check of `len' via ceph_decode_need().  Also use
kstrndup rather than kmalloc/memcpy.

[elder@inktank.com: added -ENOMEM return for null kstrndup() result]

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-07 08:28:04 -05:00
Yan, Zheng
43643528cc rbd: Clear ceph_msg->bio_iter for retransmitted message
The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-07 08:27:33 -05:00
Alex Elder
8921d114f5 libceph: make ceph_con_revoke_message() a msg op
ceph_con_revoke_message() is passed both a message and a ceph
connection.  A ceph_msg allocated for incoming messages on a
connection always has a pointer to that connection, so there's no
need to provide the connection when revoking such a message.

Note that the existing logic does not preclude the message supplied
being a null/bogus message pointer.  The only user of this interface
is the OSD client, and the only value an osd client passes is a
request's r_reply field.  That is always non-null (except briefly in
an error path in ceph_osdc_alloc_request(), and that drops the
only reference so the request won't ever have a reply to revoke).
So we can safely assume the passed-in message is non-null, but add a
BUG_ON() to make it very obvious we are imposing this restriction.

Rename the function ceph_msg_revoke_incoming() to reflect that it is
really an operation on an incoming message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:55 -05:00
Alex Elder
6740a845b2 libceph: make ceph_con_revoke() a msg operation
ceph_con_revoke() is passed both a message and a ceph connection.
Now that any message associated with a connection holds a pointer
to that connection, there's no need to provide the connection when
revoking a message.

This has the added benefit of precluding the possibility of the
providing the wrong connection pointer.  If the message's connection
pointer is null, it is not being tracked by any connection, so
revoking it is a no-op.  This is supported as a convenience for
upper layers, so they can revoke a message that is not actually
"in flight."

Rename the function ceph_msg_revoke() to reflect that it is really
an operation on a message, not a connection.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
92ce034b5a libceph: have messages take a connection reference
There are essentially two types of ceph messages: incoming and
outgoing.  Outgoing messages are always allocated via ceph_msg_new(),
and at the time of their allocation they are not associated with any
particular connection.  Incoming messages are always allocated via
ceph_con_in_msg_alloc(), and they are initially associated with the
connection from which incoming data will be placed into the message.

When an outgoing message gets sent, it becomes associated with a
connection and remains that way until the message is successfully
sent.  The association of an incoming message goes away at the point
it is sent to an upper layer via a con->ops->dispatch method.

This patch implements reference counting for all ceph messages, such
that every message holds a reference (and a pointer) to a connection
if and only if it is associated with that connection (as described
above).


For background, here is an explanation of the ceph message
lifecycle, emphasizing when an association exists between a message
and a connection.

Outgoing Messages
An outgoing message is "owned" by its allocator, from the time it is
allocated in ceph_msg_new() up to the point it gets queued for
sending in ceph_con_send().  Prior to that point the message's
msg->con pointer is null; at the point it is queued for sending its
message pointer is assigned to refer to the connection.  At that
time the message is inserted into a connection's out_queue list.

When a message on the out_queue list has been sent to the socket
layer to be put on the wire, it is transferred out of that list and
into the connection's out_sent list.  At that point it is still owned
by the connection, and will remain so until an acknowledgement is
received from the recipient that indicates the message was
successfully transferred.  When such an acknowledgement is received
(in process_ack()), the message is removed from its list (in
ceph_msg_remove()), at which point it is no longer associated with
the connection.

So basically, any time a message is on one of a connection's lists,
it is associated with that connection.  Reference counting outgoing
messages can thus be done at the points a message is added to the
out_queue (in ceph_con_send()) and the point it is removed from
either its two lists (in ceph_msg_remove())--at which point its
connection pointer becomes null.

Incoming Messages
When an incoming message on a connection is getting read (in
read_partial_message()) and there is no message in con->in_msg,
a new one is allocated using ceph_con_in_msg_alloc().  At that
point the message is associated with the connection.  Once that
message has been completely and successfully read, it is passed to
upper layer code using the connection's con->ops->dispatch method.
At that point the association between the message and the connection
no longer exists.

Reference counting of connections for incoming messages can be done
by taking a reference to the connection when the message gets
allocated, and releasing that reference when it gets handed off
using the dispatch method.

We should never fail to get a connection reference for a
message--the since the caller should already hold one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
38941f8031 libceph: have messages point to their connection
When a ceph message is queued for sending it is placed on a list of
pending messages (ceph_connection->out_queue).  When they are
actually sent over the wire, they are moved from that list to
another (ceph_connection->out_sent).  When acknowledgement for the
message is received, it is removed from the sent messages list.

During that entire time the message is "in the possession" of a
single ceph connection.  Keep track of that connection in the
message.  This will be used in the next patch (and is a helpful
bit of information for debugging anyway).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
1c20f2d267 libceph: tweak ceph_alloc_msg()
The function ceph_alloc_msg() is only used to allocate a message
that will be assigned to a connection's in_msg pointer.  Rename the
function so this implied usage is more clear.

In addition, make that assignment inside the function (again, since
that's precisely what it's intended to be used for).  This allows us
to return what is now provided via the passed-in address of a "skip"
variable.  The return type is now Boolean to be explicit that there
are only two possible outcomes.

Make sure the result of an ->alloc_msg method call always sets the
value of *skip properly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
1bfd89f4e6 libceph: fully initialize connection in con_init()
Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init().  Rearrange the arguments so the connection pointer
is first.  Hide the byte-swapping of the peer entity number inside
ceph_con_init()

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
20581c1faf libceph: init monitor connection when opening
Hold off initializing a monitor client's connection until just
before it gets opened for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Sage Weil
ec87ef4309 libceph: drop connection refcounting for mon_client
All references to the embedded ceph_connection come from the msgr
workqueue, which is drained prior to mon_client destruction.  That
means we can ignore con refcounting entirely.

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
67130934fb libceph: embed ceph connection structure in mon_client
A monitor client has a pointer to a ceph connection structure in it.
This is the only one of the three ceph client types that do it this
way; the OSD and MDS clients embed the connection into their main
structures.  There is always exactly one ceph connection for a
monitor client, so there is no need to allocate it separate from the
monitor client structure.

So switch the ceph_mon_client structure to embed its
ceph_connection structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Sage Weil
0d47766f14 libceph: use con get/put ops from osd_client
There were a few direct calls to ceph_con_{get,put}() instead of the con
ops from osd_client.c.  This is a bug since those ops aren't defined to
be ceph_con_get/put.

This breaks refcounting on the ceph_osd structs that contain the
ceph_connections, and could lead to all manner of strangeness.

The purpose of the ->get and ->put methods in a ceph connection are
to allow the connection to indicate it has a reference to something
external to the messaging system, *not* to indicate something
external has a reference to the connection.

[elder@inktank.com: added that last sentence]

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
ab8cb34a4b libceph: osd_client: don't drop reply reference too early
In ceph_osdc_release_request(), a reference to the r_reply message
is dropped.  But just after that, that same message is revoked if it
was in use to receive an incoming reply.  Reorder these so we are
sure we hold a reference until we're actually done with the message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
Alex Elder
a5988c490e libceph: set CLOSED state bit in con_init
Once a connection is fully initialized, it is really in a CLOSED
state, so make that explicit by setting the bit in its state field.

It is possible for a connection in NEGOTIATING state to get a
failure, leading to ceph_fault() and ultimately ceph_con_close().
Clear that bits if it is set in that case, to reflect that the
connection truly is closed and is no longer participating in a
connect sequence.

Issue a warning if ceph_con_open() is called on a connection that
is not in CLOSED state.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
e10006f807 libceph: provide osd number when creating osd
Pass the osd number to the create_osd() routine, and move the
initialization of fields that depend on it therein.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
ce2c8903e7 libceph: start tracking connection socket state
Start explicitly keeping track of the state of a ceph connection's
socket, separate from the state of the connection itself.  Create
placeholder functions to encapsulate the state transitions.

    --------
    | NEW* |  transient initial state
    --------
        | con_sock_state_init()
        v
    ----------
    | CLOSED |  initialized, but no socket (and no
    ----------  TCP connection)
     ^      \
     |       \ con_sock_state_connecting()
     |        ----------------------
     |                              \
     + con_sock_state_closed()       \
     |\                               \
     | \                               \
     |  -----------                     \
     |  | CLOSING |  socket event;       \
     |  -----------  await close          \
     |       ^                            |
     |       |                            |
     |       + con_sock_state_closing()   |
     |      / \                           |
     |     /   ---------------            |
     |    /                   \           v
     |   /                    --------------
     |  /    -----------------| CONNECTING |  socket created, TCP
     |  |   /                 --------------  connect initiated
     |  |   | con_sock_state_connected()
     |  |   v
    -------------
    | CONNECTED |  TCP connection established
    -------------

Make the socket state an atomic variable, reinforcing that it's a
distinct transtion with no possible "intermediate/both" states.
This is almost certainly overkill at this point, though the
transitions into CONNECTED and CLOSING state do get called via
socket callback (the rest of the transitions occur with the
connection mutex held).  We can back out the atomicity later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
928443cd96 libceph: start separating connection flags from state
A ceph_connection holds a mixture of connection state (as in "state
machine" state) and connection flags in a single "state" field.  To
make the distinction more clear, define a new "flags" field and use
it rather than the "state" field to hold Boolean flag values.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
15d9882c33 libceph: embed ceph messenger structure in ceph_client
A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.

Switch the ceph_client structure to embed its ceph_messenger
structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:56 -05:00
Alex Elder
e22004235a libceph: rename kvec_reset and kvec_add functions
The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
are entirely private functions, so drop the "ceph_" prefix in their
name to make them slightly more wieldy.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:55 -05:00
Alex Elder
327800bdc2 libceph: rename socket callbacks
Change the names of the three socket callback functions to make it
more obvious they're specifically associated with a connection's
socket (not the ceph connection that uses it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:55 -05:00
Alex Elder
6384bb8b8e libceph: kill bad_proto ceph connection op
No code sets a bad_proto method in its ceph connection operations
vector, so just get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-06-01 08:37:55 -05:00
Alex Elder
e5e372da9a libceph: eliminate connection state "DEAD"
The ceph connection state "DEAD" is never set and is therefore not
needed.  Eliminate it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-06-01 08:37:55 -05:00
Linus Torvalds
af56e0aa35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "There are some updates and cleanups to the CRUSH placement code, a bug
  fix with incremental maps, several cleanups and fixes from Josh Durgin
  in the RBD block device code, a series of cleanups and bug fixes from
  Alex Elder in the messenger code, and some miscellaneous bounds
  checking and gfp cleanups/fixes."

Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
networking people preferring "unsigned int" over just "unsigned".

* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
  libceph: fix pg_temp updates
  libceph: avoid unregistering osd request when not registered
  ceph: add auth buf in prepare_write_connect()
  ceph: rename prepare_connect_authorizer()
  ceph: return pointer from prepare_connect_authorizer()
  ceph: use info returned by get_authorizer
  ceph: have get_authorizer methods return pointers
  ceph: ensure auth ops are defined before use
  ceph: messenger: reduce args to create_authorizer
  ceph: define ceph_auth_handshake type
  ceph: messenger: check return from get_authorizer
  ceph: messenger: rework prepare_connect_authorizer()
  ceph: messenger: check prepare_write_connect() result
  ceph: don't set WRITE_PENDING too early
  ceph: drop msgr argument from prepare_write_connect()
  ceph: messenger: send banner in process_connect()
  ceph: messenger: reset connection kvec caller
  libceph: don't reset kvec in prepare_write_banner()
  ceph: ignore preferred_osd field
  ceph: fully initialize new layout
  ...
2012-05-30 11:17:19 -07:00
Sage Weil
6bd9adbdf9 libceph: fix pg_temp updates
Usually, we are adding pg_temp entries or removing them.  Occasionally they
update.  In that case, osdmap_apply_incremental() was failing because the
rbtree entry already exists.

Fix by removing the existing entry before inserting a new one.

Fixes http://tracker.newdream.net/issues/2446

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-05-21 20:40:42 -07:00
Sage Weil
35f9f8a09e libceph: avoid unregistering osd request when not registered
There is a race between two __unregister_request() callers: the
reply path and the ceph_osdc_wait_request().  If we get a reply
*and* the timeout expires at roughly the same time, both callers
will try to unregister the request, and the second one will do bad
things.

Simply check if the request is still already unregistered; if so,
return immediately and do nothing.

Fixes http://tracker.newdream.net/issues/2420

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-05-18 17:36:00 -07:00
Alex Elder
3da54776e2 ceph: add auth buf in prepare_write_connect()
Move the addition of the authorizer buffer to a connection's
out_kvec out of get_connect_authorizer() and into its caller.  This
way, the caller--prepare_write_connect()--can avoid adding the
connect header to out_kvec before it has been fully initialized.

Prior to this patch, it was possible for a connect header to be
sent over the wire before the authorizer protocol or buffer length
fields were initialized.  An authorizer buffer associated with that
header could also be queued to send only after the connection header
that describes it was on the wire.

Fixes http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-18 17:35:59 -07:00
Alex Elder
dac1e716c6 ceph: rename prepare_connect_authorizer()
Change the name of prepare_connect_authorizer().  The next
patch is going to make this function no longer add anything to the
connection's out_kvec, so it will no longer fit the pattern of
the rest of the prepare_connect_*() functions.

In addition, pass the address of a variable that will hold the
authorization protocol to use.  Move the assignment of that to the
connection's out_connect structure into prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
729796be91 ceph: return pointer from prepare_connect_authorizer()
Change prepare_connect_authorizer() so it returns a pointer (or
pointer-coded error).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
8f43fb5389 ceph: use info returned by get_authorizer
Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
a3530df33e ceph: have get_authorizer methods return pointers
Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value.  This is to pave the way for making use of the
returned value in an upcoming patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
a255651d4c ceph: ensure auth ops are defined before use
In the create_authorizer method for both the mds and osd clients,
the auth_client->ops pointer is blindly dereferenced.  There is no
obvious guarantee that this pointer has been assigned.  And
furthermore, even if the ops pointer is non-null there is definitely
no guarantee that the create_authorizer or destroy_authorizer
methods are defined.

Add checks in both routines to make sure they are defined (non-null)
before use.  Add similar checks in a few other spots in these files
while we're at it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
74f1869f76 ceph: messenger: reduce args to create_authorizer
Make use of the new ceph_auth_handshake structure in order to reduce
the number of arguments passed to the create_authorizor method in
ceph_auth_client_ops.  Use a local variable of that type as a
shorthand in the get_authorizer method definitions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
6c4a19158b ceph: define ceph_auth_handshake type
The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to "authorizers."  Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.

Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
complete canonical path.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
ed96af6460 ceph: messenger: check return from get_authorizer
In prepare_connect_authorizer(), a connection's get_authorizer
method is called but ignores its return value.  This function can
return an error, so check for it and return it if that ever occurs.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
b1c6b9803f ceph: messenger: rework prepare_connect_authorizer()
Change prepare_connect_authorizer() so it returns without dropping
the connection mutex if the connection has no get_authorizer method.

Use the symbolic CEPH_AUTH_UNKNOWN instead of 0 when assigning
authorization protocols.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
5a0f8fdd8a ceph: messenger: check prepare_write_connect() result
prepare_write_connect() can return an error, but only one of its
callers checks for it.  All the rest are in functions that already
return errors, so it should be fine to return the error if one
gets returned.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
e10c758e40 ceph: don't set WRITE_PENDING too early
prepare_write_connect() prepares a connect message, then sets
WRITE_PENDING on the connection.  Then *after* this, it calls
prepare_connect_authorizer(), which updates the content of the
connection buffer already queued for sending.  It's also possible it
will result in prepare_write_connect() returning -EAGAIN despite the
WRITE_PENDING big getting set.

Fix this by preparing the connect authorizer first, setting the
WRITE_PENDING bit only after that is done.

Partially addresses http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
e825a66df9 ceph: drop msgr argument from prepare_write_connect()
In all cases, the value passed as the msgr argument to
prepare_write_connect() is just con->msgr.  Just get the msgr
value from the ceph connection and drop the unneeded argument.

The only msgr passed to prepare_write_banner() is also therefore
just the one from con->msgr, so change that function to drop the
msgr argument as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
41b90c0085 ceph: messenger: send banner in process_connect()
prepare_write_connect() has an argument indicating whether a banner
should be sent out before sending out a connection message.  It's
only ever set in one of its callers, so move the code that arranges
to send the banner into that caller and drop the "include_banner"
argument from prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
84fb3adf64 ceph: messenger: reset connection kvec caller
Reset a connection's kvec fields in the caller rather than in
prepare_write_connect().   This ends up repeating a few lines of
code but it's improving the separation between distinct operations
on the connection, which we can take advantage of later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
d329156f16 libceph: don't reset kvec in prepare_write_banner()
Move the kvec reset for a connection out of prepare_write_banner and
into its only caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
fd51653f78 ceph: messenger: change read_partial() to take "end" arg
Make the second argument to read_partial() be the ending input byte
position rather than the beginning offset it now represents.  This
amounts to moving the addition "to + size" into the caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-14 12:16:42 -05:00
Alex Elder
e6cee71fac ceph: messenger: update "to" in read_partial() caller
read_partial() always increases whatever "to" value is supplied by
adding the requested size to it, and that's the only thing it does
with that pointed-to value.

Do that pointer advance in the caller (and then only when the
updated value will be subsequently used), and change the "to"
parameter to be an in-only and non-pointer value.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-14 12:16:42 -05:00
Alex Elder
57dac9d162 ceph: messenger: use read_partial() in read_partial_message()
There are two blocks of code in read_partial_message()--those that
read the header and footer of the message--that can be replaced by a
call to read_partial().  Do that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-14 12:16:41 -05:00
Alex Elder
065a68f916 ceph: osd_client: fix endianness bug in osd_req_encode_op()
From Al Viro <viro@zeniv.linux.org.uk>

Al Viro noticed that we were using a non-cpu-encoded value in
a switch statement in osd_req_encode_op().  The result would
clearly not work correctly on a big-endian machine.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-05-14 12:12:22 -05:00
Sage Weil
8b39326900 crush: warn on do_rule failure
If we get an error code from crush_do_rule(), print an error to the
console.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:39:37 -07:00
Sage Weil
6eb43f4b5a crush: fix memory leak when destroying tree buckets
Reflects ceph.git commit 46d63d98434b3bc9dad2fc9ab23cbaedc3bcb0e4.

Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:39:36 -07:00
Sage Weil
f671d4cd9b crush: fix tree node weight lookup
Fix the node weight lookup for tree buckets by using a correct accessor.

Reflects ceph.git commit d287ade5bcbdca82a3aef145b92924cf1e856733.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:39:35 -07:00
Sage Weil
fc7c3ae5ab crush: remove parent maps
These were used for the ill-fated forcefeed feature.  Remove them.

Reflects ceph.git commit ebdf80edfecfbd5a842b71fbe5732857994380c1.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:39:34 -07:00
Sage Weil
41ebcc0907 crush: remove forcefeed functionality
Remove forcefeed functionality from CRUSH.  This is an ugly misfeature that
is mostly useless and unused.  Remove it.

Reflects ceph.git commit ed974b5000f2851207d860a651809af4a1867942.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>

Conflicts:

	net/ceph/crush/mapper.c
2012-05-07 15:39:29 -07:00
Sage Weil
0668216efe crush: use a temporary variable to simplify crush_do_rule
Use a temporary variable here to avoid repeated array lookups and clean up
the code a bit.

This reflects ceph.git commit 6b5be27634ad307b471a5bf0db85c4f5c834885f.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:38:51 -07:00
Sage Weil
a1f4895be8 crush: be more tolerant of nonsensical crush maps
If we get a map that doesn't make sense, error out or ignore the badness
instead of BUGging out.  This reflects the ceph.git commits
9895f0bff7dc68e9b49b572613d242315fb11b6c and
8ded26472058d5205803f244c2f33cb6cb10de79.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:38:51 -07:00
Sage Weil
c90f95ed46 crush: adjust local retry threshold
This small adjustment reflects a change that was made in ceph.git commit
af6a9f30696c900a2a8bd7ae24e8ed15fb4964bb, about 6 months ago.  An N-1
search is not exhaustive.  Fixed ceph.git bug #1594.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:38:50 -07:00
Sage Weil
8b12d47b80 crush: clean up types, const-ness
Move various types from int -> __u32 (or similar), and add const as
appropriate.

This reflects changes that have been present in the userland implementation
for some time.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:38:35 -07:00
Sage Weil
3469ac1aa3 ceph: drop support for preferred_osd pgs
This was an ill-conceived feature that has been removed from Ceph.  Do
this gracefully:

 - reject attempts to specify a preferred_osd via the ioctl
 - stop exposing this information via virtual xattrs
 - always fill in -1 for requests, in case we talk to an older server
 - don't calculate preferred_osd placements/pgids

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:33:36 -07:00
hartleys
feb50ac19e crush: include header for global symbols
Include the header to pickup the definitions of the global symbols.

Quiets the following sparse warnings:

warning: symbol 'crush_find_rule' was not declared. Should it be static?
warning: symbol 'crush_do_rule' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-27 00:03:34 -04:00
Eric Dumazet
95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
Alex Elder
8d63e318c4 libceph: isolate kmap() call in write_partial_msg_pages()
In write_partial_msg_pages(), every case now does an identical call
to kmap(page).  Instead, just call it once inside the CRC-computing
block where it's needed.  Move the definition of kaddr inside that
block, and make it a (char *) to ensure portable pointer arithmetic.

We still don't kunmap() it until after the sendpage() call, in case
that also ends up needing to use the mapping.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:52 -05:00
Alex Elder
9bd1966344 libceph: rename "page_shift" variable to something sensible
In write_partial_msg_pages() there is a local variable used to
track the starting offset within a bio segment to use.  Its name,
"page_shift" defies the Linux convention of using that name for
log-base-2(page size).

Since it's only used in the bio case rename it "bio_offset".  Use it
along with the page_pos field to compute the memory offset when
computing CRC's in that function.  This makes the bio case match the
others more closely.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:52 -05:00
Alex Elder
0cdf9e6018 libceph: get rid of zero_page_address
There's not a lot of benefit to zero_page_address, which basically
holds a mapping of the zero page through the life of the messenger
module.  Even with our own mapping, the sendpage interface where
it's used may need to kmap() it again.  It's almost certain to
be in low memory anyway.

So stop treating the zero page specially in write_partial_msg_pages()
and just get rid of zero_page_address entirely.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:52 -05:00
Alex Elder
e36b13cceb libceph: only call kernel_sendpage() via helper
Make ceph_tcp_sendpage() be the only place kernel_sendpage() is
used, by using this helper in write_partial_msg_pages().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:52 -05:00
Alex Elder
31739139f3 libceph: use kernel_sendpage() for sending zeroes
If a message queued for send gets revoked, zeroes are sent over the
wire instead of any unsent data.  This is done by constructing a
message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().

Since we are already working with a page in this case we can use
the sendpage interface instead.  Create a new ceph_tcp_sendpage()
helper that sets up flags to match the way ceph_tcp_sendmsg()
does now.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
37675b0f42 libceph: fix inverted crc option logic
CRC's are computed for all messages between ceph entities.  The CRC
computation for the data portion of message can optionally be
disabled using the "nocrc" (common) ceph option.  The default is
for CRC computation for the data portion to be enabled.

Unfortunately, the code that implements this feature interprets the
feature flag wrong, meaning that by default the CRC's have *not*
been computed (or checked) for the data portion of messages unless
the "nocrc" option was supplied.

Fix this, in write_partial_msg_pages() and read_partial_message().
Also change the flag variable in write_partial_msg_pages() to be
"no_datacrc" to match the usage elsewhere in the file.

This fixes http://tracker.newdream.net/issues/2064

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
84495f4961 libceph: some simple changes
Nothing too big here.
    - define the size of the buffer used for consuming ignored
      incoming data using a symbolic constant
    - simplify the condition determining whether to unmap the page
      in write_partial_msg_pages(): do it for crc but not if the
      page is the zero page

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
f42299e6c3 libceph: small refactor in write_partial_kvec()
Make a small change in the code that counts down kvecs consumed by
a ceph_tcp_sendmsg() call.  Same functionality, just blocked out
a little differently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
fe3ad593e2 libceph: do crc calculations outside loop
Move blocks of code out of loops in read_partial_message_section()
and read_partial_message().  They were only was getting called at
the end of the last iteration of the loop anyway.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
a9a0c51af4 libceph: separate CRC calculation from byte swapping
Calculate CRC in a separate step from rearranging the byte order
of the result, to improve clarity and readability.

Use offsetof() to determine the number of bytes to include in the
CRC calculation.

In read_partial_message(), switch which value gets byte-swapped,
since the just-computed CRC is already likely to be in a register.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
bca064d236 libceph: use "do" in CRC-related Boolean variables
Change the name (and type) of a few CRC-related Boolean local
variables so they contain the word "do", to distingish their purpose
from variables used for holding an actual CRC value.

Note that in the process of doing this I identified a fairly serious
logic error in write_partial_msg_pages():  the value of "do_crc"
assigned appears to be the opposite of what it should be.  No
attempt to fix this is made here; this change preserves the
erroneous behavior.  The problem I found is documented here:
    http://tracker.newdream.net/issues/2064

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
cffaba15cd ceph: ensure Boolean options support both senses
Many ceph-related Boolean options offer the ability to both enable
and disable a feature.  For all those that don't offer this, add
a new option so that they do.

Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
d3002b974c libceph: a few small changes
This gathers a number of very minor changes:
    - use %hu when formatting the a socket address's address family
    - null out the ceph_msgr_wq pointer after the queue has been
      destroyed
    - drop a needless cast in ceph_write_space()
    - add a WARN() call in ceph_state_change() in the event an
      unrecognized socket state is encountered
    - rearrange the logic in ceph_con_get() and ceph_con_put() so
      that:
        - the reference counts are only atomically read once
	- the values displayed via dout() calls are known to
	  be meaningful at the time they are formatted

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
41617d0c9c libceph: make ceph_tcp_connect() return int
There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller.  Instead, have it return an error code,
which tidies things up a bit.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
6173d1f02f libceph: encapsulate some messenger cleanup code
Define a helper function to perform various cleanup operations.  Use
it both in the exit routine and in the init routine in the event of
an error.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
e0f43c9419 libceph: make ceph_msgr_wq private
The messenger workqueue has no need to be public.  So give it static
scope.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
859eb79948 libceph: encapsulate connection kvec operations
Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array.  Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.

Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
963be4d770 libceph: move prepare_write_banner()
One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.

This was split out from the next patch to highlight this change in
logic.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
ee57741c52 rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder
99f0f3b2c4 ceph: eliminate some abusive casts
This fixes some spots where a type cast to (void *) was used as
as a universal type hiding mechanism.  Instead, properly cast the
type to the intended target type.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
bd40614512 ceph: eliminate some needless casts
This eliminates type casts in some places where they are not
required.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
f64a93172b ceph: kill addr_str_lock spinlock; use atomic instead
A spinlock is used to protect a value used for selecting an array
index for a string used for formatting a socket address for human
consumption.  The index is reset to 0 if it ever reaches the maximum
index value.

Instead, use an ever-increasing atomic variable as a sequence
number, and compute the array index by masking off all but the
sequence number's lowest bits.  Make the number of entries in the
array a power of two to allow the use of such a mask (to avoid jumps
in the index value when the sequence number wraps).

The length of these strings is somewhat arbitrarily set at 60 bytes.
The worst-case length of a string produced is 54 bytes, for an IPv6
address that can't be shortened, e.g.:
    [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
Change it so we arbitrarily use 64 bytes instead; if nothing else
it will make the array of these line up better in hex dumps.

Rename a few things to reinforce the distinction between the number
of strings in the array and the length of individual strings.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
a5bc3129a2 ceph: make use of "else" where appropriate
Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
re-testing a value with consecutive "if" statements.  Don't record a
connection's socket pointer unless the connect operation is
successful.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
5766651971 ceph: use a shared zero page rather than one per messenger
Each messenger allocates a page to be used when writing zeroes
out in the event of error or other abnormal condition.  Instead,
use the kernel ZERO_PAGE() for that purpose.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Xi Wang
6448669777 libceph: fix overflow check in crush_decode()
The existing overflow check (n > ULONG_MAX / b) didn't work, because
n = ULONG_MAX / b would both bypass the check and still overflow the
allocation size a + n * b.

The correct check should be (n > (ULONG_MAX - a) / b).

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Jim Schutt
182fac2689 net/ceph: Only clear SOCK_NOSPACE when there is sufficient space in the socket buffer
The Ceph messenger would sometimes queue multiple work items to write
data to a socket when the socket buffer was full.

Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the
same way that net/core/stream.c:sk_stream_write_space() does, i.e.,
clearing it only when sufficient space is available in the socket buffer.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:45 -05:00
Linus Torvalds
6c073a7ee2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix safety of rbd_put_client()
  rbd: fix a memory leak in rbd_get_client()
  ceph: create a new session lock to avoid lock inversion
  ceph: fix length validation in parse_reply_info()
  ceph: initialize client debugfs outside of monc->mutex
  ceph: change "ceph.layout" xattr to be "ceph.file.layout"
2012-02-02 15:47:33 -08:00
Sage Weil
ab434b60ab ceph: initialize client debugfs outside of monc->mutex
Initializing debufs under monc->mutex introduces a lock dependency for
sb->s_type->i_mutex_key, which (combined with several other dependencies)
leads to an annoying lockdep warning.  There's no particular reason to do
the debugfs setup under this lock, so move it out.

It used to be the case that our first monmap could come from the OSD; that
is no longer the case with recent servers, so we will reliably set up the
client entry during the initial authentication.

We don't have to worry about racing with debugfs teardown by
ceph_debugfs_client_cleanup() because ceph_destroy_client() calls
ceph_msgr_flush() first, which will wait for the message dispatch work
to complete (and the debugfs init to complete).

Fixes: #1940
Signed-off-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:01 -08:00
Sage Weil
56e925b677 libceph: remove useless return value for osd_client __send_request()
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:57:03 -08:00
Sage Weil
e11b05d31f crush: fix force for non-root TAKE
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:56:57 -08:00
Thomas Meyer
186482560f ceph: Use kmemdup rather than duplicating its implementation
Use kmemdup rather than duplicating its implementation

The semantic patch that makes this change is available
in scripts/coccinelle/api/memdup.cocci.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:56:54 -08:00
Linus Torvalds
653f42f6b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: add missing spin_unlock at ceph_mdsc_build_path()
  ceph: fix SEEK_CUR, SEEK_SET regression
  crush: fix mapping calculation when force argument doesn't exist
  ceph: use i_ceph_lock instead of i_lock
  rbd: remove buggy rollback functionality
  rbd: return an error when an invalid header is read
  ceph: fix rasize reporting by ceph_show_options
2011-12-13 14:59:42 -08:00
Sage Weil
f1932fc1a6 crush: fix mapping calculation when force argument doesn't exist
If the force argument isn't valid, we should continue calculating a
mapping as if it weren't specified.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-12 09:09:45 -08:00
Linus Torvalds
c292fe4aae Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: Allocate larger oid buffer in request msgs
  ceph: initialize root dentry
  ceph: fix iput race when queueing inode work
2011-11-21 12:11:13 -08:00
Stratos Psomadakis
224736d911 libceph: Allocate larger oid buffer in request msgs
ceph_osd_request struct allocates a 40-byte buffer for object names.
RBD image names can be up to 96 chars long (100 with the .rbd suffix),
which results in the object name for the image being truncated, and a
subsequent map failure.

Increase the oid buffer in request messages, in order to avoid the
truncation.

Signed-off-by: Stratos Psomadakis <psomas@grnet.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-11 09:50:19 -08:00
Paul Gortmaker
bc3b2d7fb9 net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:30 -04:00
Sage Weil
38d6453ca3 libceph: force resend of osd requests if we skip an osdmap
If we skip over one or more map epochs, we need to resend all osd requests
because it is possible they remapped to other servers and then back.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:17 -07:00
Noah Watkins
ee3b56f265 ceph: use kernel DNS resolver
Change ceph_parse_ips to take either names given as
IP addresses or standard hostnames (e.g. localhost).
The DNS lookup is done using the dns_resolver facility
similar to its use in AFS, NFS, and CIFS.

This patch defines CONFIG_CEPH_LIB_USE_DNS_RESOLVER
that controls if this feature is on or off.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Noah Watkins
49d9224c04 ceph: fix ceph_monc_init memory leak
failure clean up does not consider ceph_auth_init.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Sage Weil
f0ed1b7cef libceph: warn on msg allocation failures
Any non-masked msg allocation failure should generate a warning and stack
trace to the console.  All of these need to eventually be replaced by
safe preallocation or msgpools.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Sage Weil
b61c27636f libceph: don't complain on msgpool alloc failures
The pool allocation failures are masked by the pool; there is no need to
spam the console about them.  (That's the whole point of having the pool
in the first place.)

Mark msg allocations whose failure is safely handled as such.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
f6a2f5be07 libceph: always preallocate mon connection
Allocate the mon connection on init.  We already reuse it across
reconnects.  Remove now unnecessary (and incomplete) NULL checks.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
6ab00d465a libceph: create messenger with client
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Linus Torvalds
92bb062fe3 Merge branch 'for-linus' of git://github.com/NewDreamNetwork/ceph-client
* 'for-linus' of git://github.com/NewDreamNetwork/ceph-client:
  libceph: fix pg_temp mapping update
  libceph: fix pg_temp mapping calculation
  libceph: fix linger request requeuing
  libceph: fix parse options memory leak
  libceph: initialize ack_stamp to avoid unnecessary connection reset
2011-09-29 19:58:58 -07:00
Sage Weil
8adc8b3d78 libceph: fix pg_temp mapping update
The incremental map updates have a record for each pg_temp mapping that is
to be add/updated (len > 0) or removed (len == 0).  The old code was
written as if the updates were a complete enumeration; that was just wrong.
Update the code to remove 0-length entries and drop the rbtree traversal.

This avoids misdirected (and hung) requests that manifest as server
errors like

[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11

Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-28 10:13:35 -07:00
Sage Weil
782e182e91 libceph: fix pg_temp mapping calculation
We need to apply the modulo pg_num calculation before looking up a pgid in
the pg_temp mapping rbtree.  This fixes pg_temp mappings, and fixes
(some) misdirected requests that result in messages like

[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11

on the server and stall make the client block without getting a reply (at
least until the pg_temp mapping goes way, but that can take a long long
time).

Reorder calc_pg_raw() a bit to make more sense.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-28 10:13:31 -07:00
Sage Weil
935b639a04 libceph: fix linger request requeuing
The r_req_lru_item list node moves between several lists, and that cycle
is not directly related (and does not begin) with __register_request().
Initialize it in the request constructor, not __register_request(). This
fixes later badness (below) when OSDs restart underneath an rbd mount.

Crashes we've seen due to this include:

[  213.974288] kernel BUG at net/ceph/messenger.c:2193!

and

[  144.035274] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[  144.035278] IP: [<ffffffffa036c053>] con_work+0x1463/0x2ce0 [libceph]

Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-16 11:13:17 -07:00
Noah Watkins
1cad78932a libceph: fix parse options memory leak
ceph_destroy_options does not free opt->mon_addr that
is allocated in ceph_parse_options.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-16 09:19:53 -07:00
Jim Schutt
c0d5f9db1c libceph: initialize ack_stamp to avoid unnecessary connection reset
Commit 4cf9d54463 recorded when an outgoing ceph message was ACKed,
in order to avoid unnecessary connection resets when an OSD is busy.

However, ack_stamp is uninitialized, so there is a window between
when the message is sent and when it is ACKed in which handle_timeout()
interprets the unitialized value as an expired timeout, and resets
the connection unnecessarily.

Close the window by initializing ack_stamp.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-16 09:16:22 -07:00
Linus Torvalds
0d20fbbe82 Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client
* 'for-linus' of git://ceph.newdream.net/git/ceph-client:
  libceph: fix leak of osd structs during shutdown
  ceph: fix memory leak
  ceph: fix encoding of ino only (not relative) paths
  libceph: fix msgpool
2011-09-09 15:48:34 -07:00
Sage Weil
aca420bc51 libceph: fix leak of osd structs during shutdown
We want to remove all OSDs, not just those on the idle LRU.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-08-31 15:22:46 -07:00
Sage Weil
5185352c16 libceph: fix msgpool
There were several problems here:

 1- we weren't tagging allocations with the pool, so they were never
    returned to the pool.
 2- msgpool_put didn't add back to the mempool, even it were called.
 3- msgpool_release didn't clear the pool pointer, so it would have looped
    had #1 not been broken.

These may or may not have been responsible for #1136 or #1381 (BUG due to
non-empty mempool on umount).  I can't seem to trigger the crash now using
the method I was using before.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-08-09 15:26:17 -07:00
Linus Torvalds
ba5b56cb3e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  ceph: document unlocked d_parent accesses
  ceph: explicitly reference rename old_dentry parent dir in request
  ceph: document locking for ceph_set_dentry_offset
  ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
  ceph: protect d_parent access in ceph_d_revalidate
  ceph: protect access to d_parent
  ceph: handle racing calls to ceph_init_dentry
  ceph: set dir complete frag after adding capability
  rbd: set blk_queue request sizes to object size
  ceph: set up readahead size when rsize is not passed
  rbd: cancel watch request when releasing the device
  ceph: ignore lease mask
  ceph: fix ceph_lookup_open intent usage
  ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
  ceph: fix bad parent_inode calc in ceph_lookup_open
  ceph: avoid carrying Fw cap during write into page cache
  libceph: don't time out osd requests that haven't been received
  ceph: report f_bfree based on kb_avail rather than diffing.
  ceph: only queue capsnap if caps are dirty
  ceph: fix snap writeback when racing with writes
  ...
2011-07-26 13:38:50 -07:00
Sage Weil
4cf9d54463 libceph: don't time out osd requests that haven't been received
Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing).  Time out OSD
requests only if it's been too long since they've been received.

This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:27:24 -07:00
David S. Miller
033b1142f4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/bluetooth/l2cap_core.c
2011-07-21 13:38:42 -07:00
Sage Weil
38be7a79f7 ceph: fix file mode calculation
open(2) must always include one of O_RDONLY, O_WRONLY, or O_RDWR.  No need
for any O_APPEND special case.

Passing O_WRONLY|O_RDWR is undefined according to the man page, but the
Linux VFS interprets this as O_RDWR, so we'll do the same.

This fixes open(2) with flags O_RDWR|O_APPEND, which was incorrectly being
translated to readonly.

Reported-by: Fyodor Ustinov <ufm@ufm.su>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-19 11:25:04 -07:00
David S. Miller
6a7ebdf2fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/bluetooth/l2cap_core.c
2011-07-14 07:56:40 -07:00
David S. Miller
9f6ec8d697 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn-rxon.c
	drivers/net/wireless/rtlwifi/pci.c
	net/netfilter/ipvs/ip_vs_core.c
2011-06-20 22:29:08 -07:00
Joe Perches
ea11073387 net: Remove casts of void *
Unnecessary casts of void * clutter the code.

These are the remainder casts after several specific
patches to remove netdev_priv and dev_priv.

Done via coccinelle script:

$ cat cast_void_pointer.cocci
@@
type T;
T *pt;
void *pv;
@@

- pt = (T *)pv;
+ pt = pv;

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
2011-06-16 23:19:27 -04:00
Sage Weil
9bb0ce2b0b libceph: fix page calculation for non-page-aligned io
Set the page count correctly for non-page-aligned IO.  We were already
doing this correctly for alignment, but not the page count.  Fixes
DIRECT_IO writes from unaligned pages.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-13 16:26:17 -07:00
Sage Weil
2584547230 ceph: fix sync vs canceled write
If we cancel a write, trigger the safe completions to prevent a sync from
blocking indefinitely in ceph_osdc_sync().

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:13 -07:00
Sage Weil
cd634fb6ee libceph: subscribe to osdmap when cluster is full
When the cluster is marked full, subscribe to subsequent map updates to
ensure we find out promptly when it is no longer full.  This will prevent
us from spewing ENOSPC for (much) longer than necessary.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:11 -07:00
Sage Weil
7662d8ff57 libceph: handle new osdmap down/state change encoding
Old incrementals encode a 0 value (nearly always) when an osd goes down.
Change that to allow any state bit(s) to be flipped.  Special case 0 to
mean flip the CEPH_OSD_UP bit to mimic the old behavior.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:09 -07:00
Sage Weil
9d6fcb081a ceph: check return value for start_request in writepages
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil
a2a79609c0 libceph: add missing breaks in addr_set_port
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil
0417788226 libceph: fix TAG_WAIT case
If we get a WAIT as a client something went wrong; error out.  And don't
fall through to an unrelated case.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:04 -07:00
Sage Weil
31456665a0 libceph: fix osdmap timestamp assignment
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil
12a2f643b0 libceph: use snprintf for unknown addrs
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil
2dab036b8c libceph: use snprintf for formatting object name
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil
e8f54ce169 libceph: fix uninitialized value when no get_authorizer method is set
If there is no get_authorizer method we set the out_kvec to a bogus
pointer.  The length is also zero in that case, so it doesn't much matter,
but it's better not to add the empty item in the first place.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil
0da5d70369 libceph: handle connection reopen race with callbacks
If a connection is closed and/or reopened (ceph_con_close, ceph_con_open)
it can race with a callback.  con_work does various state checks for
closed or reopened sockets at the beginning, but drops con->mutex before
making callbacks.  We need to check for state bit changes after retaking
the lock to ensure we restart con_work and execute those CLOSED/OPENING
tests or else we may end up operating under stale assumptions.

In Jim's case, this was causing 'bad tag' errors.

There are four cases where we re-take the con->mutex inside con_work: catch
them all and return EAGAIN from try_{read,write} so that we can restart
con_work.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:21:05 -07:00
Sage Weil
4ad12621e4 libceph: fix ceph_osdc_alloc_request error checks
ceph_osdc_alloc_request returns NULL on failure.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:13 -07:00
Henry C Chang
ca20892db7 libceph: fix ceph_msg_new error path
If memory allocation failed, calling ceph_msg_put() will cause GPF
since some of ceph_msg variables are not initialized first.

Fix Bug #970.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:11 -07:00
Linus Torvalds
e6d2831834 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: fix linger request requeueing
2011-04-14 19:02:55 -07:00
Linus Torvalds
42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Sage Weil
77f38e0eea libceph: fix linger request requeueing
Fix the request transition from linger -> normal request.  The key is to
preserve r_osd and requeue on the same OSD.  Reregister as a normal request,
add the request to the proper queues, then unregister the linger.  Fix the
unregister helper to avoid clearing r_osd (and also simplify the parallel
check in __unregister_request()).

Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-04-06 09:09:16 -07:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Tommi Virtanen
4b2a58abd1 libceph: Create a new key type "ceph".
This allows us to use existence of the key type as a feature test,
from userspace.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:24 -07:00
Tommi Virtanen
e2c3d29b42 libceph: Get secret from the kernel keys api when mounting with key=NAME.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:19 -07:00
Tommi Virtanen
8323c3aa74 ceph: Move secret key parsing earlier.
This makes the base64 logic be contained in mount option parsing,
and prepares us for replacing the homebew key management with the
kernel key retention service.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:16 -07:00
Sage Weil
fbdb919048 libceph: fix null dereference when unregistering linger requests
We should only clear r_osd if we are neither registered as a linger or a
regular request.  We may unregister as a linger while still registered as
a regular request (e.g., in reset_osd).  Incorrectly clearing r_osd there
leads to a null pointer dereference in __send_request.

Also simplify the parallel check in __unregister_request() where we just
removed r_osd_item and know it's empty.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:06 -07:00
Dan Carpenter
234af26ff1 ceph: unlock on error in ceph_osdc_start_request()
There was a missing unlock on the error path if __map_request() failed.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 08:59:54 -07:00
Mariusz Kozlowski
6b0ae4097c ceph: fix possible NULL pointer dereference
This patch fixes 'event_work' dereference before it is checked for NULL.

Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-26 13:41:20 -07:00
Sage Weil
ef550f6f4f ceph: flush msgr_wq during mds_client shutdown
The release method for mds connections uses a backpointer to the
mds_client, so we need to flush the workqueue of any pending work (and
ceph_connection references) prior to freeing the mds_client.  This fixes
an oops easily triggered under UML by

 while true ; do mount ... ; umount ... ; done

Also fix an outdated comment: the flush in ceph_destroy_client only flushes
OSD connections out.  This bug is basically an artifact of the ceph ->
ceph+libceph conversion.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-25 13:27:48 -07:00
Yehuda Sadeh
a40c4f10e3 libceph: add lingering request and watch/notify event framework
Lingering requests are requests that are sent to the OSD normally but
tracked also after we get a successful request.  This keeps the OSD
connection open and resends the original request if the object moves to
another OSD.  The OSD can then send notification messages back to us
if another client initiates a notify.

This framework will be used by RBD so that the client gets notification
when a snapshot is created by another node or tool.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-22 11:33:55 -07:00
Sage Weil
6f6c700675 libceph: fix osd request queuing on osdmap updates
If we send a request to osd A, and the request's pg remaps to osd B and
then back to A in quick succession, we need to resend the request to A. The
old code was only calling kick_requests after processing all incremental
maps in a message, so it was very possible to not resend a request that
needed to be resent.  This would make the osd eventually time out (at least
with the current default of osd timeouts enabled).

The correct approach is to scan requests on every map incremental.  This
patch refactors the kick code in a few ways:
 - all requests are either on req_lru (in flight), req_unsent (ready to
   send), or req_notarget (currently map to no up osd)
 - mapping always done by map_request (previous map_osds)
 - if the mapping changes, we requeue.  requests are resent only after all
   map incrementals are processed.
 - some osd reset code is moved out of kick_requests into a separate
   function
 - the "kick this osd" functionality is moved to kick_osd_requests, as it
   is unrelated to scanning for request->pg->osd mapping changes

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:19 -07:00
Tommi Virtanen
b09734b1f4 libceph: Fix base64-decoding when input ends in newline.
It used to return -EINVAL because it thought the end was not aligned
to 4 bytes.

Clean up superfluous src < end test in if, the while itself guarantees
that.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-15 09:14:02 -07:00
Sage Weil
e00de341fd libceph: fix msgr standby handling
The standby logic used to be pretty dependent on the work requeueing
behavior that changed when we switched to WQ_NON_REENTRANT.  It was also
very fragile.

Restructure things so that:
 - We clear WRITE_PENDING when we set STANDBY.  This ensures we will
   requeue work when we wake up later.
 - con_work backs off if STANDBY is set.  There is nothing to do if we are
   in standby.
 - clear_standby() helper is called by both con_send() and con_keepalive(),
   the two actions that can wake us up again.  Move the connect_seq++
   logic here.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:25:05 -08:00
Sage Weil
e76661d0a5 libceph: fix msgr keepalive flag
There was some broken keepalive code using a dead variable.  Shift to using
the proper bit flag.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:24:31 -08:00
Sage Weil
60bf8bf881 libceph: fix msgr backoff
With commit f363e45f we replaced a bunch of hacky workqueue mutual
exclusion logic with the WQ_NON_REENTRANT flag.  One pieces of fallout is
that the exponential backoff breaks in certain cases:

 * con_work attempts to connect.
 * we get an immediate failure, and the socket state change handler queues
   immediate work.
 * con_work calls con_fault, we decide to back off, but can't queue delayed
   work.

In this case, we add a BACKOFF bit to make con_work reschedule delayed work
next time it runs (which should be immediately).

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:24:28 -08:00
Sage Weil
692d20f576 libceph: retry after authorization failure
If we mark the connection CLOSED we will give up trying to reconnect to
this server instance.  That is appropriate for things like a protocol
version mismatch that won't change until the server is restarted, at which
point we'll get a new addr and reconnect.  An authorization failure like
this is probably due to the server not properly rotating it's secret keys,
however, and should be treated as transient so that the normal backoff and
retry behavior kicks in.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 13:47:40 -08:00
Sage Weil
38815b7802 libceph: fix handling of short returns from get_user_pages
get_user_pages() can return fewer pages than we ask for.  We were returning
a bogus pointer/error code in that case.  Instead, loop until we get all
the pages we want or get an error we can return to the caller.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 13:47:39 -08:00
Linus Torvalds
8bd89ca220 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: keep reference to parent inode on ceph_dentry
  ceph: queue cap_snaps once per realm
  libceph: fix socket write error handling
  libceph: fix socket read error handling
2011-02-21 15:01:38 -08:00
Sage Weil
42961d2333 libceph: fix socket write error handling
Pass errors from writing to the socket up the stack.  If we get -EAGAIN,
return 0 from the helper to simplify the callers' checks.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25 08:19:34 -08:00
Sage Weil
98bdb0aa00 libceph: fix socket read error handling
If we get EAGAIN when trying to read from the socket, it is not an error.
Return 0 from the helper in this case to simplify the error handling cases
in the caller (indirectly, try_read).

Fix try_read to pass any error to it's caller (con_work) instead of almost
always returning 0.  This let's us respond to things like socket
disconnects.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25 08:17:48 -08:00
Linus Torvalds
a170315420 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup when trying to mount inexistent image
  net/ceph: make ceph_msgr_wq non-reentrant
  ceph: fsc->*_wq's aren't used in memory reclaim path
  ceph: Always free allocated memory in osdmap_decode()
  ceph: Makefile: Remove unnessary code
  ceph: associate requests with opening sessions
  ceph: drop redundant r_mds field
  ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
  ceph: add dir_layout to inode
2011-01-13 10:25:24 -08:00
Tejun Heo
f363e45fd1 net/ceph: make ceph_msgr_wq non-reentrant
ceph messenger code does a rather complex dancing around multithread
workqueue to make sure the same work item isn't executed concurrently
on different CPUs.  This restriction can be provided by workqueue with
WQ_NON_REENTRANT.

Make ceph_msgr_wq non-reentrant workqueue with the default concurrency
level and remove the QUEUED/BUSY logic.

* This removes backoff handling in con_work() but it couldn't reliably
  block execution of con_work() to begin with - queue_con() can be
  called after the work started but before BUSY is set.  It seems that
  it was an optimization for a rather cold path and can be safely
  removed.

* The number of concurrent work items is bound by the number of
  connections and connetions are independent from each other.  With
  the default concurrency level, different connections will be
  executed independently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Jesper Juhl
b0aee3516d ceph: Always free allocated memory in osdmap_decode()
Always free memory allocated to 'pi' in
net/ceph/osdmap.c::osdmap_decode().

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Sage Weil
6c0f3af72c ceph: add dir_layout to inode
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function.  This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:12 -08:00
David S. Miller
17f7f4d9fc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/ipv4/fib_frontend.c
2010-12-26 22:37:05 -08:00
Linus Torvalds
9d5004fcf6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: handle partial result from get_user_pages
  ceph: mark user pages dirty on direct-io reads
  ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
  ceph: fix direct-io on non-page-aligned buffers
  ceph: fix msgr_init error path
2010-12-20 21:32:20 -08:00
Henry C Chang
361cf40519 ceph: handle partial result from get_user_pages
The get_user_pages() helper can return fewer than the requested pages.
Error out in that case, and clean up the partial result.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:55:59 -08:00
Henry C Chang
b6aa5901c7 ceph: mark user pages dirty on direct-io reads
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:54:40 -08:00
Sage Weil
d96c9043d1 ceph: fix msgr_init error path
create_workqueue() returns NULL on failure.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-13 20:30:28 -08:00
David S. Miller
fe6c791570 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/ath/ath9k/ar9003_eeprom.c
	net/llc/af_llc.c
2010-12-08 13:47:38 -08:00
Linus Torvalds
a01af8e4a4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  af_unix: limit recursion level
  pch_gbe driver: The wrong of initializer entry
  pch_gbe dreiver: chang author
  ucc_geth: fix ucc halt problem in half duplex mode
  inet: Fix __inet_inherit_port() to correctly increment bsockets and num_owners
  ehea: Add some info messages and fix an issue
  hso: fix disable_net
  NET: wan/x25_asy, move lapb_unregister to x25_asy_close_tty
  cxgb4vf: fix setting unicast/multicast addresses ...
  net, ppp: Report correct error code if unit allocation failed
  DECnet: don't leak uninitialized stack byte
  au1000_eth: fix invalid address accessing the MAC enable register
  dccp: fix error in updating the GAR
  tcp: restrict net.ipv4.tcp_adv_win_scale (#20312)
  netns: Don't leak others' openreq-s in proc
  Net: ceph: Makefile: Remove unnessary code
  vhost/net: fix rcu check usage
  econet: fix CVE-2010-3848
  econet: fix CVE-2010-3850
  econet: disallow NULL remote addr for sendmsg(), fixes CVE-2010-3849
  ...
2010-11-29 14:36:33 -08:00
Tracey Dent
4cb6a614ba Net: ceph: Makefile: Remove unnessary code
Remove the if and else conditional because the code is in mainline and there
is no need in it being there.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-27 17:39:29 -08:00
Linus Torvalds
3cbaa0f7a7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  of/phylib: Use device tree properties to initialize Marvell PHYs.
  phylib: Add support for Marvell 88E1149R devices.
  phylib: Use common page register definition for Marvell PHYs.
  qlge: Fix incorrect usage of module parameters and netdev msg level
  ipv6: fix missing in6_ifa_put in addrconf
  SuperH IrDA: correct Baud rate error correction
  atl1c: Fix hardware type check for enabling OTP CLK
  net: allow GFP_HIGHMEM in __vmalloc()
  bonding: change list contact to netdev@vger.kernel.org
  e1000: fix screaming IRQ
2010-11-24 08:22:34 +09:00
Tracey Dent
fa13bc3daa Net: ceph: Makefile: remove deprecated kbuild goal definitions
Changed Makefile to use <modules>-y instead of <modules>-objs
because -objs is deprecated and not mentioned in
Documentation/kbuild/makefiles.txt.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-22 08:16:10 -08:00
Eric Dumazet
7a1c8e5ab1 net: allow GFP_HIGHMEM in __vmalloc()
We forgot to use __GFP_HIGHMEM in several __vmalloc() calls.

In ceph, add the missing flag.

In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is
cleaner and allows using HIGHMEM pages as well.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-21 10:04:04 -08:00
Sage Weil
c5c6b19d4b ceph: explicitly specify page alignment in network messages
The alignment used for reading data into or out of pages used to be taken
from the data_off field in the message header.  This only worked as long
as the page alignment matched the object offset, breaking direct io to
non-page aligned offsets.

Instead, explicitly specify the page alignment next to the page vector
in the ceph_msg struct, and use that instead of the message header (which
probably shouldn't be trusted).  The alloc_msg callback is responsible for
filling in this field properly when it sets up the page vector.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:43:17 -08:00
Sage Weil
b7495fc2ff ceph: make page alignment explicit in osd interface
We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched.  This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO).  We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.

Explicitly specify the page alignment when setting up OSD IO requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:43:12 -08:00
Sage Weil
e98b6fed84 ceph: fix comment, remove extraneous args
The offset/length arguments aren't used.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:24:53 -08:00
Sage Weil
df9f86faf3 ceph: fix small seq message skipping
If the client gets out of sync with the server message sequence number, we
normally skip low seq messages (ones we already received).  The skip code
was also incrementing the expected seq, such that all subsequent messages
also appeared old and got skipped, and an eventual timeout on the osd
connection.  This resulted in some lagging requests and console messages
like

[233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017
[233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018
[233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019
[233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020
[233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021
[233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022
[233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023
[233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024
[233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025
[233500.534614] ceph:  tid 6023602 timed out on osd22, will reset osd

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-01 15:49:23 -07:00
Sage Weil
240634e9b3 ceph: fix num_pages_free accounting in pagelist
Decrement the free page counter when removing a page from the free_list.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:23 -07:00
Yehuda Sadeh
010e3b48fc ceph: don't crash when passed bad mount options
This only happened when parse_extra_token was not passed
to ceph_parse_option() (hence, only happened in rbd).

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:38:22 -07:00
Greg Farnum
ac0b74d8a1 ceph: add pagelist_reserve, pagelist_truncate, pagelist_set_cursor
These facilitate preallocation of pages so that we can encode into the pagelist
in an atomic context.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:16 -07:00
Yehuda Sadeh
602adf4002 rbd: introduce rados block device (rbd), based on libceph
The rados block device (rbd), based on osdblk, creates a block device
that is backed by objects stored in the Ceph distributed object storage
cluster.  Each device consists of a single metadata object and data
striped over many data objects.

The rbd driver supports read-only snapshots.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:13 -07:00
Yehuda Sadeh
3d14c5d2b6 ceph: factor out libceph from Ceph file system
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph.  This
is mostly a matter of moving files around.  However, a few key pieces
of the interface change as well:

 - ceph_client becomes ceph_fs_client and ceph_client, where the latter
   captures the mon and osd clients, and the fs_client gets the mds client
   and file system specific pieces.
 - Mount option parsing and debugfs setup is correspondingly broken into
   two pieces.
 - The mon client gets a generic handler callback for otherwise unknown
   messages (mds map, in this case).
 - The basic supported/required feature bits can be expanded (and are by
   ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:28 -07:00