An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.
The vfs, on the other hand, wants null-terminated data.
The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.
The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.
But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.
Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.
In the NFSv2/v3 cases this actually appears to be safe:
- nfs3svc_decode_symlinkargs explicitly null-terminates the data
(after first checking its length and copying it to a new
page).
- NFSv2 limits symlinks to 1k. The buffer holding the rpc
request is always at least a page, and the link data (and
previous fields) have maximum lengths that prevent the request
from reaching the end of a page.
In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.
The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though. It should
really either do the copy itself every time or just require a
null-terminated string.
Reported-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Introduced by commit 561f0ed498 (nfsd4: allow large readdirs).
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
XDR requires 4-byte alignment; nfs4d READLINK reply writes out the padding,
but truncates the packet to the padding-less size.
Fix by taking the padding into consideration when truncating the packet.
Symptoms:
# ll /mnt/
ls: cannot read symbolic link /mnt/test: Input/output error
total 4
-rw-r--r--. 1 root root 0 Jun 14 01:21 123456
lrwxrwxrwx. 1 root root 6 Jul 2 03:33 test
drwxr-xr-x. 1 root root 0 Jul 2 23:50 tmp
drwxr-xr-x. 1 root root 60 Jul 2 23:44 tree
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Fixes: 476a7b1f4b (nfsd4: don't treat readlink like a zero-copy operation)
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.
The vfs, on the other hand, wants null-terminated data.
The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.
The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.
But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.
Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.
In the NFSv2/v3 cases this actually appears to be safe:
- nfs3svc_decode_symlinkargs explicitly null-terminates the data
(after first checking its length and copying it to a new
page).
- NFSv2 limits symlinks to 1k. The buffer holding the rpc
request is always at least a page, and the link data (and
previous fields) have maximum lengths that prevent the request
from reaching the end of a page.
In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.
The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though. It should
really either do the copy itself every time or just require a
null-terminated string.
Reported-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently rpc_pton() fails to handle the case where you echo an address
into the file, as it barfs on the newline. Ensure that we NULL out the
first occurrence of any newline.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
AFAICT, the only way to hit this error is to pass this function a bogus
"who" value. In that case, we probably don't want to return -1 as that
could get sent back to the client. Turn this into nfserr_serverfault,
which is a more appropriate error for a server bug like this.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The filehandle structs all use host-endian values, but will sometimes
stuff big-endian values into those fields. This is OK since these
values are opaque to the client, but it confuses sparse. Add __force to
make it clear that we are doing this intentionally.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The callers expect a __be32 return and the functions they call return
__be32, so having these return int is just wrong. Also, nfsd_finish_read
can be made static.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We currently hash the XID to determine a hash bucket to use for the
reply cache entry, which is fed into hash_32 without byte-swapping it.
Add __force to make sparse happy, and add some comments to explain
why.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
sparse complains that we're stuffing non-byte-swapped values into
__be32's here. Since they're supposed to be opaque, it doesn't matter
much. Just add __force to make sparse happy.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Don't using cache_get besides export.h, using exp_get for export.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
rq_usedeferral and rq_splice_ok are used as 0 and 1, just defined to bool.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 561f0ed498 (nfsd4: allow large readdirs) introduces a bug
about readdir the root of pseudofs.
Call xdr_truncate_encode() revert encoded name when skipping.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If nfsd needs to recall a delegation for some reason it implies that there is
contention on the file, so further delegations should not be handed out.
The current code fails to do so, and the result is effectively a
live-lock under some workloads: a client attempting a conflicting
operation on a read-delegated file receives NFS4ERR_DELAY and retries
the operation, but by the time it retries the server may already have
given out another delegation.
We could simply avoid delegations for (say) 30 seconds after any recall, but
this is probably too heavy handed.
We could keep a list of inodes (or inode numbers or filehandles) for recalled
delegations, but that requires memory allocation and searching.
The approach taken here is to use a bloom filter to record the filehandles
which are currently blocked from delegation, and to accept the cost of a few
false positives.
We have 2 bloom filters, each of which is valid for 30 seconds. When a
delegation is recalled the filehandle is added to one filter and will remain
disabled for between 30 and 60 seconds.
We keep a count of the number of filehandles that have been added, so when
that count is zero we can bypass all other tests.
The bloom filters have 256 bits and 3 hash functions. This should allow a
couple of dozen blocked filehandles with minimal false positives. If many
more filehandles are all blocked at once, behaviour will degrade towards
rejecting all delegations for between 30 and 60 seconds, then resetting and
allowing new delegations.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
27b11428b7 ("nfsd4: remove lockowner when removing lock stateid")
introduced a memory leak.
Cc: stable@vger.kernel.org
Reported-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently, the DRC cache pruner will stop scanning the list when it
hits an entry that is RC_INPROG. It's possible however for a call to
take a *very* long time. In that case, we don't want it to block other
entries from being pruned if they are expired or we need to trim the
cache to get back under the limit.
Fix the DRC cache pruner to just ignore RC_INPROG entries.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
While we're here, let's kill off a couple of the read-side macros.
Leaving the more complicated ones alone for now.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The rpc code makes available to the NFS server an array of pages to
encod into. The server represents its reply as an xdr buf, with the
head pointing into the first page in that array, the pages ** array
starting just after that, and the tail (if any) sharing any leftover
space in the page used by the head.
While encoding, we use xdr_stream->page_ptr to keep track of which page
we're currently using.
Currently we set xdr_stream->page_ptr to buf->pages, which makes the
head a weird exception to the rule that page_ptr always points to the
page we're currently encoding into. So, instead set it to buf->pages -
1 (the page actually containing the head), and remove the need for a
little unintuitive logic in xdr_get_next_encode_buffer() and
xdr_truncate_encode.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't want the stateid to be found in the hash table before the delegation
is granted.
Currently this is protected by the client_mutex, but we want to break that
up and this is a necessary step toward that goal.
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...as the name is a bit more descriptive and we've started using it for
other purposes.
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The memset of resp in svc_process_common should ensure that these are
already zeroed by the time they get here.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In the NFS4_OPEN_CLAIM_PREVIOUS case, we should only mark it confirmed
if the nfs4_check_open_reclaim check succeeds.
In the NFS4_OPEN_CLAIM_DELEG_PREV_FH and NFS4_OPEN_CLAIM_DELEGATE_PREV
cases, I see no point in declaring the openowner confirmed when the
operation is going to fail anyway, and doing so might allow the client
to game things such that it wouldn't need to confirm a subsequent open
with the same owner.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes a bug in the handling of the fi_delegations list.
nfs4_setlease does not hold the recall_lock when adding to it. The
client_mutex is held, which prevents against concurrent list changes,
but nfsd_break_deleg_cb does not hold while walking it. New delegations
could theoretically creep onto the list while we're walking it there.
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The laundromat uses two variables to calculate when it should next run,
but one is completely ignored at the end of the run. Merge the two and
rename the variable to be more descriptive of what it does.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
sparse says:
CHECK fs/nfsd/nfs4xdr.c
fs/nfsd/nfs4xdr.c:2043:1: warning: symbol 'nfsd4_encode_fattr' was not declared. Should it be static?
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When debugging, rpc prints messages from dprintk(KERN_WARNING ...)
with "^A4" prefixed,
[ 2780.339988] ^A4nfsd: connect from unprivileged port: 127.0.0.1, port=35316
Trond tells,
> dprintk != printk. We have NEVER supported dprintk(KERN_WARNING...)
This patch removes using of dprintk with KERN_WARNING.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Note nobody's ever noticed because the typical client probably never
requests FILES_AVAIL without also requesting something else on the list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
ex_nflavors can't be negative number, just defined by uint32_t.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
No need for a typedef wrapper for svc_export or svc_client, remove them.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 49b28684fd ("nfsd: Remove deprecated nfsctl system call and
related code") removed the only use of ipv6_addr_set_v4mapped(), so
net/ipv6.h is unneeded now.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 8f6c5ffc89 ("kernel/groups.c: remove return value of
set_groups") removed the last use of "ret".
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After commit 4c1e1b34d5 ("nfsd: Store ex_anon_uid and ex_anon_gid as
kuids and kgids") using kuid/kgid for ex_anon_uid/ex_anon_gid,
user_namespace.h is not needed.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If fsloc_parse() failed at kzalloc(), fs/nfsd/export.c
411
412 fsloc->locations = kzalloc(fsloc->locations_count
413 * sizeof(struct nfsd4_fs_location), GFP_KERNEL);
414 if (!fsloc->locations)
415 return -ENOMEM;
svc_export_parse() will call nfsd4_fslocs_free() with fsloc->locations = NULL,
so that, "kfree(fsloc->locations[i].path);" will cause a crash.
If fsloc_parse() failed after that, fsloc_parse() will call nfsd4_fslocs_free(),
and svc_export_parse() will call it again, so that, a double free is caused.
This patch checks the fsloc->locations, and set to NULL after it be freed.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
RPC_MAX_AUTH_SIZE is scattered around several places. Better to set it
once in the auth code, where this kind of estimate should be made. And
while we're at it we can leave it zero when we're not using krb5i or
krb5p.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
And switch a couple other functions from the encode(&p,...) convention
to the p = encode(p,...) convention mostly used elsewhere.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
encode_getattr, for example, can return nfserr_resource to indicate it
ran out of buffer space. That's not a legal error in the 4.1 case.
And in the 4.1 case, if we ran out of buffer space, we should have
exceeded a session limit too.
(Note in 1bc49d83c3 "nfsd4: fix
nfs4err_resource in 4.1 case" we originally tried fixing this error
return before fixing the problem that we could error out while we still
had lots of available space. The result was to trade one illegal error
for another in those cases. We decided that was helpful, so reverted
the change in fc208d026b, and are only
reinstating it now that we've elimited almost all of those cases.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I'm not sure why a client would want to stuff multiple reads in a
single compound rpc, but it's legal for them to do it, and we should
really support it.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The splice and readv cases are actually quite different--for example the
former case ignores the array of vectors we build up for the latter.
It is probably clearer to separate the two cases entirely.
There's some code duplication between the split out encoders, but this
is only temporary and will be fixed by a later patch.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We currently allow only one read per compound, with operations before
and after whose responses will require no more than about a page to
encode.
While we don't expect clients to violate those limits any time soon,
this limitation isn't really condoned by the spec, so to future proof
the server we should lift the limitation.
At the same time we'd like to continue to support zero-copy reads.
Supporting multiple zero-copy-reads per compound would require a new
data structure to replace struct xdr_buf, which can represent only one
set of included pages.
So for now we plan to modify encode_read() to support either zero-copy
or non-zero-copy reads, and use some heuristics at the start of the
compound processing to decide whether a zero-copy read will work.
This will allow us to support more exotic compounds without introducing
a performance regression in the normal case.
Later patches handle those "exotic compounds", this one just makes sure
zero-copy is turned off in those cases.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We plan to use this estimate to decide whether or not to allow zero-copy
reads. Currently we're assuming all getattr's are a page, which can be
both too small (ACLs e.g. may be arbitrarily long) and too large (after
an upcoming read patch this will unnecessarily prevent zero copy reads
in any read compound also containing a getattr).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's no advantage to this zero-copy-style readlink encoding, and it
unnecessarily limits the kinds of compounds we can handle. (In practice
I can't see why a client would want e.g. multiple readlink calls in a
comound, but it's probably a spec violation for us not to handle it.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As long as we're here, let's enforce the protocol's limit on the number
of directory entries to return in a readdir.
I don't think anyone's ever noticed our lack of enforcement, but maybe
there's more of a chance they will now that we allow larger readdirs.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we limit readdir results to a single page. This can result in
a performance regression compared to NFSv3 when reading large
directories.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once we know the limits the session places on the size of the rpc, we
can also use that information to release any unnecessary reserved reply
buffer space.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We can simplify session limit enforcement by restricting the xdr buflen
to the session size.
Also fix a preexisting bug: we should really have been taking into
account the auth-required space when comparing against session limits,
which are limits on the size of the entire rpc reply, including any krb5
overhead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't necessarily want to assume that the buflen is the same
as the number of bytes available in the pages. We may have some reason
to set it to something less (for example, later patches will use a
smaller buflen to enforce session limits).
So, calculate the buflen relative to the previous buflen instead of
recalculating it from scratch.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It will turn out to be useful to have a more accurate estimate of reply
size; so, piggyback on the existing op reply-size estimators.
Also move nfsd4_max_reply to nfs4proc.c to get easier access to struct
nfsd4_operation and friends. (Thanks to Christoph Hellwig for pointing
out that simplification.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I ran into this corner case in testing: in theory clients can provide
state owners up to 1024 bytes long. In the sessions case there might be
a risk of this pushing us over the DRC slot size.
The conflicting owner isn't really that important, so let's humor a
client that provides a small maxresponsize_cached by allowing ourselves
to return without the conflicting owner instead of outright failing the
operation.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Limits on maxresp_sz mean that we only ever need to replay rpc's that
are contained entirely in the head.
The one exception is very small zero-copy reads. That's an odd corner
case as clients wouldn't normally ask those to be cached.
in any case, this seems a little more robust.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After this we can handle for example getattr of very large ACLs.
Read, readdir, readlink are still special cases with their own limits.
Also we can't handle a new operation starting close to the end of a
page.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that all op encoders can handle running out of space, we no longer
need to check the remaining size for every operation; only nonidempotent
operations need that check, and that can be done by
nfsd4_check_resp_size.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once we've included page-cache pages in the encoding it's difficult to
remove them and restart encoding. (xdr_truncate_encode doesn't handle
that case.) So, make sure we'll have adequate space to finish the
operation first.
For now COMPOUND_SLACK_SPACE checks should prevent this case happening,
but we want to remove those checks.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We've tried to prevent running out of space with COMPOUND_SLACK_SPACE
and special checking in those operations (getattr) whose result can vary
enormously.
However:
- COMPOUND_SLACK_SPACE may be difficult to maintain as we add
more protocol.
- BUG_ON or page faulting on failure seems overly fragile.
- Especially in the 4.1 case, we prefer not to fail compounds
just because the returned result came *close* to session
limits. (Though perfect enforcement here may be difficult.)
- I'd prefer encoding to be uniform for all encoders instead of
having special exceptions for encoders containing, for
example, attributes.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Normally xdr encoding proceeds in a single pass from start of a buffer
to end, but sometimes we have to write a few bytes to an earlier
position.
Use write_bytes_to_xdr_buf for these cases rather than saving a pointer
to write to. We plan to rewrite xdr_reserve_space to handle encoding
across page boundaries using a scratch buffer, and don't want to risk
writing to a pointer that was contained in a scratch buffer.
Also it will no longer be safe to calculate lengths by subtracting two
pointers, so use xdr_buf offsets instead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
xdr_reserve_space should now be calculating the length correctly as we
go, so there's no longer any need to fix it up here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is a cosmetic change for now; no change in behavior.
Note we're just depending on xdr_reserve_space to do the bounds checking
for us, we're not really depending on its adjustment of iovec or xdr_buf
lengths yet, as those are fixed up by as necessary after the fact by
read-link operations and by nfs4svc_encode_compoundres. However we do
have to update xdr->iov on read-like operations to prevent
xdr_reserve_space from messing with the already-fixed-up length of the
the head.
When the attribute encoding fails partway through we have to undo the
length adjustments made so far. We do it manually for now, but later
patches will add an xdr_truncate_encode() helper to handle cases like
this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This post-encoding check should be taking into account the need to
encode at least an out-of-space error to the following op (if any).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If nfsd4_check_resp_size() returns an error then we should really be
truncating the reply here, otherwise we may leave extra garbage at the
end of the rpc reply.
Also add a warning to catch any cases where our reply-size estimates may
be wrong in the case of a non-idempotent operation.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently if the nfs-level part of a reply would be too large, we'll
return an error to the client. But if the nfs-level part fits and
leaves no room for krb5p or krb5i stuff, then we just drop the request
entirely.
That's no good. Instead, reserve some slack space at the end of the
buffer and make sure we fail outright if we'd come close.
The slack space here is a massive overstimate of what's required, we
should probably try for a tighter limit at some point.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Just change the nfsd4_encode_getattr api. Not changing any code or
adding any new functionality yet.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is a mechanical transformation with no change in behavior.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently a non-idempotent op reply may be cached if it fails in the
proc code but not if it fails at xdr decoding. I doubt there are any
xdr-decoding-time errors that would make this a problem in practice, so
this probably isn't a serious bug.
The space estimates should also take into account space required for
encoding of error returns. Again, not a practical problem, though it
would become one after future patches which will tighten the space
estimates.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The client is actually asking for 2532 bytes. I suspect that's a
mistake. But maybe we can allow some more. In theory lock needs more
if it might return a maximum-length lockowner in the denied case.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
OP_MODIFIES_SOMETHING flags operations that we should be careful not to
initiate without being sure we have the buffer space to encode a reply.
None of these ops fall into that category.
We could probably remove a few more, but this isn't a very important
problem at least for ops whose reply size is easy to estimate.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
and live-locks while writing to the page cache in a loop-back
NFS mount situation.
It therefore makes sense to *only* set PF_LESS_THROTTLE in this
situation.
We now know when a request came from the local-host so it could be a
loop-back mount. We already know when we are handling write requests,
and when we are doing anything else.
So combine those two to allow nfsd to still be throttled (like any
other process) in every situation except when it is known to be
problematic.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
No need for a kmem_cache_destroy wrapper in nfsd, just do proper
goto based unwinding.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Assignments should not happen inside an if conditional, but in the line
before. This issue was reported by checkpatch.
The semantic patch that makes this change is as follows
(http://coccinelle.lip6.fr/):
// <smpl>
@@
identifier i1;
expression e1;
statement S;
@@
-if(!(i1 = e1)) S
+i1 = e1;
+if(!i1)
+S
// </smpl>
It has been tested by compilation.
Signed-off-by: Benoit Taine <benoit.taine@lip6.fr>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We're not cleaning up everything we need to on error. In particular,
we're not removing our lease. Among other problems this can cause the
struct nfs4_file used as fl_owner to be referenced after it has been
destroyed.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We're clearing the SUID/SGID bits on write by hand in nfsd_vfs_write,
even though the subsequent vfs_writev() call will end up doing this for
us (through file system write methods eventually calling
file_remove_suid(), e.g., from __generic_file_aio_write).
So, remove the redundant nfsd code.
The only change in behavior is when the write is by root, in which case
we previously cleared SUID/SGID, but will now leave it alone. The new
behavior is the behavior of every filesystem we've checked.
It seems better to be consistent with local filesystem behavior. And
the security advantage seems limited as root could always restore these
bits by hand if it wanted.
SUID/SGID is not cleared after writing data with (root, local ext4),
File: ‘test’
Size: 0 Blocks: 0 IO Block: 4096 regular
empty file
Device: 803h/2051d Inode: 1200137 Links: 1
Access: (4777/-rwsrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Context: unconfined_u:object_r:admin_home_t:s0
Access: 2014-04-18 21:36:31.016029014 +0800
Modify: 2014-04-18 21:36:31.016029014 +0800
Change: 2014-04-18 21:36:31.026030285 +0800
Birth: -
File: ‘test’
Size: 5 Blocks: 8 IO Block: 4096 regular file
Device: 803h/2051d Inode: 1200137 Links: 1
Access: (4777/-rwsrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Context: unconfined_u:object_r:admin_home_t:s0
Access: 2014-04-18 21:36:31.016029014 +0800
Modify: 2014-04-18 21:36:31.040032065 +0800
Change: 2014-04-18 21:36:31.040032065 +0800
Birth: -
With no_root_squash, (root, remote ext4), SUID/SGID are cleared,
File: ‘test’
Size: 0 Blocks: 0 IO Block: 262144 regular
empty file
Device: 24h/36d Inode: 786439 Links: 1
Access: (4777/-rwsrwxrwx) Uid: ( 1000/ test) Gid: ( 1000/ test)
Context: system_u:object_r:nfs_t:s0
Access: 2014-04-18 21:45:32.155805097 +0800
Modify: 2014-04-18 21:45:32.155805097 +0800
Change: 2014-04-18 21:45:32.168806749 +0800
Birth: -
File: ‘test’
Size: 5 Blocks: 8 IO Block: 262144 regular file
Device: 24h/36d Inode: 786439 Links: 1
Access: (0777/-rwxrwxrwx) Uid: ( 1000/ test) Gid: ( 1000/ test)
Context: system_u:object_r:nfs_t:s0
Access: 2014-04-18 21:45:32.155805097 +0800
Modify: 2014-04-18 21:45:32.184808783 +0800
Change: 2014-04-18 21:45:32.184808783 +0800
Birth: -
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current code assumes a one-to-one lockowner<->lock stateid
correspondance.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The nfsv4 state code has always assumed a one-to-one correspondance
between lock stateid's and lockowners even if it appears not to in some
places.
We may actually change that, but for now when FREE_STATEID releases a
lock stateid it also needs to release the parent lockowner.
Symptoms were a subsequent LOCK crashing in find_lockowner_str when it
calls same_lockowner_ino on a lockowner that unexpectedly has an empty
so_stateids list.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As of 06f9cc12ca "nfsd4: don't create
unnecessary mask acl", any non-trivial ACL will be left with an
unitialized entry, and a trivial ACL may write one entry beyond what's
allocated.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use fh_fsid when reffering to the fsid part of the filehandle. The
variable length auth field envisioned in nfsfh wasn't ever implemented.
Also clean up some lose ends around this and document the file handle
format better.
Btw, why do we even export nfsfh.h to userspace? The file handle very
much is kernel private, and nothing in nfs-utils include the header
either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
commit 4ac7249ea5 have remove all EXPORT_SYMBOL,
linux/export.h is not needed, just clean it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Move the state locking and file descriptor reference out from the
callers and into nfs4_preprocess_stateid_op() itself.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
They do not need to be used outside fs/nfsd/nfs4state.c
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There is almost nothing left it in, just merge it into the only file
that includes it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are no legitimate users outside of fs/nfsd, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are no legitimate users outside of fs/nfsd, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The only real user of this header is fs/nfsd/nfsfh.h, so merge the
two. Various lockѕ source files used it to indirectly get other
sunrpc or nfs headers, so fix those up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It is large, it is used in more than one place, and it is not performance
critical. Let gcc figure out whether it should be inlined...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Mainly to ensure that we don't leave any hanging timers.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Aside from making it clearer what is non-trivial in create_client(), it
also fixes a bug whereby we can call free_client() before idr_init()
has been called.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since we're still limiting attributes to a page, the result here is that
a large getattr result will return NFS4ERR_REP_TOO_BIG/TOO_BIG_TO_CACHE
instead of NFS4ERR_RESOURCE.
Both error returns are wrong, and the real bug here is the arbitrary
limit on getattr results, fixed by as-yet out-of-tree patches. But at a
minimum we can make life easier for clients by sticking to one broken
behavior in released kernels instead of two....
Trond says:
one immediate consequence of this patch will be that NFSv4.1
clients will now report EIO instead of EREMOTEIO if they hit the
problem. That may make debugging a little less obvious.
Another consequence will be that if we ever do try to add client
side handling of NFS4ERR_REP_TOO_BIG, then we now have to deal
with the “handle existing buggy server” syndrome.
Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...otherwise the logic in the timeout handling doesn't work correctly.
Spotted-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull nfsd updates from Bruce Fields:
"Highlights:
- server-side nfs/rdma fixes from Jeff Layton and Tom Tucker
- xdr fixes (a larger xdr rewrite has been posted but I decided it
would be better to queue it up for 3.16).
- miscellaneous fixes and cleanup from all over (thanks especially to
Kinglong Mee)"
* 'for-3.15' of git://linux-nfs.org/~bfields/linux: (36 commits)
nfsd4: don't create unnecessary mask acl
nfsd: revert v2 half of "nfsd: don't return high mode bits"
nfsd4: fix memory leak in nfsd4_encode_fattr()
nfsd: check passed socket's net matches NFSd superblock's one
SUNRPC: Clear xpt_bc_xprt if xs_setup_bc_tcp failed
NFSD/SUNRPC: Check rpc_xprt out of xs_setup_bc_tcp
SUNRPC: New helper for creating client with rpc_xprt
NFSD: Free backchannel xprt in bc_destroy
NFSD: Clear wcc data between compound ops
nfsd: Don't return NFS4ERR_STALE_STATEID for NFSv4.1+
nfsd4: fix nfs4err_resource in 4.1 case
nfsd4: fix setclientid encode size
nfsd4: remove redundant check from nfsd4_check_resp_size
nfsd4: use more generous NFS4_ACL_MAX
nfsd4: minor nfsd4_replay_cache_entry cleanup
nfsd4: nfsd4_replay_cache_entry should be static
nfsd4: update comments with obsolete function name
rpc: Allow xdr_buf_subsegment to operate in-place
NFSD: Using free_conn free connection
SUNRPC: fix memory leak of peer addresses in XPRT
...
Pull renameat2 system call from Miklos Szeredi:
"This adds a new syscall, renameat2(), which is the same as renameat()
but with a flags argument.
The purpose of extending rename is to add cross-rename, a symmetric
variant of rename, which exchanges the two files. This allows
interesting things, which were not possible before, for example
atomically replacing a directory tree with a symlink, etc... This
also allows overlayfs and friends to operate on whiteouts atomically.
Andy Lutomirski also suggested a "noreplace" flag, which disables the
overwriting behavior of rename.
These two flags, RENAME_EXCHANGE and RENAME_NOREPLACE are only
implemented for ext4 as an example and for testing"
* 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ext4: add cross rename support
ext4: rename: split out helper functions
ext4: rename: move EMLINK check up
ext4: rename: create ext4_renament structure for local vars
vfs: add cross-rename
vfs: lock_two_nondirectories: allow directory args
security: add flags to rename hooks
vfs: add RENAME_NOREPLACE flag
vfs: add renameat2 syscall
vfs: rename: use common code for dir and non-dir
vfs: rename: move d_move() up
vfs: add d_is_dir()
Any setattr of the ACL attribute, even if it sets just the basic 3-ACE
ACL exactly as it was returned from a file with only mode bits, creates
a mask entry, and it is only the mask, not group, entry that is changed
by subsequent modifications of the mode bits.
So, for example, it's surprising that GROUP@ is left without read or
write permissions after a chmod 0666:
touch test
chmod 0600 test
nfs4_getfacl test
A::OWNER@:rwatTcCy
A::GROUP@:tcy
A::EVERYONE@:tcy
nfs4_getfacl test | nfs4_setfacl -S - test #
chmod 0666 test
nfs4_getfacl test
A::OWNER@:rwatTcCy
A::GROUP@:tcy
D::GROUP@:rwa
A::EVERYONE@:rwatcy
So, let's stop creating the unnecessary mask ACL.
A mask will still be created on non-trivial ACLs (ACLs with actual named
user and group ACEs), so the odd posix-acl behavior of chmod modifying
only the mask will still be left in that case; but that's consistent
with local behavior.
Reported-by: Soumya Koduri <skoduri@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This reverts the part of commit 6e14b46b91
that changes NFSv2 behavior.
Mark Lord found that it broke nfs-root for Linux clients, because it
broke NFSv2.
In fact, from RFC 1094:
"Notice that the file type is specified both in the mode bits
and in the file type. This is really a bug in the protocol and
will be fixed in future versions."
So NFSv2 clients really are expected to depend on the high bits of the
mode.
Cc: stable@kernel.org
Reported-by: Mark Lord <mlord@pobox.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After commit 6307f8fee2 ("security: remove dead hook task_setgroups"),
set_groups will always return zero, so we could just remove return value
of set_groups.
This patch reduces code size, and simplfies code to use set_groups,
because we don't need to check its return value any more.
[akpm@linux-foundation.org: remove obsolete claims from set_groups() comment]
Signed-off-by: Wang YanQing <udknight@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add new renameat2 syscall, which is the same as renameat with an added
flags argument.
Pass flags to vfs_rename() and to i_op->rename() as well.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
fh_put() does not free the temporary file handle.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There could be a case, when NFSd file system is mounted in network, different
to socket's one, like below:
"ip netns exec" creates new network and mount namespace, which duplicates NFSd
mount point, created in init_net context. And thus NFS server stop in nested
network context leads to RPCBIND client destruction in init_net.
Then, on NFSd start in nested network context, rpc.nfsd process creates socket
in nested net and passes it into "write_ports", which leads to RPCBIND sockets
creation in init_net context because of the same reason (NFSd monut point was
created in init_net context). An attempt to register passed socket in nested
net leads to panic, because no RPCBIND client present in nexted network
namespace.
This patch add check that passed socket's net matches NFSd superblock's one.
And returns -EINVAL error to user psace otherwise.
v2: Put socket on exit.
Reported-by: Weng Meiling <wengmeiling.weng@huawei.com>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Besides checking rpc_xprt out of xs_setup_bc_tcp,
increase it's reference (it's important).
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Testing NFS4.0 by pynfs, I got some messeages as,
"nfsd: inode locked twice during operation."
When one compound RPC contains two or more ops that locks
the filehandle,the second op will cause the message.
As two SETATTR ops, after the first SETATTR, nfsd will not call
fh_put() to release current filehandle, it means filehandle have
unlocked with fh_post_saved = 1.
The second SETATTR find fh_post_saved = 1, and printk the message.
v2: introduce helper fh_clear_wcc().
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
RFC5661 obsoletes NFS4ERR_STALE_STATEID in favour of NFS4ERR_BAD_STATEID.
Note that because nfsd encodes the clientid boot time in the stateid, we
can hit this error case in certain scenarios where the Linux client
state management thread exits early, before it has finished recovering
all state.
Reported-by: Idan Kedar <idank@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
encode_getattr, for example, can return nfserr_resource to indicate it
ran out of buffer space. That's not a legal error in the 4.1 case. And
in the 4.1 case, if we ran out of buffer space, we should have exceeded
a session limit too.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
cstate->slot and ->session are each set together in nfsd4_sequence. If
one is non-NULL, so is the other.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Connection from alloc_conn must be freed through free_conn,
otherwise, the reference of svc_xprt will never be put.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We have a WARN_ON in the nfsd4_decode_write() that tells us when the
client has sent a request that is not padded out properly according to
RFC4506. A WARN_ON really isn't appropriate in this case though since
this indicates a client bug, not a server one.
Move this check out to the top-level compound decoder and have it just
explicitly return an error. Also add a dprintk() that shows the client
address and xid to help track down clients and frames that trigger it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Looks like this bug has been here since these write counts were
introduced, not sure why it was just noticed now.
Thanks also to Jan Kara for pointing out the problem.
Cc: stable@vger.kernel.org
Reported-by: Matthew Rahtz <mrahtz@rapitasystems.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes an ommission from 18032ca062
"NFSD: Server implementation of MAC Labeling", which increased the size
of the setattr error reply without increasing COMPOUND_ERR_SLACK_SPACE.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a client attempts to set an excessively large ACL, return
NFS4ERR_FBIG instead of NFS4ERR_RESOURCE. I'm not sure FBIG is correct,
but I'm positive RESOURCE is wrong (it isn't even a well-defined error
any more for NFS versions since 4.1).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This was an omission from 8c18f2052e
"nfsd41: SUPPATTR_EXCLCREAT attribute".
Cc: Benny Halevy <bhalevy@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There is a regression in
208d0ac 2014-01-07 nfsd4: break only delegations when appropriate
which deletes an nfserrno() call in nfsd_setattr() (by accident,
probably), and NFSD becomes ignoring an error from VFS.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
4ac7249ea5 "nfsd: use get_acl and
->set_acl" forgets to set the size in the case get_acl() succeeds, so
_posix_to_nfsv4_one() can then write past the end of its allocation.
Symptoms were slab corruption warnings.
Also, some minor cleanup while we're here. (Among other things, note
that the first few lines guarantee that pacl is non-NULL.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull nfsd updates from Bruce Fields:
- Handle some loose ends from the vfs read delegation support.
(For example nfsd can stop breaking leases on its own in a
fewer places where it can now depend on the vfs to.)
- Make life a little easier for NFSv4-only configurations
(thanks to Kinglong Mee).
- Fix some gss-proxy problems (thanks Jeff Layton).
- miscellaneous bug fixes and cleanup
* 'for-3.14' of git://linux-nfs.org/~bfields/linux: (38 commits)
nfsd: consider CLAIM_FH when handing out delegation
nfsd4: fix delegation-unlink/rename race
nfsd4: delay setting current_fh in open
nfsd4: minor nfs4_setlease cleanup
gss_krb5: use lcm from kernel lib
nfsd4: decrease nfsd4_encode_fattr stack usage
nfsd: fix encode_entryplus_baggage stack usage
nfsd4: simplify xdr encoding of nfsv4 names
nfsd4: encode_rdattr_error cleanup
nfsd4: nfsd4_encode_fattr cleanup
minor svcauth_gss.c cleanup
nfsd4: better VERIFY comment
nfsd4: break only delegations when appropriate
NFSD: Fix a memory leak in nfsd4_create_session
sunrpc: get rid of use_gssp_lock
sunrpc: fix potential race between setting use_gss_proxy and the upcall rpc_clnt
sunrpc: don't wait for write before allowing reads from use-gss-proxy file
nfsd: get rid of unused function definition
Define op_iattr for nfsd4_open instead using macro
NFSD: fix compile warning without CONFIG_NFSD_V3
...
CLAIM_FH was added by NFSv4.1. It is the same as CLAIM_NULL except that it
uses only current FH to identify the file to be opened.
The NFS client is using CLAIM_FH if the FH is available when opening a file.
Currently, we cannot get any delegation if we stat a file before open it
because the server delegation code does not recognize CLAIM_FH.
We tested this patch and found delegation can be handed out now when claim is
CLAIM_FH.
See http://marc.info/?l=linux-nfs&m=136369847801388&w=2 and
http://www.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues#New_open_claim_types
Signed-off-by: Ming Chen <mchen@cs.stonybrook.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a file is unlinked or renamed between the time when we do the local
open and the time when we get the delegation, then we will return to the
client indicating that it holds a delegation even though the file no
longer exists under the name it was open under.
But a client performing an open-by-name, when it is returned a
delegation, must be able to assume that the file is still linked at the
name it was opened under.
So, hold the parent i_mutex for longer to prevent concurrent renames or
unlinks.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is basically a no-op, to simplify a following patch.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As far as I can tell, this list is used only under the state lock, so we
may as well do this in the simpler order.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Remove the boilerplate code to marshall and unmarhall ACL objects into
xattrs and operate on the posix_acl objects directly. Also move all
the ACL handling code into nfs?acl.c where it belongs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A struct svc_fh is 320 bytes on x86_64, it'd be better not to have these
on the stack.
kmalloc'ing them probably isn't ideal either, but this is the simplest
thing to do. If it turns out to be a problem in the readdir case then
we could add a svc_fh to nfsd4_readdir and pass that in.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We stick an extra svc_fh in nfsd3_readdirres to save the need to
kmalloc, though maybe it would be fine to kmalloc instead.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As a temporary fix, nfsd was breaking all leases on unlink, link,
rename, and setattr.
Now that we can distinguish between leases and delegations, we can be
nicer and break only the delegations, and not bother lease-holders with
operations they don't care about.
And we get to delete some code while we're at it.
Note that in the presence of delegations the vfs calls here all return
-EWOULDBLOCK instead of blocking, so nfsd threads will not get stuck
waiting for delegation returns.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If failed after calling alloc_session but before init_session, nfsd will call __free_session to
free se_slots in session. But, session->se_fchannel.maxreqs is not initialized (value is zero).
So that, the memory malloced for slots will be lost in free_session_slots for maxreqs is zero.
This path sets the information for channel in alloc_session after mallocing slots succeed,
instead in init_session.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Without CONFIG_NFSD_V3, compile will get warning as,
fs/nfsd/nfssvc.c: In function 'nfsd_svc':
>> fs/nfsd/nfssvc.c:246:60: warning: array subscript is above array bounds [-Warray-bounds]
return (nfsd_versions[2] != NULL) || (nfsd_versions[3] != NULL);
^
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When starting without nfsv2 and nfsv3, nfsd does not need to start
lockd (and certainly doesn't need to fail because lockd failed to
register with the portmapper).
Reported-by: Gareth Williams <gareth@garethwilliams.me.uk>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4 clients can contact port 2049 directly instead of needing the
portmapper.
Therefore a failure to register to the portmapper when starting an
NFSv4-only server isn't really a problem.
But Gareth Williams reports that an attempt to start an NFSv4-only
server without starting portmap fails:
#rpc.nfsd -N 2 -N 3
rpc.nfsd: writing fd to kernel failed: errno 111 (Connection refused)
rpc.nfsd: unable to set any sockets for nfsd
Add a flag to svc_version to tell the rpc layer it can safely ignore an
rpcbind failure in the NFSv4-only case.
Reported-by: Gareth Williams <gareth@garethwilliams.me.uk>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
the length for backchannel checking should be multiplied by sizeof(__be32).
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
check_forechannel_attrs gets drc memory, so nfsd must put it when
check_backchannel_attrs fails.
After many requests with bad back channel attrs, nfsd will deny any
client's CREATE_SESSION forever.
A new test case named CSESS29 for pynfs will send in another mail.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
commit 5b6feee960 forgot
recording the back channel attrs in nfsd4_session.
nfsd just check the back channel attars by check_backchannel_attrs,
but do not record it in nfsd4_session in the latest kernel.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since defined in Linux-2.6.12-rc2, READTIME has not been used.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
host_err was only used for nfs4_acl_new.
This patch delete it, and return nfserr_jukebox directly.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Get rid of the extra code, using nfsd4_encode_noop for encoding destroy_session and free_stateid.
And, delete unused argument (fr_status) int nfsd4_free_stateid.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We should use XDR_LEN to calculate reserved space in case the oid is not
a multiple of 4.
RESERVE_SPACE actually rounds up for us, but it's probably better to be
careful here.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
commit 557ce2646e
"nfsd41: replace page based DRC with buffer based DRC"
have remove unused nfsd4_set_statp, but miss the function definition.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
commit 58cd57bfd9
"nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID"
miss calculating the length of bitmap for spo_must_enforce and spo_must_allow.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There is an inconsistency in the handling of SUID/SGID file
bits after chown() between NFS and other local file systems.
Local file systems (for example, ext3, ext4, xfs, btrfs) revoke
SUID/SGID bits after chown() on a regular file even if
the owner/group of the file has not been changed:
~# touch file; chmod ug+s file; chmod u+x file
~# ls -l file
-rwsr-Sr-- 1 root root 0 Dec 6 04:49 file
~# chown root file; ls -l file
-rwxr-Sr-- 1 root root 0 Dec 6 04:49 file
but NFS doesn't do that:
~# touch file; chmod ug+s file; chmod u+x file
~# ls -l file
-rwsr-Sr-- 1 root root 0 Dec 6 04:49 file
~# chown root file; ls -l file
-rwsr-Sr-- 1 root root 0 Dec 6 04:49 file
NFS does that only if the owner/group has been changed:
~# touch file; chmod ug+s file; chmod u+x file
~# ls -l file
-rwsr-Sr-- 1 root root 0 Dec 6 05:02 file
~# chown bin file; ls -l file
-rwxr-Sr-- 1 bin root 0 Dec 6 05:02 file
See: http://pubs.opengroup.org/onlinepubs/9699919799/functions/chown.html
"If the specified file is a regular file, one or more of
the S_IXUSR, S_IXGRP, or S_IXOTH bits of the file mode are set,
and the process has appropriate privileges, it is
implementation-defined whether the set-user-ID and set-group-ID
bits are altered."
So both variants are acceptable by POSIX.
This patch makes NFS to behave like local file systems.
Signed-off-by: Stanislav Kholmanskikh <stanislav.kholmanskikh@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently when we are processing a request, we try to scrape an expired
or over-limit entry off the list in preference to allocating a new one
from the slab.
This is unnecessarily complicated. Just use the slab layer.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The Linux NFS server replies among other things to a "Check access permission"
the following:
NFS: File type = 2 (Directory)
NFS: Mode = 040755
A netapp server replies here:
NFS: File type = 2 (Directory)
NFS: Mode = 0755
The RFC 1813 i read:
fattr3
struct fattr3 {
ftype3 type;
mode3 mode;
uint32 nlink;
...
For the mode bits only the lowest 9 are defined in the RFC
As far as I can tell, knfsd has always done this, so apparently it's harmless.
Nevertheless, it appears to be wrong.
Note this is already correct in the NFSv4 case, only v2 and v3 need
fixing.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The DRC code will attempt to reuse an existing, expired cache entry in
preference to allocating a new one. It'll then search the cache, and if
it gets a hit it'll then free the cache entry that it was going to
reuse.
The cache code doesn't unhash the entry that it's going to reuse
however, so it's possible for it end up designating an entry for reuse
and then subsequently freeing the same entry after it finds it. This
leads it to a later use-after-free situation and usually some list
corruption warnings or an oops.
Fix this by simply unhashing the entry that we intend to reuse. That
will mean that it's not findable via a search and should prevent this
situation from occurring.
Cc: stable@vger.kernel.org # v3.10+
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: g. artim <gartim@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes a regression from 247500820e
"nfsd4: fix decoding of compounds across page boundaries". The previous
code was correct: argp->pagelist is initialized in
nfs4svc_deocde_compoundargs to rqstp->rq_arg.pages, and is therefore a
pointer to the page *after* the page we are currently decoding.
The reason that patch nevertheless fixed a problem with decoding
compounds containing write was a bug in the write decoding introduced by
5a80a54d21 "nfsd4: reorganize write
decoding", after which write decoding no longer adhered to the rule that
argp->pagelist point to the next page.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use a straight goto error label style in nfsd_setattr to make sure
we always do the put_write_access call after we got it earlier.
Note that the we have been failing to do that in the case
nfsd_break_lease() returns an error, a bug introduced into 2.6.38 with
6a76bebefe "nfsd4: break lease on nfsd
setattr".
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Split out two helpers to make the code more readable and easier to verify
for correctness.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull nfsd changes from Bruce Fields:
"This includes miscellaneous bugfixes and cleanup and a performance fix
for write-heavy NFSv4 workloads.
(The most significant nfsd-relevant change this time is actually in
the delegation patches that went through Viro, fixing a long-standing
bug that can cause NFSv4 clients to miss updates made by non-nfs users
of the filesystem. Those enable some followup nfsd patches which I
have queued locally, but those can wait till 3.14)"
* 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (24 commits)
nfsd: export proper maximum file size to the client
nfsd4: improve write performance with better sendspace reservations
svcrpc: remove an unnecessary assignment
sunrpc: comment typo fix
Revert "nfsd: remove_stid can be incorporated into nfs4_put_delegation"
nfsd4: fix discarded security labels on setattr
NFSD: Add support for NFS v4.2 operation checking
nfsd4: nfsd_shutdown_net needs state lock
NFSD: Combine decode operations for v4 and v4.1
nfsd: -EINVAL on invalid anonuid/gid instead of silent failure
nfsd: return better errors to exportfs
nfsd: fh_update should error out in unexpected cases
nfsd4: need to destroy revoked delegations in destroy_client
nfsd: no need to unhash_stid before free
nfsd: remove_stid can be incorporated into nfs4_put_delegation
nfsd: nfs4_open_delegation needs to remove_stid rather than unhash_stid
nfsd: nfs4_free_stid
nfsd: fix Kconfig syntax
sunrpc: trim off EC bytes in GSSAPI v2 unwrap
gss_krb5: document that we ignore sequence number
...
I noticed that we export a way to high value for the maxfilesize
attribute when debugging a client issue. The issue didn't turn
out to be related to it, but I think we should export it, so that
clients can limit what write sizes they accept before hitting
the server.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently the rpc code conservatively refuses to accept rpc's from a
client if the sum of its worst-case estimates of the replies it owes
that client exceed the send buffer space.
Unfortunately our estimate of the worst-case reply for an NFSv4 compound
is always the maximum read size. This can unnecessarily limit the
number of operations we handle concurrently, for example in the case
most operations are writes (which have small replies).
We can do a little better if we check which ops the compound contains.
This is still a rough estimate, we'll need to improve on it some day.
Reported-by: Shyam Kaushik <shyamnfs1@gmail.com>
Tested-by: Shyam Kaushik <shyamnfs1@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4 uses leases to guarantee that clients can cache metadata as well
as data.
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: David Howells <dhowells@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need to break delegations on any operation that changes the set of
links pointing to an inode. Start with unlink.
Such operations also hold the i_mutex on a parent directory. Breaking a
delegation may require waiting for a timeout (by default 90 seconds) in
the case of a unresponsive NFS client. To avoid blocking all directory
operations, we therefore drop locks before waiting for the delegation.
The logic then looks like:
acquire locks
...
test for delegation; if found:
take reference on inode
release locks
wait for delegation break
drop reference on inode
retry
It is possible this could never terminate. (Even if we take precautions
to prevent another delegation being acquired on the same inode, we could
get a different inode on each retry.) But this seems very unlikely.
The initial test for a delegation happens after the lock on the target
inode is acquired, but the directory inode may have been acquired
further up the call stack. We therefore add a "struct inode **"
argument to any intervening functions, which we use to pass the inode
back up to the caller in the case it needs a delegation synchronously
broken.
Cc: David Howells <dhowells@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For now FL_DELEG is just a synonym for FL_LEASE. So this patch doesn't
change behavior.
Next we'll modify break_lease to treat FL_DELEG leases differently, to
account for the fact that NFSv4 delegations should be broken in more
situations than Windows oplocks.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This reverts commit 7ebe40f203. We forgot
the nfs4_put_delegation call in fs/nfsd/nfs4callback.c which should not
be unhashing the stateid. This lead to warnings from the idr code when
we tried to removed id's twice.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Security labels in setattr calls are currently ignored because we forget
to set label->len.
Cc: stable@vger.kernel.org
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The server does allow NFS over v4.2, even if it doesn't add any new
operations yet.
I also switch to using constants to represent the last operation for
each minor version since this makes the code cleaner and easier to
understand at a quick glance.
Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A comment claims the caller should take it, but that's not being done.
Note we don't want it around the cancel_delayed_work_sync since that may
wait on work which holds the client lock.
Reported-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We were using a different array of function pointers to represent each
minor version. This makes adding a new minor version tedious, since it
needs a step to copy, paste and modify a new version of the same
functions.
This patch combines the v4 and v4.1 arrays into a single instance and
will check minor version support inside each decoder function.
Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If we're going to refuse to accept these it would be polite of us to at
least say so....
This introduces a slight complication since we need to grandfather in
exportfs's ill-advised use of -1 uid and gid on its test_export.
If it turns out there are other users passing down -1 we may need to
do something else.
Best might be to drop the checks entirely, but I'm not sure if other
parts of the kernel might assume that a task can't run as uid or gid -1.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Someone noticed exportfs happily accepted exports that would later be
rejected when mountd tried to give them to the kernel. Fix this.
This is a regression from 4c1e1b34d5
"nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids".
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: stable@vger.kernel.org
Reported-by: Yin.JianHong <jiyin@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The reporter saw a NULL dereference when a filesystem's ->mknod returned
success but left the dentry negative, and then nfsd tried to dereference
d_inode (in this case because the CREATE was followed by a GETATTR in
the same nfsv4 compound).
fh_update already checks for this and another broken case, but for some
reason it returns success and leaves nfsd trying to soldier on. If it
failed we'd avoid the crash. There's only so much we can do with a
buggy filesystem, but it's easy enough to bail out here, so let's do
that.
Reported-by: Antti Tönkyrä <daedalus@pingtimeout.net>
Tested-by: Antti Tönkyrä <daedalus@pingtimeout.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
[use list_splice_init]
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
[bfields: no need for recall_lock here]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>