Currently iomap_dio_rw() only handles (data)sync write completions
for AIO. This means we can't optimised non-AIO IO to minimise device
flushes as we can't tell the caller whether a flush is required or
not.
To solve this problem and enable further optimisations, make
iomap_dio_rw responsible for data sync behaviour for all IO, not
just AIO.
In doing so, the sync operation is now accounted as part of the DIO
IO by inode_dio_end(), hence post-IO data stability updates will no
long race against operations that serialise via inode_dio_wait()
such as truncate or hole punch.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
To prepare for iomap iinfrastructure based DSYNC optimisations.
While moving the code araound, move the XFS write bytes metric
update for direct IO into xfs_dio_write_end_io callback so that we
always capture the amount of data written via AIO+DIO. This fixes
the problem where queued AIO+DIO writes are not accounted to this
metric.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When looking at an event trace recently, I noticed that non-blocking
buffer lookup attempts would fail on cached locked buffers and then
run the slow cache-miss path. This means we are doing an xfs_buf
allocation, lookup and free unnecessarily every time we avoid
blocking on a locked buffer.
Fix this by changing _xfs_buf_find() to return an error status to
the caller to indicate that we failed the lock attempt rather than
just returning a NULL. This allows the higher level code to
discriminate between a cache miss and an cache hit that we failed to
lock.
This also allows us to return a -EFSCORRUPTED state if we are asked
to look up a block number outside the range of the filesystem in
_xfs_buf_find(), which moves us one step closer to being able to
handle such errors in a more graceful manner at the higher levels.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move xfs_buf_incore out of line and make it the only way to look up
a buffer in the buffer cache from outside the buffer cache. Convert
the external users of _xfs_buf_find() to xfs_buf_incore() and make
_xfs_buf_find() static.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: actually rename xfs_incore -> xfs_buf_incore]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This will trace i.e. the ATTR_SECURE/ATTR_CREATE/ATTR_REPLACE
flags as well as the OP_FLAGS.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we have corrupted free inode btrees, we can attempt to
allocate inodes that we know are already allocated. Catch allocation
of these inodes and report corruption as early as possible to
prevent corruption propagation or deadlocks.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A recent fuzzed filesystem image cached random dcache corruption
when the reproducer was run. This often showed up as panics in
lookup_slow() on a null inode->i_ops pointer when doing pathwalks.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
....
Call Trace:
lookup_slow+0x44/0x60
walk_component+0x3dd/0x9f0
link_path_walk+0x4a7/0x830
path_lookupat+0xc1/0x470
filename_lookup+0x129/0x270
user_path_at_empty+0x36/0x40
path_listxattr+0x98/0x110
SyS_listxattr+0x13/0x20
do_syscall_64+0xf5/0x280
entry_SYSCALL_64_after_hwframe+0x42/0xb7
but had many different failure modes including deadlocks trying to
lock the inode that was just allocated or KASAN reports of
use-after-free violations.
The cause of the problem was a corrupt INOBT on a v4 fs where the
root inode was marked as free in the inobt record. Hence when we
allocated an inode, it chose the root inode to allocate, found it in
the cache and re-initialised it.
We recently fixed a similar inode allocation issue caused by inobt
record corruption problem in xfs_iget_cache_miss() in commit
ee457001ed ("xfs: catch inode allocation state mismatch
corruption"). This change adds similar checks to the cache-hit path
to catch it, and turns the reproducer into a corruption shutdown
situation.
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)
- SPDX license tag cleanups (new tag Linux-OpenIB)
- RoCE GID fixes related to default GIDs
- Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
mlx4, mlx5, and hfi1 (medium batch)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJa7JXPAAoJELgmozMOVy/dc0AP/0i7EajAmgl1ihka6BYVj2pa
DV8iSrVMDPulh9AVnAtwLJSbdwmgN/HeVzLzcutHyMYk6tAf8RCs6TsyoB36XiOL
oUh5+V2GyNnyh9veWPwyGTgZKCpPJc3uQFV6502lZVDYwArMfGatumApBgQVKiJ+
YdPEXEQZPNIs6YZB1WXkNYV/ra9u0aBByQvUrxwVZ2AND+srJYO82tqZit2wBtjK
UXrhmZbWXGWMFg8K3/lpfUkQhkG3Arj+tMKkCfqsVzC7wUPhlTKBHR9NmvdLIiy9
5Vhv7Xp78udcxZKtUeTFsbhaMqqK7x7sKHnpKAs7hOZNZ/Eg47BrMwMrZVLOFuDF
nBLUL1H+nJ1mASZoMWH5xzOpVew+e9X0cot09pVDBIvsOIh97wCG7hgptQ2Z5xig
fcDiMmg6tuakMsaiD0dzC9JI5HR6Z7+6oR1tBkQFDxQ+XkkcoFabdmkJaIRRwOj7
CUhXRgcm0UgVd03Jdta6CtYXsjSODirWg4AvSSMt9lUFpjYf9WZP00/YojcBbBEH
UlVrPbsKGyncgrm3FUP6kXmScESfdTljTPDLiY9cO9+bhhPGo1OHf005EfAp178B
jGp6hbKlt+rNs9cdXrPSPhjds+QF8HyfSlwyYVWKw8VWlh/5DG8uyGYjF05hYO0q
xhjIS6/EZjcTAh5e4LzR
=PI8v
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Doug Ledford:
"This is our first pull request of the rc cycle. It's not that it's
been overly quiet, we were just waiting on a few things before sending
this off.
For instance, the 6 patch series from Intel for the hfi1 driver had
actually been pulled in on Tuesday for a Wednesday pull request, only
to have Jason notice something I missed, so we held off for some
testing, and then on Thursday had to respin the series because the
very first patch needed a minor fix (unnecessary cast is all).
There is a sizable hns patch series in here, as well as a reasonably
largish hfi1 patch series, then all of the lines of uapi updates are
just the change to the new official Linux-OpenIB SPDX tag (a bunch of
our files had what amounts to a BSD-2-Clause + MIT Warranty statement
as their license as a result of the initial code submission years ago,
and the SPDX folks decided it was unique enough to warrant a unique
tag), then the typical mlx4 and mlx5 updates, and finally some cxgb4
and core/cache/cma updates to round out the bunch.
None of it was overly large by itself, but in the 2 1/2 weeks we've
been collecting patches, it has added up :-/.
As best I can tell, it's been through 0day (I got a notice about my
last for-next push, but not for my for-rc push, but Jason seems to
think that failure messages are prioritized and success messages not
so much). It's also been through linux-next. And yes, we did notice in
the context portion of the CMA query gid fix patch that there is a
dubious BUG_ON() in the code, and have plans to audit our BUG_ON usage
and remove it anywhere we can.
Summary:
- Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)
- SPDX license tag cleanups (new tag Linux-OpenIB)
- RoCE GID fixes related to default GIDs
- Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
mlx4, mlx5, and hfi1 (medium batch)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (52 commits)
RDMA/cma: Do not query GID during QP state transition to RTR
IB/mlx4: Fix integer overflow when calculating optimal MTT size
IB/hfi1: Fix memory leak in exception path in get_irq_affinity()
IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure
IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used
IB/hfi1: Fix loss of BECN with AHG
IB/hfi1 Use correct type for num_user_context
IB/hfi1: Fix handling of FECN marked multicast packet
IB/core: Make ib_mad_client_id atomic
iw_cxgb4: Atomically flush per QP HW CQEs
IB/uverbs: Fix kernel crash during MR deregistration flow
IB/uverbs: Prevent reregistration of DM_MR to regular MR
RDMA/mlx4: Add missed RSS hash inner header flag
RDMA/hns: Fix a couple misspellings
RDMA/hns: Submit bad wr
RDMA/hns: Update assignment method for owner field of send wqe
RDMA/hns: Adjust the order of cleanup hem table
RDMA/hns: Only assign dqpn if IB_QP_PATH_DEST_QPN bit is set
RDMA/hns: Remove some unnecessary attr_mask judgement
RDMA/hns: Only assign mtu if IB_QP_PATH_MTU bit is set
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJa7HerAAoJEPfTWPspceCmxdsP/3ncJdw/PJRGaQNt99ogEIbl
y/YscWxPWsxbigM0Yc0zh134vO5ZeE7v12InpoE3i5OO4UW+oC+WYP/KDAo3TJIy
j/9r25p1kfb3j/8fNlb8uMf/6/nKk29cu+gqIZleHMOj6hfap5AFdTwW0/B/gC/p
BJ+C3e3s41intl+NikZmD4M959gpPTgm5ma8wyCz1XKtGQMH5AxFFrIc22vug/Fb
3Nk++xuFvgF04tCXwimhgny2eOtHt5L6KNuYYHFWBnd1gXALttsisLgAW2vXbfFB
c9PDEya3c+btr8+ied27Tp0hHlcQa2/ZY+yFJ3RJ35AXMvTVNDx6bKF3PzfJWzt+
ynjrywsXC/k7G1JBZntdXF7+y8b52keaIBS8DBBxzhhmzrv0NOTGTaQRhuK5eeem
tHrvEZlP5iqPRGGQz7F1RYztdWulo/iMLJwibuy2rcNYeHL5T0Olhv9hdH26OVqV
CNEuEvy+xO4uzkXAGm3j/EoHryHvGgp2xD/8OuQfTnjB6IdcuLznJuyBiUyOj/te
PgSAI/SdUKPnWyVVONKjXyOyvAglcenNtWMmAZQbsOSNZAW2blrXSFvzHa8wDVe+
Zpw5+fWJOioemMo+gf884jMRbNDfwyq5hcgjpbkYRz+qg60abqefNt7e87mTqTcJ
WqP9luNiP9RmXsXo4k+w
=P6V8
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180504' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes that should to into this release. This contains:
- Set of bcache fixes from Coly, fixing regression in patches that
went into this series.
- Set of NVMe fixes by way of Keith.
- Set of bdi related fixes, one from Jan and two from Tetsuo Handa,
fixing various issues around device addition/removal.
- Two block inflight fixes from Omar, fixing issues around the
transition to using tags for blk-mq inflight accounting that we
did a few releases ago"
* tag 'for-linus-20180504' of git://git.kernel.dk/linux-block:
bdi: Fix oops in wb_workfn()
nvmet: switch loopback target state to connecting when resetting
nvme/multipath: Fix multipath disabled naming collisions
nvme/multipath: Disable runtime writable enabling parameter
nvme: Set integrity flag for user passthrough commands
nvme: fix potential memory leak in option parsing
bdi: Fix use after free bug in debugfs_remove()
bdi: wake up concurrent wb_shutdown() callers.
bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
bcache: set dc->io_disable to true in conditional_stop_bcache_device()
bcache: add wait_for_kthread_stop() in bch_allocator_thread()
bcache: count backing device I/O error for writeback I/O
bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()
bcache: store disk name in struct cache and struct cached_dev
blk-mq: fix sysfs inflight counter
blk-mq: count allocated but not started requests in iostats inflight
- Cap the maximum length of a deduplication request at MAX_RW_COUNT/2
to avoid kernel livelock due to excessively large IO requests.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlrp5f0ACgkQ+H93GTRK
tOuflQ//VOKn73zjiwhYPcN4a45/O4sWVxGdAS59+W3usFYG5S6u/OC9XNmaGP2V
XVmf5VhfYT6MsYq562JVMuXo9GiqU2cMbvFuHQHibdlk+hHVPeIQI5ipBn7K7OMj
cgvlE3UWKm407QamC9mM62HyeDb9fyXJnDN+nXi6LruPg6ClJQHt1AQLxlDRu6Bm
ioIpyBJKWbCGbXEGiGIUu5Jv8ohKyZ0lUnxJTRqYrKvqEQ/pDpkMI6POwC0TuHVr
Kd9z2KCKH7RQmtC1zSFzaXn7N5jb2C1Z70i50F7eeC8MLZJbsXxK3pJlugsOYU4D
NCgELpqoYV5R08B+Z4yp2Wj91PWBDuY6cTigArLjqvRN1IX2Q+g8uOQhKi1q0c4x
x+18y+cAPsNgpzu92SFUzJHGLJg5kHxN2wcqt2rIsTdJQt3Z3mIW/sCW/+69v4nu
arHRMhfa6whcFm30phGyw3juYOsxwypasCGWoDR4A5x8Jy0hrsGhVN1SW6Iw1v7W
AGYNH1BIb+2qjQ6To6YgRvwvhbT8EpvxwBi3c0wgmOuI3/oOdiJBS4A7tGXo4fHQ
9CK6gucf9VCX3Nij+juHBXwIeBJ6m6yWX4zD1R5zhA5Z8RIZRMe16aFoOfmHyrAg
SnwYrOBZFD1k3I5fwcWovxxY9V0CxiZOGg217YkSZP5N7+PamZc=
=l9pt
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"I've got one more bug fix for xfs for 4.17-rc4, which caps the amount
of data we try to handle in one dedupe request so that userspace can't
livelock the kernel.
This series has been run through a full xfstests run during the week
and through a quick xfstests run against this morning's master, with
no ajor failures reported.
Summary:
- Cap the maximum length of a deduplication request at MAX_RW_COUNT/2
to avoid kernel livelock due to excessively large IO requests"
* tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: cap the length of deduplication requests
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrsbkEACgkQxWXV+ddt
WDvm6Q/+KdFKJ7T8hBOc6o5EeULXCDF3FmMA7HvDC696WXKsckXJFKk52awvrSb6
3wnIzfWmI3K+rwX3cKqLRKe6tMXtBrTjVWXyezfvx1SMcCO4hSQ+nWLqK08htaNf
h7m3OC3y0xO8QHcFSkvHUov6KRWG3rH+4p46JsJjN7GTBtWmR6tsiyQQ9JMC3gNR
8Jnl1YaQq/JDLFm8GmFfPqIK+MLnNJ+GOJC1pm2vJQFtjnDw9dic+dI2hGX2oh9M
SSRAoJu7jUvTWSmQN9aJfbBUr4atzoKKGYsyAgx5qgXbzOnbUGTIhtyAZirRWWBy
0pT2b/8XuqsIabwR5dR4UbL4Ke1h5DS4c6GFydwO4DeddTovHtDzbN0cPuPQABL1
rwFzlnHhcM/qRu9SKXx1jRy7w4Vju8fVX9D4lyjLcyk24flkEAn1NlJCWEqSzPYR
ikTTm71r/1/62XqE6AcOyugS8E6EYtYHo3PjrfFXr3fQCbctTLEaUKoegFkTezbX
EZLRPy9KNGfuyUyh3eiSypNHZZ9WiT+W42BaLIcpbHJLnqB+A14Z0oZx6aHVhY0T
VFLL91O//OmUvjpsZ7I99LyswvrzsmU0jKS41GXOQlLsLTxtGhcZbvRt4uX4UUie
UOQrNCgO0Y2slGfv+uoDCLhH0tTCRziuEvmgu8bjPcfZE7tph+s=
=Jdmp
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and one fix for stable"
* tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: send, fix missing truncate for inode with prealloc extent past eof
btrfs: Take trans lock before access running trans in check_delayed_ref
btrfs: Fix wrong first_key parameter in replace_path
Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb->bdi->dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:
mod_delayed_work(bdi_wq, &wb->dwork, 0);
and if this happens while wb_shutdown() is waiting in:
flush_delayed_work(&wb->dwork);
the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb->bdi->dev.
Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
CC: Tejun Heo <tj@kernel.org>
Fixes: 839a8e8660
Reported-by: syzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since deduplication potentially has to read in all the pages in both
files in order to compare the contents, cap the deduplication request
length at MAX_RW_COUNT/2 (roughly 1GB) so that we have /some/ upper bound
on the request length and can't just lock up the kernel forever. Found
by running generic/304 after commit 1ddae54555b62 ("common/rc: add
missing 'local' keywords").
Reported-by: matorola@gmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
An incremental send operation can miss a truncate operation when an inode
has an increased size in the send snapshot and a prealloc extent beyond
its size.
Consider the following scenario where a necessary truncate operation is
missing in the incremental send stream:
1) In the parent snapshot an inode has a size of 1282957 bytes and it has
no prealloc extents beyond its size;
2) In the the send snapshot it has a size of 5738496 bytes and has a new
extent at offsets 1884160 (length of 106496 bytes) and a prealloc
extent beyond eof at offset 6729728 (and a length of 339968 bytes);
3) When processing the prealloc extent, at offset 6729728, we end up at
send.c:send_write_or_clone() and set the @len variable to a value of
18446744073708560384 because @offset plus the original @len value is
larger then the inode's size (6729728 + 339968 > 5738496). We then
call send_extent_data(), with that @offset and @len, which in turn
calls send_write(), and then the later calls fill_read_buf(). Because
the offset passed to fill_read_buf() is greater then inode's i_size,
this function returns 0 immediately, which makes send_write() and
send_extent_data() do nothing and return immediately as well. When
we get back to send.c:send_write_or_clone() we adjust the value
of sctx->cur_inode_next_write_offset to @offset plus @len, which
corresponds to 6729728 + 18446744073708560384 = 5738496, which is
precisely the the size of the inode in the send snapshot;
4) Later when at send.c:finish_inode_if_needed() we determine that
we don't need to issue a truncate operation because the value of
sctx->cur_inode_next_write_offset corresponds to the inode's new
size, 5738496 bytes. This is wrong because the last write operation
that was issued started at offset 1884160 with a length of 106496
bytes, so the correct value for sctx->cur_inode_next_write_offset
should be 1990656 (1884160 + 106496), so that a truncate operation
with a value of 5738496 bytes would have been sent to insert a
trailing hole at the destination.
So fix the issue by making send.c:send_write_or_clone() not attempt
to send write or clone operations for extents that start beyond the
inode's size, since such attempts do nothing but waste time by
calling helper functions and allocating path structures, and send
currently has no fallocate command in order to create prealloc extents
at the destination (either beyond a file's eof or not).
The issue was found running the test btrfs/007 from fstests using a seed
value of 1524346151 for fsstress.
Reported-by: Gu, Jinxiang <gujx@cn.fujitsu.com>
Fixes: ffa7c4296e ("Btrfs: send, do not issue unnecessary truncate operations")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preivous patch:
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
We avoid starting btrfs transaction and get this information from
fs_info->running_transaction directly.
When accessing running_transaction in check_delayed_ref, there's a
chance that current transaction will be freed by commit transaction
after the NULL pointer check of running_transaction is passed.
After looking all the other places using fs_info->running_transaction,
they are either protected by trans_lock or holding the transactions.
Fix this by using trans_lock and increasing the use_count.
Fixes: e4c3b2dcd1 ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Enhance inode fork verifiers to prevent loading of corrupted metadata.
- Fix a crash when we try to convert extents format inodes to btree
format, we run out of space, but forget to revert the in-core state
changes.
- Fix file size checks when doing INSERT_RANGE that could cause files
to end up negative size if there previously was an extent mapped at
s_maxbytes.
- Fix a bug when doing a remove-then-add ATTR_REPLACE xattr update where
we forget to clear ATTR_REPLACE after the remove, which causes the
attr to be lost and the fs to shut down due to (what it thinks is)
inconsistent in-core state.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJa2gs1AAoJEPh/dxk0SrTru50P+wZuY2DRfBSCH20Oq3qTbO9M
dQGBnOIqRUC+9IDkpRExB4BYVZoK9/v3fnW3zttRMlEmTQfHtMsoEUXpByoETY5b
/bTZh+c7fMhj7eNZspfc+8xsHvq0k3hSxrITe4zjL2rSy72KUrsYtDoIu2UvXyZK
nJsqCiyFOdFgMi6IBRLOAVBPzs/q8sIBfl/axjyvokLL/6ki/TfvCAtLkdT4FRIt
UHzE8ly/Z99honciPQW4axZ9TobAVd6g2d11XJpbhku3ijTL/vHyflRCFgmM/2T3
VyJ3tTH/w3rCAXOEEj3H8TvKAlHiKpB+g9VwhsTZrP0B14/ljkClm/yEFnCOLmb1
/26t+A++fkqy71PoqeQvXGJGEAxC/1TGo7dxq6Gn5SESc1yr8CjXXMiWTMBBWgfF
xIsBNA6Ok0F5+OkEhfSUtfIBziPeUIjbmNGTnMV/EiL9stpHwrQS+JLAK5+POHvQ
ZoH5RTY69mK5rJtzN6USGyPe4J0f+S7YTf9fjKTcjjpWHLjJYEYJcGvQb3ynZSV+
5Wu1TqaUkl/Mp3mkZ1KFMq+MXBgOnarOZug80fbhkv3Vyzbelzpebuc5fMJnzofp
0x6BQUC+bwOxB2KJX8TC+9ajen8O+oxrE2RyehVtCZL0O830NvZ4hQg3yxjKUbwv
8Y0Mq6Q2CQEmsBJkIHH2
=lFf0
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a few more bug fixes for xfs for 4.17-rc4. Most of them are
fixes for bad behavior.
This series has been run through a full xfstests run during LSF and
through a quick xfstests run against this morning's master, with no
major failures reported.
Summary:
- Enhance inode fork verifiers to prevent loading of corrupted
metadata.
- Fix a crash when we try to convert extents format inodes to btree
format, we run out of space, but forget to revert the in-core state
changes.
- Fix file size checks when doing INSERT_RANGE that could cause files
to end up negative size if there previously was an extent mapped at
s_maxbytes.
- Fix a bug when doing a remove-then-add ATTR_REPLACE xattr update
where we forget to clear ATTR_REPLACE after the remove, which
causes the attr to be lost and the fs to shut down due to (what it
thinks is) inconsistent in-core state"
* tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE
xfs: prevent creating negative-sized file via INSERT_RANGE
xfs: set format back to extents if xfs_bmap_extents_to_btree
xfs: enhance dinode verifier
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlrlM9IACgkQ8vlZVpUN
gaOtrQf+OhNwH0vIfCQwQG36m7DM5DNJ1eiFV6Qp++7B+nzMO7nRaXrINTKty/iO
vSK59a2lCa1Uufjii/vIiVoVjiaiRfVJJt4/WFV6jtESxnBAPTj//nFDGBoQANsR
XxTNDO9Bkra9QlWgasSiqbkyyKd2KFHf23LP1fdfspXuRFrGhu6pYqaZpbx8V0/2
j+TeLS9V8vNx/rWNmvpMK+WopapvrGYoA0YESAZJBCLMNO5uZZ+2qteORPp+Y5oQ
0d0mCLPedYy5gagHUIN4EnpwP4zNh8efQhQA16teEqs+foIHnyp7VnYYG1lJ3z+Y
bYQoQGmKKdnd6Hl+/5sLYct8yZEhXQ==
=EF3H
-----END PGP SIGNATURE-----
Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix misc bugs and a regression for ext4"
* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: add MODULE_SOFTDEP to ensure crc32c is included in the initramfs
ext4: fix bitmap position validation
ext4: set h_journal if there is a failure starting a reserved handle
ext4: prevent right-shifting extents beyond EXT_MAX_BLOCKS
-----BEGIN PGP SIGNATURE-----
iQGwBAABCAAaBQJa4LQCExxzbWZyZW5jaEBnbWFpbC5jb20ACgkQiiy9cAdyT1GQ
OAv+KPrprp+2jkoEZRgy/cJBZGmMCfyfjxM9eAZUr35FfkuMF8ir4cJ0AbbSpQOY
E+WDdeRhS9FjueIVGKGi74C9yhJpEEDPtvFCFqhGJ5/AIDBBDpC4KyCQEfJDTlGo
b+USbBcu+6bzqASN/L4fwhz1E+Q45RUb98E/IHPOKzV7qASyjYpB9CUcFaANt03k
GK+VKNJF5ppa9YRgXwoPFbD4M2B/Wfe5ZP5+9ZYnYmZnpIUFCuY1rrASuJdcwFN2
z97sGmqacR+a12FkdfZCoeK+dpsQ+ZeyKiB5sOpj+gr7apKAEmESjRlC1xpNRJ4B
TOvkOSyYaNUr94HJo8FXzs1a/j8I59Cn2ER2o8Z0f1s7QvgFxvF09AnDUPQt1OBK
197KNO6a/E1Rd+umPzKvgTIbrm7fcPYZgEWNHdbd3Hf8eBvFs8372GZXGLSkupmJ
jnBkwYbukz6KsVNEx8m/9fGhKhyJCZ34yTEKrT+aNDQ0aA4vVuP/WUa+GtNSkDuL
WX0u
=U4ba
-----END PGP SIGNATURE-----
Merge tag '4.17-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"A few security related fixes for SMB3, most importantly for SMB3.11
encryption"
* tag '4.17-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: smbd: Avoid allocating iov on the stack
cifs: smbd: Don't use RDMA read/write when signing is used
SMB311: Fix reconnect
SMB3: Fix 3.11 encryption to Windows and handle encrypted smb3 tcon
CIFS: set *resp_buf_type to NO_BUFFER on error
CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency. This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Tarick Bedeir <tarick@google.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") introduced new @first_key parameter for read_tree_block(), however
caller in replace_path() is parasing wrong key to read_tree_block().
It should use parameter @first_key other than @key.
Normally it won't expose problem as @key is normally initialzied to the
same value of @first_key we expect.
However in relocation recovery case, @key can be set to (0, 0, 0), and
since no valid key in relocation tree can be (0, 0, 0), it will cause
read_tree_block() to return -EUCLEAN and interrupt relocation recovery.
Fix it by setting @first_key correctly.
Fixes: 581c176041 ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not necessary to allocate another iov when going through the buffers
in smbd_send() through RDMA send.
Remove it to reduce stack size.
Thanks to Matt for spotting a printk typo in the earlier version of this.
CC: Matt Redfearn <matt.redfearn@mips.com>
Signed-off-by: Long Li <longli@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
SMB server will not sign data transferred through RDMA read/write. When
signing is used, it's a good idea to have all the data signed.
In this case, use RDMA send/recv for all data transfers. This will degrade
performance as this is not generally configured in RDMA environemnt. So
warn the user on signing and RDMA send/recv.
Signed-off-by: Long Li <longli@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
The preauth hash was not being recalculated properly on reconnect
of SMB3.11 dialect mounts (which caused access denied repeatedly
on auto-reconnect).
Fixes: 8bd68c6e47 ("CIFS: implement v3.11 preauth integrity")
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Currently in ext4_valid_block_bitmap() we expect the bitmap to be
positioned anywhere between 0 and s_blocksize clusters, but that's
wrong because the bitmap can be placed anywhere in the block group. This
causes false positives when validating bitmaps on perfectly valid file
system layouts. Fix it by checking whether the bitmap is within the group
boundary.
The problem can be reproduced using the following
mkfs -t ext3 -E stride=256 /dev/vdb1
mount /dev/vdb1 /mnt/test
cd /mnt/test
wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.16.3.tar.xz
tar xf linux-4.16.3.tar.xz
This will result in the warnings in the logs
EXT4-fs error (device vdb1): ext4_validate_block_bitmap:399: comm tar: bg 84: block 2774529: invalid block bitmap
[ Changed slightly for clarity and to not drop a overflow test -- TYT ]
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: 7dac4a1726 ("ext4: add validity checks for bitmap block numbers")
Cc: stable@vger.kernel.org
Temporarily disable AES-GCM, as AES-CCM is only currently
enabled mechanism on client side. This fixes SMB3.11
encrypted mounts to Windows.
Also the tree connect request itself should be encrypted if
requested encryption ("seal" on mount), in addition we should be
enabling encryption in 3.11 based on whether we got any valid
encryption ciphers back in negprot (the corresponding session flag is
not set as it is in 3.0 and 3.02)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Dan Carpenter had pointed this out a while ago, but the code around
this had changed so wasn't causing any problems since that field
was not used in this error path.
Still, it is cleaner to always initialize this field, so changing
the error path to set it.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQGwBAABCAAaBQJa2jQtExxzbWZyZW5jaEBnbWFpbC5jb20ACgkQiiy9cAdyT1Fb
AAv9FqmD7ElRfZAHBO1gFhOMojhGngf4VsdoKD7raFhkzQLJbEB/Vjjf0OQ6IeR9
EJ+yW7H7lLIifv+xFandBxhjFvg3H+HTFdOuAUk4zpffripOgyfIJeHmc0Lt3DDA
sGGsIcgl5wWXV9hEEtMGlUOoKeVfMKoDqOmIvtOUeSZ616G2NXSRM7ySnSX3KEQR
SJo+rI16P4OQbEZ3W5ElJE6otytKQF76cCoOdEuWLF5mSJfkaCiovrDw/VM2oXFF
C8Uu4cHJFDDxL2uPrYGFPGKWisi0K0G8Lq5q0nOQWC8Ajh9EgfGR667p3btEGQ0d
TPYLHaXleb45qtMUDq7JiPy3VTs5qvr5uJyQdB9s3E+M+9pq1bgQmIVymKJi9ccu
QbKospypFsz81t0DwB7/2TEVsh00Y6bcNIALTALUNntP/emeiQwLrwOkWb3PBlIf
WM0OJH1JJChvypDZ8n29wy2A2p1YIflrGghieONf2jG/dL7bUzssfMCXFxY2MVvf
GCUk
=lT4e
-----END PGP SIGNATURE-----
Merge tag '4.17-rc1-SMB3-CIFS' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Various SMB3/CIFS fixes.
There are three more security related fixes in progress that are not
included in this set but they are still being tested and reviewed, so
sending this unrelated set of smaller fixes now"
* tag '4.17-rc1-SMB3-CIFS' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: fix typo in cifs_dbg
cifs: do not allow creating sockets except with SMB1 posix exensions
cifs: smbd: Dump SMB packet when configured
cifs: smbd: Check for iov length on sending the last iov
fs: cifs: Adding new return type vm_fault_t
cifs: smb2ops: Fix NULL check in smb2_query_symlink
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrcop8ACgkQxWXV+ddt
WDuT2Q/9HZo9b1N4f+dM38P/cyBwEt9VYT0MLzpIpOid6ogpdF+1E7F6rDnUOyEQ
0Me+znhC7kgXLBnpUDS16/2uGbp9VgCZo9NSq8juLrowGYpe9asadWbCqfFBmAh6
Pc6r03Z1Z1EhXXUSm0cqJwvNHpNN6WnfBP1ithcp4OAltjVZ7IU45xHCcymfTxi+
R6/kvJlG56YXoT97on/CHF85pAidELJ7CysLBGpCHP7Yl/lkVkyPVxSZO9up6ZtS
078l0nCG9PpzONbdUcB9QGICPDcNLbBDhrS8yzQ+oajiRBz0+Im4c2UUvmZEUiMb
6EAo6QV58OFUskgsY1cHWgZGS3YIFwzpmdsRQrCyuI4Z+1KXVu+RWMv1hF1G1Dqw
8fHDm6+V2oqcoK3lSnb40UUySGw2d7PwjGxUk5iGKWISdmrxQsLcgkY/ERbjZxar
emelwAZbsPwUNYNJj8qbm1QZlmys8FACKzJqyCR6+7/n353uWE30Tnv1MgfJg0nQ
7ejk3U96R+oynN/9WOBFCLpuvPCbYaZk4eA9+3EjcSLzN81r/WtbgQDntEWi9auO
/OUADaHOXoIaUnTBNtFV9DTseDW21hQ/4NseTVVsjKGWjh2qWRoEwuqI7TEZD/4M
kGgdJL+xvD0MzR6Tq7xbhnQTcCOwH6NQ7aI4rI3abDbU9ZPuDrY=
=lgaO
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"This contains a few fixups to the qgroup patches that were merged this
dev cycle, unaligned access fix, blockgroup removal corner case fix
and a small debugging output tweak"
* tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: print-tree: debugging output enhancement
btrfs: Fix race condition between delayed refs and blockgroup removal
btrfs: fix unaligned access in readdir
btrfs: Fix wrong btrfs_delalloc_release_extents parameter
btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
btrfs: qgroup: Use independent and accurate per inode qgroup rsv
btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
Commit 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map") is
printing spurious messages under memory pressure due to map_addr == -ENOMEM.
9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already
Complain only if -EEXIST, and use %px for printing the address.
Link: http://lkml.kernel.org/r/201804182307.FAC17665.SFMOFJVFtHOLOQ@I-love.SAKURA.ne.jp
Fixes: 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map") is
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 95846ecf9d ("pid: replace pid bitmap implementation with IDR
API") changed last field of /proc/loadavg (last pid allocated) to be off
by one:
# unshare -p -f --mount-proc cat /proc/loadavg
0.00 0.00 0.00 1/60 2 <===
It should be 1 after first fork into pid namespace.
This is formally a regression but given how useless this field is I
don't think anyone is affected.
Bug was found by /proc testsuite!
Link: http://lkml.kernel.org/r/20180413175408.GA27246@avx2
Fixes: 95846ecf9d ("pid: replace pid bitmap implementation with IDR API")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gargi Sharma <gs051095@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_dump_owner() has the following code:
mm = task->mm;
if (mm) {
if (get_dumpable(mm) != SUID_DUMP_USER) {
uid = ...
}
}
Check for ->mm is buggy -- kernel thread might be borrowing mm
and inode will go to some random uid:gid pair.
Link: http://lkml.kernel.org/r/20180412220109.GA20978@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The autofs file system mkdir inode operation blindly sets the created
directory mode to S_IFDIR | 0555, ingoring the passed in mode, which can
cause selinux dac_override denials.
But the function also checks if the caller is the daemon (as no-one else
should be able to do anything here) so there's no point in not honouring
the passed in mode, allowing the daemon to set appropriate mode when
required.
Link: http://lkml.kernel.org/r/152361593601.8051.14014139124905996173.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.
unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains. Switches occur when
enough writes are issued from a new domain.
This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);
If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().
truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end
<interrupts mistakenly enabled>
<interrupt>
end_page_writeback
test_clear_page_writeback
lock_page_memcg
<deadlock>
unlock_page_memcg
Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).
If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
cd /mnt/cgroup/memory
mkdir a b
echo 1 > a/memory.move_charge_at_immigrate
echo 1 > b/memory.move_charge_at_immigrate
(
echo $BASHPID > a/cgroup.procs
while true; do
dd if=/dev/zero of=/mnt/big bs=1M count=256
done
) &
while true; do
sync
done &
sleep 1h &
SLEEP=$!
while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
done
The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel. I suggest we should to prevent future
surprises. And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146
Wang Long said "this deadlock occurs three times in our environment"
[gthelen@google.com: v4]
Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org> [v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The swap offset reported by /proc/<pid>/pagemap may be not correct for
PMD migration entries. If addr passed into pagemap_pmd_range() isn't
aligned with PMD start address, the swap offset reported doesn't
reflect this. And in the loop to report information of each sub-page,
the swap offset isn't increased accordingly as that for PFN.
This may happen after opening /proc/<pid>/pagemap and seeking to a page
whose address doesn't align with a PMD start address. I have verified
this with a simple test program.
BTW: migration swap entries have PFN information, do we need to restrict
whether to show them?
[akpm@linux-foundation.org: fix typo, per Huang, Ying]
Link: http://lkml.kernel.org/r/20180408033737.10897-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Jerome Glisse" <jglisse@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RHBZ: 1453123
Since at least the 3.10 kernel and likely a lot earlier we have
not been able to create unix domain sockets in a cifs share
when mounted using the SFU mount option (except when mounted
with the cifs unix extensions to Samba e.g.)
Trying to create a socket, for example using the af_unix command from
xfstests will cause :
BUG: unable to handle kernel NULL pointer dereference at 00000000
00000040
Since no one uses or depends on being able to create unix domains sockets
on a cifs share the easiest fix to stop this vulnerability is to simply
not allow creation of any other special files than char or block devices
when sfu is used.
Added update to Ronnie's patch to handle a tcon link leak, and
to address a buf leak noticed by Gustavo and Colin.
Acked-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
CC: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: stable@vger.kernel.org
When sending through SMB Direct, also dump the packet in SMB send path.
Also fixed a typo in debug message.
Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
This patch enhances the following things:
- tree block header
* add generation and owner output for node and leaf
- node pointer generation output
- allow btrfs_print_tree() to not follow nodes
* just like btrfs-progs
Please note that, although function btrfs_print_tree() is not called by
anyone right now, it's still a pretty useful function to debug kernel.
So that function is still kept for later use.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the delayed refs for a head are all run, eventually
cleanup_ref_head is called which (in case of deletion) obtains a
reference for the relevant btrfs_space_info struct by querying the bg
for the range. This is problematic because when the last extent of a
bg is deleted a race window emerges between removal of that bg and the
subsequent invocation of cleanup_ref_head. This can result in cache being null
and either a null pointer dereference or assertion failure.
task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
FS: 00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
btrfs_run_delayed_refs+0x68/0x250 [btrfs]
btrfs_should_end_transaction+0x42/0x60 [btrfs]
btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
evict+0xc6/0x190
do_unlinkat+0x19c/0x300
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fbf589c57a7
To fix this, introduce a new flag "is_system" to head_ref structs,
which is populated at insertion time. This allows to decouple the
querying for the spaceinfo from querying the possibly deleted bg.
Fixes: d7eae3403f ("Btrfs: rework delayed ref total_bytes_pinned accounting")
CC: stable@vger.kernel.org # 4.14+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally convered to SB_RDONLY.
Undo this change.
Fixes: e462ec50cb ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
AFS server records get removed from the net->fs_servers tree when
they're deleted, but not from the net->fs_addresses{4,6} lists, which
can lead to an oops in afs_find_server() when a server record has been
removed, for instance during rmmod.
Fix this by deleting the record from the by-address lists before posting
it for RCU destruction.
The reason this hasn't been noticed before is that the fileserver keeps
probing the local cache manager, thereby keeping the service record
alive, so the oops would only happen when a fileserver eventually gets
bored and stops pinging or if the module gets rmmod'd and a call comes
in from the fileserver during the window between the server records
being destroyed and the socket being closed.
The oops looks something like:
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
...
Workqueue: kafsd afs_process_async_call [kafs]
RIP: 0010:afs_find_server+0x271/0x36f [kafs]
...
Call Trace:
afs_deliver_cb_init_call_back_state3+0x1f2/0x21f [kafs]
afs_deliver_to_call+0x1ee/0x5e8 [kafs]
afs_process_async_call+0x5b/0xd0 [kafs]
process_one_work+0x2c2/0x504
worker_thread+0x1d4/0x2ac
kthread+0x11f/0x127
ret_from_fork+0x24/0x30
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Assorted fixes.
Some of that is only a matter with fault injection (broken handling of
small allocation failure in various mount-related places), but the
last one is a root-triggerable stack overflow, and combined with
userns it gets really nasty ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Don't leak MNT_INTERNAL away from internal mounts
mm,vmscan: Allow preallocating memory for register_shrinker().
rpc_pipefs: fix double-dput()
orangefs_kill_sb(): deal with allocation failures
jffs2_kill_sb(): deal with failed allocations
hypfs_kill_super(): deal with failed allocations
in the lower filesystem when filename encryption is enabled at the
eCryptfs layer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJa2LwFAAoJENaSAD2qAscKt00P+wSOy/iUn7wd3L/mtjN+B34l
8MaBNoruoZ87pPPHm4ZjB85zk5fSzwQl/jKgS8y6dRA1bxcwsUYRMooTIsiOBFeg
c1BEWfJVL8uq5AtmZ+czwUD2+DNn/U6S+XejXKr/bQyzOOgUtAeX8IPERPzGqJkB
dXbz21NLBpICxslIHTTHCpCWodm8B9SJSrBeKd1WIad1d+ITjB6fnyHyIlEyFxlk
yxkXLVZskR/yihIQ8JugJ4Li0nWz4KcdAvS2fbSIG5zgkAtLS9ypPAAQrfCf9pCA
BYc4Q2Ai425tgVqAwK2uHXOzy7/WsPL4yrIXt/04vTra1XcK9cv2EdojYi1cK04W
/a6xIWzFtp7HYEhuHZiVdvSTRXEOF9IYUXTw/gavE0KvMZ36k2FU9bLo4WKmx/91
atr+9p4vE6uA932YY54Wit++IjpLGN7FzXYxVVoACVz7HJ+sESqytPiLVX5iufux
/xrrbmlEpLEeJa6/F1Yboo9R7aclrMsxEz4Y+Txkk/zmjzP8wYYiYodOCJ23pauY
5MJruAxLtlcmAzW7FbES/Id+6Rfj42BuzXi3TRceiUv8NWmDTk3y205o94CfJW5U
mxEiFgb7Nxr1UsL66rZGmpnwcfJY4FFEyuXRGEmATBquEGJPTqhku9HKUkmORa7I
mYJzK27VX7dE2lAzuMpr
=bwS0
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-4.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs fixes from Tyler Hicks:
"Minor cleanups and a bug fix to completely ignore unencrypted
filenames in the lower filesystem when filename encryption is enabled
at the eCryptfs layer"
* tag 'ecryptfs-4.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: don't pass up plaintext names when using filename encryption
ecryptfs: fix spelling mistake: "cadidate" -> "candidate"
ecryptfs: lookup: Don't check if mount_crypt_stat is NULL
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlrXYi4ACgkQnJ2qBz9k
QNmp5Af9HhwGDuTgY+OcVEeZzXQ30BbFQNP4cby5LZ3Wmffio8T97c2DnlDDjr2E
l8Biq5HTm3Gg5x7WEuopyrS6L4on76RDyr1Ohf34mOoIcxdcWQ7XpiyA8jeP9uLn
9IJSPYPbA2RsQk4yr8GBRlVQh/udJfSG0vQOQ5cPylgICLsfxuMIC294vFfcgDt1
wLOy24INlJYbyECMIQKrleyoZQF37Cv7v12KPU0w3eLtuNX0rdI4gwpOS4eEb27G
2KrAsRYCb5KyFl3mRXBeCPNytiJ3ffFgvFC1Z57GVWVmNnadj/Jctw3zFirsMC2d
j7IUA4XI/T+1Gql5rCOXcSgeJ0LRZA==
=KQzD
-----END PGP SIGNATURE-----
Merge tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
- isofs memory leak fix
- two fsnotify fixes of event mask handling
- udf fix of UTF-16 handling
- couple other smaller cleanups
* tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix leak of UTF-16 surrogates into encoded strings
fs: ext2: Adding new return type vm_fault_t
isofs: fix potential memory leak in mount option parsing
MAINTAINERS: add an entry for FSNOTIFY infrastructure
fsnotify: fix typo in a comment about mark->g_list
fsnotify: fix ignore mask logic in send_to_group()
isofs compress: Remove VLA usage
fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init
fanotify: fix logic of events on child
We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
their copies. As it is, creating a deep stack of bindings of /proc/*/ns/*
somewhere in a new namespace and exiting yields a stack overflow.
Cc: stable@kernel.org
Reported-by: Alexander Aring <aring@mojatatu.com>
Bisected-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When sending the last iov that breaks into smaller buffers to fit the
transfer size, it's necessary to check if this is the last iov.
If this is the latest iov, stop and proceed to send pages.
Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
The last update to readdir introduced a temporary buffer to store the
emitted readdir data, but as there are file names of variable length,
there's a lot of unaligned access.
This was observed on a sparc64 machine:
Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]
Fixes: 23b5ec7494 ("btrfs: fix readdir deadlock with pagefault")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: René Rebe <rene@exactcode.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If ext4 tries to start a reserved handle via
jbd2_journal_start_reserved(), and the journal has been aborted, this
can result in a NULL pointer dereference. This is because the fields
h_journal and h_transaction in the handle structure share the same
memory, via a union, so jbd2_journal_start_reserved() will clear
h_journal before calling start_this_handle(). If this function fails
due to an aborted handle, h_journal will still be NULL, and the call
to jbd2_journal_free_reserved() will pass a NULL journal to
sub_reserve_credits().
This can be reproduced by running "kvm-xfstests -c dioread_nolock
generic/475".
Cc: stable@kernel.org # 3.11
Fixes: 8f7d89f368 ("jbd2: transaction reservation support")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Commit 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type
for delalloc") merged into mainline is not the latest version submitted
to mail list in Dec 2017.
It has a fatal wrong @qgroup_free parameter, which results increasing
qgroup metadata pertrans reserved space, and causing a lot of early EDQUOT.
Fix it by applying the correct diff on top of current branch.
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for
delayed inode and item") merged into mainline was not latest version
submitted to the mail list in Dec 2017.
Which lacks the following fixes:
1) Remove btrfs_qgroup_convert_reserved_meta() call in
btrfs_delayed_item_release_metadata()
2) Remove btrfs_qgroup_reserve_meta_prealloc() call in
btrfs_delayed_inode_reserve_metadata()
Those fixes will resolve unexpected EDQUOT problems.
Fixes: 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike reservation calculation used in inode rsv for metadata, qgroup
doesn't really need to care about things like csum size or extent usage
for the whole tree COW.
Qgroups care more about net change of the extent usage.
That's to say, if we're going to insert one file extent, it will mostly
find its place in COWed tree block, leaving no change in extent usage.
Or causing a leaf split, resulting in one new net extent and increasing
qgroup number by nodesize.
Or in an even more rare case, increase the tree level, increasing qgroup
number by 2 * nodesize.
So here instead of using the complicated calculation for extent
allocator, which cares more about accuracy and no error, qgroup doesn't
need that over-estimated reservation.
This patch will maintain 2 new members in btrfs_block_rsv structure for
qgroup, using much smaller calculation for qgroup rsv, reducing false
EDQUOT.
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Unlike previous method that tries to commit transaction inside
qgroup_reserve(), this time we will try to commit transaction using
fs_info->transaction_kthread to avoid nested transaction and no need to
worry about locking context.
Since it's an asynchronous function call and we won't wait for
transaction commit, unlike previous method, we must call it before we
hit the qgroup limit.
So this patch will use the ratio and size of qgroup meta_pertrans
reservation as indicator to check if we should trigger a transaction
commit. (meta_prealloc won't be cleaned in transaction committ, it's
useless anyway)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
OSTA UDF specification does not mention whether the CS0 charset in case
of two bytes per character encoding should be treated in UTF-16 or
UCS-2. The sample code in the standard does not treat UTF-16 surrogates
in any special way but on systems such as Windows which work in UTF-16
internally, filenames would be treated as being in UTF-16 effectively.
In Linux it is more difficult to handle characters outside of Base
Multilingual plane (beyond 0xffff) as NLS framework works with 2-byte
characters only. Just make sure we don't leak UTF-16 surrogates into the
resulting string when loading names from the filesystem for now.
CC: stable@vger.kernel.org # >= v4.6
Reported-by: Mingye Wang <arthur200126@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Kanda Motohiro reported that expanding a tiny xattr into a large xattr
fails on XFS because we remove the tiny xattr from a shortform fork and
then try to re-add it after converting the fork to extents format having
not removed the ATTR_REPLACE flag. This fails because the attr is no
longer present, causing a fs shutdown.
This is derived from the patch in his bug report, but we really
shouldn't ignore a nonzero retval from the remove call.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119
Reported-by: kanda.motohiro@gmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
During the "insert range" fallocate operation, i_size grows by the
specified 'len' bytes. XFS verifies that i_size + len < s_maxbytes, as
it should. But this comparison is done using the signed 'loff_t', and
'i_size + len' can wrap around to a negative value, causing the check to
incorrectly pass, resulting in an inode with "negative" i_size. This is
possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX.
ext4 and f2fs don't run into this because they set a smaller s_maxbytes.
Fix it by using subtraction instead.
Reproducer:
xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096"
Fixes: a904b1ca57 ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: <stable@vger.kernel.org> # v4.1+
Originally-From: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix signed integer addition overflow too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If xfs_bmap_extents_to_btree fails in a mode where we call
xfs_iroot_realloc(-1) to de-allocate the root, set the
format back to extents.
Otherwise we can assume we can dereference ifp->if_broot
based on the XFS_DINODE_FMT_BTREE format, and crash.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add several more validations to xfs_dinode_verify:
- For LOCAL data fork formats, di_nextents must be 0.
- For LOCAL attr fork formats, di_anextents must be 0.
- For inodes with no attr fork offset,
- format must be XFS_DINODE_FMT_EXTENTS if set at all
- di_anextents must be 0.
Thanks to dchinner for pointing out a couple related checks I had
forgotten to add.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199377
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use new return type vm_fault_t for page_mkwrite
handler.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The current code null checks variable err_buf, which is always null
when it is checked, hence utf16_path is free'd and the function
returns -ENOENT everytime it is called, making it impossible for the
execution path to reach the following code:
err_buf = err_iov.iov_base;
Fix this by null checking err_iov.iov_base instead of err_buf. Also,
notice that err_buf no longer needs to be initialized to NULL.
Addresses-Coverity-ID: 1467876 ("Logically dead code")
Fixes: 2d636199e400 ("cifs: Change SMB2_open to return an iov for the error parameter")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Both ecryptfs_filldir() and ecryptfs_readlink_lower() use
ecryptfs_decode_and_decrypt_filename() to translate lower filenames to
upper filenames. The function correctly passes up lower filenames,
unchanged, when filename encryption isn't in use. However, it was also
passing up lower filenames when the filename wasn't encrypted or
when decryption failed. Since 88ae4ab980, eCryptfs refuses to lookup
lower plaintext names when filename encryption is enabled so this
resulted in a situation where userspace would see lower plaintext
filenames in calls to getdents(2) but then not be able to lookup those
filenames.
An example of this can be seen when enabling filename encryption on an
eCryptfs mount at the root directory of an Ext4 filesystem:
$ ls -1i /lower
12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE--
11 lost+found
$ ls -1i /upper
ls: cannot access '/upper/lost+found': No such file or directory
? lost+found
12 test
With this change, the lower lost+found dentry is ignored:
$ ls -1i /lower
12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE--
11 lost+found
$ ls -1i /upper
12 test
Additionally, some potentially noisy error/info messages in the related
code paths are turned into debug messages so that the logs can't be
easily filled.
Fixes: 88ae4ab980 ("ecryptfs_lookup(): try either only encrypted or plaintext name")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Use new return type vm_fault_t for page_mkwrite,
pfn_mkwrite and fault handler.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When specifying string type mount option (e.g., iocharset)
several times in a mount, current option parsing may
cause memory leak. Hence, call kfree for previous one
in this case. Meanwhile, check memory allocation result
for it.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
syzbot is catching so many bugs triggered by commit 9ee332d99e
("sget(): handle failures of register_shrinker()"). That commit expected
that calling kill_sb() from deactivate_locked_super() without successful
fill_super() is safe, but the reality was different; some callers assign
attributes which are needed for kill_sb() after sget() succeeds.
For example, [1] is a report where sb->s_mode (which seems to be either
FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
assigned unless sget() succeeds. But it does not worth complicate sget()
so that register_shrinker() failure path can safely call
kill_block_super() via kill_sb(). Making alloc_super() fail if memory
allocation for register_shrinker() failed is much simpler. Let's avoid
calling deactivate_locked_super() from sget_userns() by preallocating
memory for the shrinker and making register_shrinker() in sget_userns()
never fail.
[1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+5a170e19c963a2e0df79@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
orangefs_fill_sb() might've failed to allocate ORANGEFS_SB(s); don't
oops in that case.
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrTQ4sACgkQxWXV+ddt
WDti3Q/+MAeqsLTjvre2RQ3ka5hNyCuVftUIBmcP3YfJbt+xZYQyaewW4Xkfi3cm
cbJE+zehzf5ag+RJhxk3OwFTvNfLGIO9asWs3b08NGUi6VzwL0/8B/iOdZPuHSAV
TrecQIBE2Tp+xax9cQEnxav34D4dUtXNaDweGjp1MIIUkDneQP/I0vlTu7vafBgX
UVxP6riL/MCs7sjTHGIPs0lv8L/fgdmo+dk5SnNuIPTOcFTQXgVrtHjw9IvbKWd4
aq+sbNWoSrhXUfllbFg/wZqDe9tWn9E2f6m/H0ThSoNdxusSVgacOjFRYh20NKLW
WGB8Amd/ItGtJwJ1CIypa7VX2U11UAi0XT7BeiK82rUNEJ6moRqFOXG861gRLoTZ
SpH8uWO+e+CogfXob1KCndn5lot4AM2ZTkCqfrjpM35Nul72PZdne0CxNlmiRupY
Fdt5GB+sg8plcMaRiYr++BbbHP5tggX1MrhLGEbx2XBs2eRdn+2Lv2I1Ig/U4NUb
Vf+xk/tFLKGOTSZlbv7SV7ekXxG/3+7gAuL7A1XMETZCwBF4L3hwyW7CgkEBb3PC
TqX8BwMaRpyp/FgW/QL6edjXZ3a64VaHIfqRPNks3lWWcCHVbzyiVPfEjx+AEJ/0
abx6jXTJhLUBPuPxEgb5rsSv62RoxPoYHqCErrG95XwnZ8rSCts=
=I0y0
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"We have queued a few more fixes (error handling, log replay,
softlockup) and the rest is SPDX updates that touche almost all files
so the diffstat is long"
* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Only check first key for committed tree blocks
btrfs: add SPDX header to Kconfig
btrfs: replace GPL boilerplate by SPDX -- sources
btrfs: replace GPL boilerplate by SPDX -- headers
Btrfs: fix loss of prealloc extents past i_size after fsync log replay
Btrfs: clean up resources during umount after trans is aborted
btrfs: Fix possible softlock on single core machines
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
-----BEGIN PGP SIGNATURE-----
iQGwBAABCAAaBQJa0mDhExxzbWZyZW5jaEBnbWFpbC5jb20ACgkQiiy9cAdyT1HA
EQv+KxZAFONG/YGBIdoyJevnSobzZSTYgsrEmox28rIvWSFr6Xkt9K6WN4lki+Td
nacByZBhyeKPp7CxdgjmiNunbGfBZJsWk5LwtgdE2jOLjM9CFq8Z7m2qmkXQ4IkJ
EoH+NWpCUvEe9nj8oaTdSelR2OHH63SG5E5dK29eOB59ixcVe7qKUG29SpJks9qd
mCdZyhYyuzh469CKGFeQ8qfWopBH+QFBs4SrpQ9d4liNtZ3W8zTsQ3ezX069AGlU
Ef2xVTT1DtlZ4/6SiXBKlXGH3jjUg67vEGHzT3ZJwfLVNhTLEpKM/hsPiduSnjS8
JqdVRdFLc2lQzr4HknhNR9yuE5wT5tZM4iZOjNDwT6Sef2Mt0QIBuYI6kb4mv6Qg
aCVv7uXUMdvSrTk0mYiO1dXYozGUC+uFhJCuaiLax583o3ihp+/INDosQyqTbavZ
amRBZpWeNysSJwwJu21RdJiM74+fasR/CDy11U94ssxplg1qE72M8DShK7kXtUIy
g9ZW
=Ah5z
-----END PGP SIGNATURE-----
Merge tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"SMB3 fixes, a few for stable, and some important cleanup work from
Ronnie of the smb3 transport code"
* tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: change validate_buf to validate_iov
cifs: remove rfc1002 hardcoded constants from cifs_discard_remaining_data()
cifs: Change SMB2_open to return an iov for the error parameter
cifs: add resp_buf_size to the mid_q_entry structure
smb3.11: replace a 4 with server->vals->header_preamble_size
cifs: replace a 4 with server->vals->header_preamble_size
cifs: add pdu_size to the TCP_Server_Info structure
SMB311: Improve checking of negotiate security contexts
SMB3: Fix length checking of SMB3.11 negotiate request
CIFS: add ONCE flag for cifs_dbg type
cifs: Use ULL suffix for 64-bit constant
SMB3: Log at least once if tree connect fails during reconnect
cifs: smb2pdu: Fix potential NULL pointer dereference
Merge yet more updates from Andrew Morton:
- various hotfixes
- kexec_file updates and feature work
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
kernel/kexec_file.c: move purgatories sha256 to common code
kernel/kexec_file.c: allow archs to set purgatory load address
kernel/kexec_file.c: remove mis-use of sh_offset field during purgatory load
kernel/kexec_file.c: remove unneeded variables in kexec_purgatory_setup_sechdrs
kernel/kexec_file.c: remove unneeded for-loop in kexec_purgatory_setup_sechdrs
kernel/kexec_file.c: split up __kexec_load_puragory
kernel/kexec_file.c: use read-only sections in arch_kexec_apply_relocations*
kernel/kexec_file.c: search symbols in read-only kexec_purgatory
kernel/kexec_file.c: make purgatory_info->ehdr const
kernel/kexec_file.c: remove checks in kexec_purgatory_load
include/linux/kexec.h: silence compile warnings
kexec_file, x86: move re-factored code to generic side
x86: kexec_file: clean up prepare_elf64_headers()
x86: kexec_file: lift CRASH_MAX_RANGES limit on crash_mem buffer
x86: kexec_file: remove X86_64 dependency from prepare_elf64_headers()
x86: kexec_file: purge system-ram walking from prepare_elf64_headers()
kexec_file,x86,powerpc: factor out kexec_file_ops functions
kexec_file: make use of purgatory optional
proc: revalidate misc dentries
mm, slab: reschedule cache_reap() on the same CPU
...
If module removes proc directory while another process pins it by
chdir'ing to it, then subsequent recreation of proc entry and all
entries down the tree will not be visible to any process until pinning
process unchdir from directory and unpins everything.
Steps to reproduce:
proc_mkdir("aaa", NULL);
proc_create("aaa/bbb", ...);
chdir("/proc/aaa");
remove_proc_entry("aaa/bbb", NULL);
remove_proc_entry("aaa", NULL);
proc_mkdir("aaa", NULL);
# inaccessible because "aaa" dentry still points
# to the original "aaa".
proc_create("aaa/bbb", ...);
Fix is to implement ->d_revalidate and ->d_delete.
Link: http://lkml.kernel.org/r/20180312201938.GA4871@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull overlayfs updates from Miklos Szeredi:
"In addition to bug fixes and cleanups there are two new features from
Amir:
- Consistent inode number support for the case when layers are not
all on the same filesystem (feature is dubbed "xino").
- Optimize overlayfs file handle decoding. This one touches the
exportfs interface to allow detecting the disconnected directory
case"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: update documentation w.r.t "xino" feature
ovl: add support for "xino" mount and config options
ovl: consistent d_ino for non-samefs with xino
ovl: consistent i_ino for non-samefs with xino
ovl: constant st_ino for non-samefs with xino
ovl: allocate anon bdev per unique lower fs
ovl: factor out ovl_map_dev_ino() helper
ovl: cleanup ovl_update_time()
ovl: add WARN_ON() for non-dir redirect cases
ovl: cleanup setting OVL_INDEX
ovl: set d->is_dir and d->opaque for last path element
ovl: Do not check for redirect if this is last layer
ovl: lookup in inode cache first when decoding lower file handle
ovl: do not try to reconnect a disconnected origin dentry
ovl: disambiguate ovl_encode_fh()
ovl: set lower layer st_dev only if setting lower st_ino
ovl: fix lookup with middle layer opaque dir and absolute path redirects
ovl: Set d->last properly during lookup
ovl: set i_ino to the value of st_ino for NFS export
When looping btrfs/074 with many cpus (>= 8), it's possible to trigger
kernel warning due to first key verification:
[ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.523830] Modules linked in:
[ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.527101] Call Trace:
[ 4239.527251] read_tree_block+0x42/0x70
[ 4239.527434] read_node_slot+0xd2/0x110
[ 4239.527632] push_leaf_right+0xad/0x1b0
[ 4239.527809] split_leaf+0x4ea/0x700
[ 4239.527988] ? leaf_space_used+0xbc/0xe0
[ 4239.528192] ? btrfs_set_lock_blocking_rw+0x99/0xb0
[ 4239.528416] btrfs_search_slot+0x8cc/0xa40
[ 4239.528605] btrfs_insert_empty_items+0x71/0xc0
[ 4239.528798] __btrfs_run_delayed_refs+0xa98/0x1680
[ 4239.529013] btrfs_run_delayed_refs+0x10b/0x1b0
[ 4239.529205] btrfs_commit_transaction+0x33/0xaf0
[ 4239.529445] ? start_transaction+0xa8/0x4f0
[ 4239.529630] btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
[ 4239.529833] btrfs_check_data_free_space+0x54/0xa0
[ 4239.530045] btrfs_delalloc_reserve_space+0x25/0x70
[ 4239.531907] btrfs_direct_IO+0x233/0x3d0
[ 4239.532098] generic_file_direct_write+0xcb/0x170
[ 4239.532296] btrfs_file_write_iter+0x2bb/0x5f4
[ 4239.532491] aio_write+0xe2/0x180
[ 4239.532669] ? lock_acquire+0xac/0x1e0
[ 4239.532839] ? __might_fault+0x3e/0x90
[ 4239.533032] do_io_submit+0x594/0x860
[ 4239.533223] ? do_io_submit+0x594/0x860
[ 4239.533398] SyS_io_submit+0x10/0x20
[ 4239.533560] ? SyS_io_submit+0x10/0x20
[ 4239.533729] do_syscall_64+0x75/0x1d0
[ 4239.533979] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 4239.534182] RIP: 0033:0x7f8519741697
The problem here is, at btree_read_extent_buffer_pages() we don't have
acquired read/write lock on that extent buffer, only basic info like
level/bytenr is reliable.
So race condition leads to such false alert.
However in current call site, it's impossible to acquire proper lock
without race window.
To fix the problem, we only verify first key for committed tree blocks
(whose generation is no larger than fs_info->last_trans_committed), so
the content of such tree blocks will not change and there is no need to
get read/write lock.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 581c176041 ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ignore mask logic in send_to_group() does not match the logic
in fanotify_should_send_event(). In the latter, a vfsmount mark ignore
mask precedes an inode mark mask and in the former, it does not.
That difference may cause events to be sent to fanotify backend for no
reason. Fix the logic in send_to_group() to match that of
fanotify_should_send_event().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
and get rid of some more calls to get_rfc1002_length()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
More cleanup of use of hardcoded 4 byte RFC1001 field size
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
and get rid of some get_rfc1002_length() in smb2
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
SMB3.11 crypto and hash contexts were not being checked strictly enough.
Add parsing and validity checking for the security contexts in the SMB3.11
negotiate response.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
The length checking for SMB3.11 negotiate request includes
"negotiate contexts" which caused a buffer validation problem
and a confusing warning message on SMB3.11 mount e.g.:
SMB2 server sent bad RFC1001 len 236 not 170
Fix the length checking for SMB3.11 negotiate to account for
the new negotiate context so that we don't log a warning on
SMB3.11 mount by default but do log warnings if lengths returned
by the server are incorrect.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
- Cleanup unnecessary function call parameters
- Fix a use-after-free bug when aborting logging intents
- Refactor filestreams state data to avoid use-after-free bug
- Fix incorrect removal of cow extents when truncating extended
attributes.
- Refactor open-coded __set_page_dirty in favor of using vfs function.
- Fix a deadlock when fstrim and fs shutdown race.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJazZ/HAAoJEPh/dxk0SrTrGeYP/Asis7MZ3TWfzsPHwK6EoH+w
q1VnKVMqjSRnE8DYmx8w3tMqny2qg9klLkT1SRRz9Htr5CER5XqIX6ZYlQpVKx2R
ycS+I1l/V/kqEmAgDgSK3DZ5uMgKHfo0w7GbKFDg69YDztN3yNBpDfUZ4VqA/Ua2
ADaeMYkJh6oBB5yZAWGyAJEM5TieqQppqyc+WLhbORyjEFreUTTmLqbzLPnceSih
rQLg0noDwZK0XqbwndYXGTNKKoQtiJalnZP18DhH4zOr+FH03i/gmlU3w+ANl3eX
IEE0TR2EkNHLZeduz7xT7ZHUOo0TaGObBK8CJFSojibQ/HooVjLcFQResd3q0coN
WTkbTuxHwMqk2IujKRTqli/saENhvFrOrm/nYTFPw9+3GpRt0iLrXPSBbeMjmsZG
XntdimPEHywjYrdW10VRH+6E6tvQiC/tl3abBuXdaEOJs1KZmPYNt42EF0ZQ5Xs5
IeDOhPLuuyUwRf12RVA9WS6xGdMR0+foMqncXZNcAzxQeAVfUNEtASNMTNIOO2H5
xD34/1ooFJnwT755VLT9U/qUd5CHtWO3AkH+9RVCpKWTaEOSY+gV6ZgmINv9npur
Vw2xnZwxysegVlu76uct/b/sy/8J3OKYIoSYwbt6Zxi0M1WF9zanJ9pKCsBYdL6A
xWoMr0X1yFGcYJozqwJU
=uGz0
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more xfs updates from Darrick Wong:
"Most of these are code cleanups, but there are a couple of notable
use-after-free bug fixes.
This series has been run through a full xfstests run over the week and
through a quick xfstests run against this morning's master, with no
major failures reported.
- clean up unnecessary function call parameters
- fix a use-after-free bug when aborting logging intents
- refactor filestreams state data to avoid use-after-free bug
- fix incorrect removal of cow extents when truncating extended
attributes.
- refactor open-coded __set_page_dirty in favor of using vfs
function.
- fix a deadlock when fstrim and fs shutdown race"
* tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
Force log to disk before reading the AGF during a fstrim
Export __set_page_dirty
xfs: only cancel cow blocks when truncating the data fork
xfs: non-scrub - remove unused function parameters
xfs: remove filestream item xfs_inode reference
xfs: fix intent use-after-free on abort
xfs: Remove "committed" argument of xfs_dir_ialloc
merge window while it's still open.
1. The first patch adds a new function to lockref: lockref_put_not_zero
2. The second patch fixes GFS2's glock dump code so it uses the new lockref
function. This fixes a problem whereby lock dumps could miss glocks.
3. I made a minor patch to update some comments and fix the lock ordering
text in our gfs2-glocks.txt Documentation file.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJaz6pdAAoJENeLYdPf93o71wMH/0cEo34xWiScRM07EgLmZZ3q
YXMvpTvrwK+9i2u8anxiX1smezHeS+7jPrYOG8AGu3IZvKYGTDOwoIY9pxESy5gs
1Rf60s6pPE/dkTSqPaNNuBxPrM1yVyRWOPx04LxC5BCXhsS/6U2RS9ElxGDe7Nyq
P66z1wfm63+erDR7mKSuOL3Ejtglj2EPcrAupaBlRS0wjdUQ9ORyrZBpT6JMOWqd
HWjchrzWVAqx+iyLHlKZjTyPHsPaUBaj1fuv/Vcgu5sJmEJ9mF4s/GQTdwIzi8ip
ByD7MfilyrT7dxRm1uw8OJ7TvqNeaCtxsyNGGBOlSx81s/pk5Vhs8bevnczNvi8=
=jWsi
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.17.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull more gfs2 updates from Bob Peterson:
"We decided to request the latest three patches to be merged into this
merge window while it's still open.
- The first patch adds a new function to lockref:
lockref_put_not_zero
- The second patch fixes GFS2's glock dump code so it uses the new
lockref function. This fixes a problem whereby lock dumps could
miss glocks.
- I made a minor patch to update some comments and fix the lock
ordering text in our gfs2-glocks.txt Documentation file"
* tag 'gfs2-4.17.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Minor improvements to comments and documentation
gfs2: Stop using rhashtable_walk_peek
lockref: Add lockref_put_not_zero
Stable bugfixes:
- xprtrdma: Fix corner cases when handling device removal # v4.12+
- xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
Features:
- New sunrpc tracepoint for RPC pings
- Finer grained NFSv4 attribute checking
- Don't unnecessarily return NFS v4 delegations
Other bugfixes and cleanups:
- Several other small NFSoRDMA cleanups
- Improvements to the sunrpc RTT measurements
- A few sunrpc tracepoint cleanups
- Various fixes for NFS v4 lock notifications
- Various sunrpc and NFS v4 XDR encoding cleanups
- Switch to the ida_simple API
- Fix NFSv4.1 exclusive create
- Forget acl cache after setattr operation
- Don't advance the nfs_entry readdir cookie if xdr decoding fails
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlrNG1IACgkQ18tUv7Cl
QOvotw//fQoUgQ/AOJGlZo/4ws2mGJN3dfwwKM8xYOnHaxppOYubZRHwvswK8d22
+XR/Q6IVbUxI3mJluv1L0d9CJT06s3c9CO90McIJbk4CWihGP19bNIY4JiPlzrbv
4FDiyOvMBej2UXbHX5EzKj0srxyBoEVf3iUAIa6DaHi3c6EIUo6fP3d2eRNJStqd
WMyZs+nqr2W9biyClxntT7l/Sk+o+4I7M3Oo9pjjS+PiePYdaMrL5T1kPeHaJshF
GMGXkbvVdqpDRiXX84R9+2/nuSiA15eEnaR94UNvs84oLR3qob3ZhxhudqFdSPrX
RS6E7m34gY/EaQm/wbB26PZm+3jHd4Pqm5SKLbyFfoCmG6oMwBvXNRJZas1DFaHM
CMOECvfAr6kixVLkAN0MNQ2Ku/FuJ52OLP1dRLmxsblocnhEPujc6RSz6Ju/v3a0
adbpmJMA2IoSGgXMu3g1VGnjHfMj7ZmjtpigXVvlcUqQGCL7t4ngh23cpeTQeJ76
bMwSHUQu18NbmtJjBTE+PIm7mdCrpQD7ZuOPWpK62zxLYUnnv7nm75m84DrDru7d
XAmrCmdUJNrVWQs6BAtCXgO4PZ6xNGLosb0xTQXTAQYftc+DRJ9SW/VGc0Mp1L9m
0G0iz++b8cy4Pih5UCDJcCkpjCIvHLcn72zn1kbufWqG3xr2koc=
=IlWo
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- xprtrdma: Fix corner cases when handling device removal # v4.12+
- xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
Features:
- New sunrpc tracepoint for RPC pings
- Finer grained NFSv4 attribute checking
- Don't unnecessarily return NFS v4 delegations
Other bugfixes and cleanups:
- Several other small NFSoRDMA cleanups
- Improvements to the sunrpc RTT measurements
- A few sunrpc tracepoint cleanups
- Various fixes for NFS v4 lock notifications
- Various sunrpc and NFS v4 XDR encoding cleanups
- Switch to the ida_simple API
- Fix NFSv4.1 exclusive create
- Forget acl cache after setattr operation
- Don't advance the nfs_entry readdir cookie if xdr decoding fails"
* tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits)
NFS: advance nfs_entry cookie only after decoding completes successfully
NFSv3/acl: forget acl cache after setattr
NFSv4.1: Fix exclusive create
NFSv4: Declare the size up to date after it was set.
nfs: Use ida_simple API
NFSv4: Fix the nfs_inode_set_delegation() arguments
NFSv4: Clean up CB_GETATTR encoding
NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
NFSv4: Add a helper to encode/decode struct timespec
NFSv4: Clean up encode_attrs
NFSv4; Clean up XDR encoding of type bitmap4
NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
SUNRPC: Add a helper for encoding opaque data inline
SUNRPC: Add helpers for decoding opaque and string types
NFSv4: Ignore change attribute invalidations if we hold a delegation
NFS: More fine grained attribute tracking
NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
NFS: Don't force a revalidation of all attributes if change is missing
NFS: Convert NFS_INO_INVALID flags to unsigned long
...
Pull vfs thaw updates from Al Viro:
"An ancient series that has fallen through the cracks in the previous
cycle"
* 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
buffer.c: call thaw_super during emergency thaw
vfs: factor sb iteration out of do_emergency_remount
Pull AFS updates from Al Viro:
"The AFS series posted by dhowells depended upon lookup_one_len()
rework; now that prereq is in the mainline, that series had been
rebased on top of it and got some exposure and testing..."
* 'afs-dh' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Do better accretion of small writes on newly created content
afs: Add stats for data transfer operations
afs: Trace protocol errors
afs: Locally edit directory data for mkdir/create/unlink/...
afs: Adjust the directory XDR structures
afs: Split the directory content defs into a header
afs: Fix directory handling
afs: Split the dynroot stuff out and give it its own ops tables
afs: Keep track of invalid-before version for dentry coherency
afs: Rearrange status mapping
afs: Make it possible to get the data version in readpage
afs: Init inode before accessing cache
afs: Introduce a statistics proc file
afs: Dump bad status record
afs: Implement @cell substitution handling
afs: Implement @sys substitution handling
afs: Prospectively look up extra files when doing a single lookup
afs: Don't over-increment the cell usage count when pinning it
afs: Fix checker warnings
vfs: Remove the const from dir_context::actor
This patch simply fixes some comments and the gfs2-glocks.txt file:
Places where i_rwsem was called i_mutex, and adding i_rw_mutex.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function rhashtable_walk_peek is problematic because there is no
guarantee that the glock previously returned still exists; when that key
is deleted, rhashtable_walk_peek can end up returning a different key,
which will cause an inconsistent glock dump. Fix this by keeping track
of the current glock in the seq file iterator functions instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
During the "insert range" fallocate operation, extents starting at the
range offset are shifted "right" (to a higher file offset) by the range
length. But, as shown by syzbot, it's not validated that this doesn't
cause extents to be shifted beyond EXT_MAX_BLOCKS. In that case
->ee_block can wrap around, corrupting the extent tree.
Fix it by returning an error if the space between the end of the last
extent and EXT4_MAX_BLOCKS is smaller than the range being inserted.
This bug can be reproduced by running the following commands when the
current directory is on an ext4 filesystem with a 4k block size:
fallocate -l 8192 file
fallocate --keep-size -o 0xfffffffe000 -l 4096 -n file
fallocate --insert-range -l 8192 file
Then after unmounting the filesystem, e2fsck reports corruption.
Reported-by: syzbot+06c885be0edcdaeab40c@syzkaller.appspotmail.com
Fixes: 331573febb ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Unify the include protection macros to match the file names.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents but
after a power failure we replay them and then immediately drop them.
This behaviour happens since about 2009, commit c71bf099ab ("Btrfs:
Avoid orphan inodes cleanup while replaying log"), because it marks
the inode as an orphan instead of dropping any extents beyond i_size
before replaying logged extents, so after the log replay, and while
the mount operation is still ongoing, we find the inode marked as an
orphan and then perform a truncation (drop extents beyond the inode's
i_size). Because the processing of orphan inodes is still done
right after replaying the log and before the mount operation finishes,
the intention of that commit does not make any sense (at least as
of today). However reverting that behaviour is not enough, because
we can not simply discard all extents beyond i_size and then replay
logged extents, because we risk dropping extents beyond i_size created
in past transactions, for example:
add prealloc extent beyond i_size
fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
transaction commit
add another prealloc extent beyond i_size
fsync - triggers the fast fsync path
power failure
In that scenario, we would drop the first extent and then replay the
second one. To fix this just make sure that all prealloc extents
beyond i_size are logged, and if we find too many (which is far from
a common case), fallback to a full transaction commit (like we do when
logging regular extents in the fast fsync path).
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
$ sync
$ xfs_io -c "falloc -k 256K 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
<power failure>
# mount to replay log
$ mount /dev/sdb /mnt
# at this point the file only has one extent, at offset 0, size 256K
A test case for fstests follows soon, covering multiple scenarios that
involve adding prealloc extents with previous shrinking truncates and
without such truncates.
Fixes: c71bf099ab ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if some fatal errors occur, like all IO get -EIO, resources
would be cleaned up when
a) transaction is being committed or
b) BTRFS_FS_STATE_ERROR is set
However, in some rare cases, resources may be left alone after transaction
gets aborted and umount may run into some ASSERT(), e.g.
ASSERT(list_empty(&block_group->dirty_list));
For case a), in btrfs_commit_transaciton(), there're several places at the
beginning where we just call btrfs_end_transaction() without cleaning up
resources. For case b), it is possible that the trans handle doesn't have
any dirty stuff, then only trans hanlde is marked as aborted while
BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
all resources won't stay in memory after umount.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With mount option "xino=on", mounter declares that there are enough
free high bits in underlying fs to hold the layer fsid.
If overlayfs does encounter underlying inodes using the high xino
bits reserved for layer fsid, a warning will be emitted and the original
inode number will be used.
The mount option name "xino" goes after a similar meaning mount option
of aufs, but in overlayfs case, the mapping is stateless.
An example for a use case of "xino=on" is when upper/lower is on an xfs
filesystem. xfs uses 64bit inode numbers, but it currently never uses the
upper 8bit for inode numbers exposed via stat(2) and that is not likely to
change in the future without user opting-in for a new xfs feature. The
actual number of unused upper bit is much larger and determined by the xfs
filesystem geometry (64 - agno_log - agblklog - inopblog). That means
that for all practical purpose, there are enough unused bits in xfs
inode numbers for more than OVL_MAX_STACK unique fsid's.
Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode
numbers are allocated sequentially since boot, so they will practially
never use the high inode number bits.
For compatibility with applications that expect 32bit inodes, the feature
can be disabled with "xino=off". The option "xino=auto" automatically
detects underlying filesystem that use 32bit inodes and enables the
feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of
the same name, determine if the default mode for overlayfs mount is
"xino=auto" or "xino=off".
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, relax non-samefs constraint for consistent d_ino and always
iterate non-merge dir using ovl_fill_real() actor so we can remap lower
inode numbers to unique lower fs range.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, set i_ino value to the same value as st_ino for nfsd
readdirplus validator.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>