Commit Graph

41353 Commits

Author SHA1 Message Date
Jaegeuk Kim
244f4fc1c5 f2fs: set cached_en after checking finally
This patch relocates cached_en not only to be covered by spin_lock, but also
to set once after checking out completely.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:54 -07:00
Jaegeuk Kim
cbe91923a9 f2fs: update on-disk extents even under extent_cache
Previously, f2fs_update_extent_cache() updates in-memory extent_cache all the
time, and then finally preserves its up-to-date extent into on-disk one during
f2fs_evict_inode.

But, in the following scenario:

1. mount
2. open & write an extent X
3. f2fs_evict_inode; on-disk extent is X
4. open & update the extent X with Y
5. sync; trigger checkpoint
6. power-cut

after power-on, f2fs should serve extent Y, but we have an on-disk extent X.

This causes a failure on xfstests/311.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:54 -07:00
Jaegeuk Kim
7a2cb67867 f2fs: fix wrong block address calculation for a split extent
This patch fixes wrong calculation on block address field when an extent is
split.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:54 -07:00
Jaegeuk Kim
97a7b2c274 f2fs: convert inline_data for various fallocate
For newly added fallocate types, it should convert inline_data before handling
block swapping.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:53 -07:00
Jaegeuk Kim
c9b63bd01d f2fs: avoid to use failed inode immediately
Before iput is called, the inode number used by a bad inode can be reassigned
to other new inode, resulting in any abnormal behaviors on the new inode.
This should not happen for the new inode.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:53 -07:00
Jaegeuk Kim
eca616f8c1 f2fs: avoid freed stat information
The write_checkpoint can update stat information, so we should destroy the stat
structure after it.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:53 -07:00
Chao Yu
5ac9f36fca f2fs: fix to record dirty page count for symlink
Dirty page can be exist in mapping of newly created symlink, but previously
we did not maintain the counting of dirty page for symlink like we maintained
for regular/directory, so the counting we lookuped should be wrong.

This patch adds missed dirty page counting for symlink to fix this issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:52 -07:00
Markus Elfring
92859a5efd f2fs crypto: delete an unnecessary check before the function call "key_put"
The key_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:52 -07:00
Linus Torvalds
7e884479bf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "There are two critical regression fixes for CephFS from Zheng, and an
  RBD completion fix for layered images from Ilya"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix copyup completion race
  ceph: always re-send cap flushes when MDS recovers
  ceph: fix ceph_encode_locks_to_buffer()
2015-08-03 11:09:07 -07:00
Linus Torvalds
01183609ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fix from Al Viro:
 "Spurious ENOTDIR fix"

This should fix the problems reported by Dominique Martinet and Hugh
Dickins.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  link_path_walk(): be careful when failing with ENOTDIR
2015-08-01 17:42:14 -07:00
Al Viro
97242f99a0 link_path_walk(): be careful when failing with ENOTDIR
In RCU mode we might end up with dentry evicted just we check
that it's a directory.  In such case we should return ECHILD
rather than ENOTDIR, so that pathwalk would be retries in non-RCU
mode.

Breakage had been introduced in commit b18825a - prior to that
we were looking at nd->inode, which had been fetched before
verifying that ->d_seq was still valid.  That form of check
would only be satisfied if at some point the pathname prefix
would indeed have resolved to a non-directory.  The fix consists
of checking ->d_seq after we'd run into a non-directory dentry,
and failing with ECHILD in case of mismatch.

Note that all branches since 3.12 have that problem...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-01 20:18:38 -04:00
Linus Torvalds
acea568fa9 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe fixed up a hard to trigger ENOSPC regression from our merge
  window pull, and we have a few other smaller fixes"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix quick exhaustion of the system array in the superblock
  btrfs: its btrfs_err() instead of btrfs_error()
  btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
  btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
2015-07-31 17:05:37 -07:00
Yan, Zheng
fc927cd32f ceph: always re-send cap flushes when MDS recovers
commit e548e9b93d makes the kclient
only re-send cap flush once during MDS failover. If the kclient sends
a cap flush after MDS enters reconnect stage but before MDS recovers.
The kclient will skip re-sending the same cap flush when MDS recovers.

This causes problem for newly created inode. The MDS handles cap
flushes before replaying unsafe requests, so it's possible that MDS
find corresponding inode is missing when handling cap flush. The fix
is reverting to old behaviour: always re-send when MDS recovers

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-07-31 11:38:53 +03:00
Yan, Zheng
f6762cb2ca ceph: fix ceph_encode_locks_to_buffer()
posix locks should be in ctx->flc_posix list

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-07-31 11:38:47 +03:00
Linus Torvalds
8400935737 xfs: updates for 4.2-rc4
- remote attribute log recovery corruption fixes
 - DAX page faults need to use direct mappings, not a page cache
   mapping.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJVutbbAAoJEK3oKUf0dfode50QANILLE7Rq9rKTp9ogHZKxTfv
 Y3SJLCpwsuZ9GQuvqgRk2MbZDgNsQxbKg4lCVlf+lbXWDoKINkLw73g1qPmEMA8q
 3EQ+GNYcvsoGKlyIRod6HMuqIa9zIfremj+qk4/PCDhD50IjI/1QH5LqVz9lnaas
 3N1zuU1t5dJvmUMPdLUvj46s51O9stiKTN0gbLnv5CEbTbyH1PiyqGubecJkcFVU
 3oEdp0PyxeoxUgUWLh24tgQXzeLQTR/95viJsUILcdPBd3geCBbW3/pgke5Rg/3+
 G/8aowNtDvNrg8sT6FuDgSR8kYudEssleQnfWO7rX7yigKLzSnNpD7m4779ZmEXP
 ey32NVbXwslg92AYwM9A8EqzGInm+cMDWcRu22WpHKzWs1wpVlpdkJXbP1SE9fKm
 pXVw1bT4XNArPjew/cclLMI2T0Tc6LMTUR5WxcF1N8RCTvdTm2qf+l43zXYFcBvX
 IxaD6U8z7tyShbgG2xllM3D/ANznvIs4pAb9zK6vJK7ZcfCgTVY/2YkHZeoKGKdG
 2Mo1SUQpaG8Xhm53IDfSLtEmTp9eQAdDooHCwjugsLXe9A65787EIabTK9oTsNG1
 W90CBQzyrIliQ9v291eAB/e5gRmExVxyV9VnVPMw3jXJ9mwxyfKOEd9mW+pv8RfH
 x3mj0CxjO14vQrrfCLjd
 =kjas
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
 "There are a couple of recently found, long standing remote attribute
  corruption fixes caused by log recovery getting confused after a
  crash, and the new DAX code in XFS (merged in 4.2-rc1) needs to
  actually use the DAX fault path on read faults.

  Summary:

   - remote attribute log recovery corruption fixes

   - DAX page faults need to use direct mappings, not a page cache
     mapping"

* tag 'xfs-for-linus-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: remote attributes need to be considered data
  xfs: remote attribute headers contain an invalid LSN
  xfs: call dax_fault on read page faults for DAX
2015-07-30 20:36:49 -07:00
Dave Chinner
df150ed102 xfs: remote attributes need to be considered data
We don't log remote attribute contents, and instead write them
synchronously before we commit the block allocation and attribute
tree update transaction. As a result we are writing to the allocated
space before the allcoation has been made permanent.

As a result, we cannot consider this allocation to be a metadata
allocation. Metadata allocation can take blocks from the free list
and so reuse them before the transaction that freed the block is
committed to disk. This behaviour is perfectly fine for journalled
metadata changes as log recovery will ensure the free operation is
replayed before the overwrite, but for remote attribute writes this
is not the case.

Hence we have to consider the remote attribute blocks to contain
data and allocate accordingly. We do this by dropping the
XFS_BMAPI_METADATA flag from the block allocation. This means the
allocation will not use blocks that are on the busy list without
first ensuring that the freeing transaction has been committed to
disk and the blocks removed from the busy list. This ensures we will
never overwrite a freed block without first ensuring that it is
really free.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29 11:48:02 +10:00
Dave Chinner
e3c32ee9e3 xfs: remote attribute headers contain an invalid LSN
In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it.  It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains 
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29 11:48:01 +10:00
Dave Chinner
b2442c5a7f xfs: call dax_fault on read page faults for DAX
When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.

Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29 11:48:00 +10:00
Linus Torvalds
d8132e08d2 NFS client bugfixes for Linux 4.2
Highlights include:
 
 Stable patches:
 - Fix a situation where the client uses the wrong (zero) stateid.
 - Fix a memory leak in nfs_do_recoalesce
 
 Bugfixes:
 - Plug a memory leak when ->prepare_layoutcommit fails
 - Fix an Oops in the NFSv4 open code
 - Fix a backchannel deadlock
 - Fix a livelock in sunrpc when sendmsg fails due to low memory availability
 - Don't revalidate the mapping if both size and change attr are up to date
 - Ensure we don't miss a file extension when doing pNFS
 - Several fixes to handle NFSv4.1 sequence operation status bits correctly
 - Several pNFS layout return bugfixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVt6RGAAoJEGcL54qWCgDyiDIP/2+fUM7Tc1llCxYbM2WLC6Ar
 34v5yVwO96MqhI4L2mXB5FJvr4LP2/EZ4ZExMcf4ymT7pgJnjFK4nEv9IHUSy6xb
 ea+oS9GjvFSeGdkukJLRniNER5/ZG3GWkojlHNJCgByoIVRK4ISXF/qL9w2sedGw
 +5ejvjqie9NmBnBXMq8DRlU+kXhVYCF6E9qWATwUNK5Eq2eeQnDbA2w9ACSBVK3W
 LhCvZi0eBq7krSbHob018PmlQ0VPvmYwk5xL4d//FvcaNj/utk82VjAZCdKOK1sH
 qn8hcKgVeVko/3jwcUp6m3zAkKZ1IX/XaXJeHbosnKG/g0vy3hQirpa/g2iDTQ4H
 NXOSwcsd6syReZDZbQTxbvaSOp5ACxZAQKYLnlPerJ/hMpXDQCEAwyeAFKzEaKz4
 FfF0VJF+30w9PJk3wgk2DF66xbYVfHyvrLtVcb/ki8gb91cH09i+nFFSSfHQBMLh
 +ciHg7rOyXnbXoCaW9fBvONz2sCYDwbHATmhpWWZIx/3UTDf5owxHFa3BFDgGKnD
 jyiPjMh6I3JUE+Qm1zwInsfsskBKRSl2BdJgTHBGY5ODuQGF/sogOmvgbrT7Ox3t
 kbL8nzCydqLixM+4aw61nYakZqgDsKNER5Ggr+lkv4AZ2dH6IeP2IZjuoHLLylvZ
 dyqHwpCjoUtmYAUr166U
 =wlUD
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a situation where the client uses the wrong (zero) stateid.
   - Fix a memory leak in nfs_do_recoalesce

  Bugfixes:
   - Plug a memory leak when ->prepare_layoutcommit fails
   - Fix an Oops in the NFSv4 open code
   - Fix a backchannel deadlock
   - Fix a livelock in sunrpc when sendmsg fails due to low memory
     availability
   - Don't revalidate the mapping if both size and change attr are up to
     date
   - Ensure we don't miss a file extension when doing pNFS
   - Several fixes to handle NFSv4.1 sequence operation status bits
     correctly
   - Several pNFS layout return bugfixes"

* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
  nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
  nfs: plug memory leak when ->prepare_layoutcommit fails
  SUNRPC: Report TCP errors to the caller
  sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
  NFSv4.2: handle NFS-specific llseek errors
  NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
  NFS: Fix a memory leak in nfs_do_recoalesce
  NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
  NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
  NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
  NFS: Don't revalidate the mapping if both size and change attr are up to date
  NFSv4/pnfs: Ensure we don't miss a file extension
  NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
  SUNRPC: xprt_complete_bc_request must also decrement the free slot count
  SUNRPC: Fix a backchannel deadlock
  pNFS: Don't throw out valid layout segments
  pNFS: pnfs_roc_drain() fix a race with open
  pNFS: Fix races between return-on-close and layoutreturn.
  pNFS: pnfs_roc_drain should return 'true' when sleeping
  pNFS: Layoutreturn must invalidate all existing layout segments.
  ...
2015-07-28 09:37:44 -07:00
Kinglong Mee
a49c269111 nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
An oops caused by using other thread's stack space in sunrpc ASYNC sending thread.

[ 9839.007187] ------------[ cut here ]------------
[ 9839.007923] kernel BUG at fs/nfs/nfs4xdr.c:910!
[ 9839.008069] invalid opcode: 0000 [#1] SMP
[ 9839.008069] Modules linked in: blocklayoutdriver rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm joydev iosf_mbi crct10dif_pclmul snd_timer crc32_pclmul crc32c_intel ghash_clmulni_intel snd soundcore ppdev pvpanic parport_pc i2c_piix4 serio_raw virtio_balloon parport acpi_cpufreq nfsd nfs_acl lockd grace auth_rpcgss sunrpc qxl drm_kms_helper virtio_net virtio_console virtio_blk ttm drm virtio_pci virtio_ring virtio ata_generic pata_acpi
[ 9839.008069] CPU: 0 PID: 308 Comm: kworker/0:1H Not tainted 4.0.0-0.rc4.git1.3.fc23.x86_64 #1
[ 9839.008069] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9839.008069] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 9839.008069] task: ffff8800d8b4d8e0 ti: ffff880036678000 task.ti: ffff880036678000
[ 9839.008069] RIP: 0010:[<ffffffffa0339cc9>]  [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4]
[ 9839.008069] RSP: 0018:ffff88003667ba58  EFLAGS: 00010246
[ 9839.008069] RAX: 0000000000000000 RBX: 000000001fc15e18 RCX: ffff8800c0193800
[ 9839.008069] RDX: ffff8800e4ae3f24 RSI: 000000001fc15e2c RDI: ffff88003667bcd0
[ 9839.008069] RBP: ffff88003667ba58 R08: ffff8800d9173008 R09: 0000000000000003
[ 9839.008069] R10: ffff88003667bcd0 R11: 000000000000000c R12: 0000000000010000
[ 9839.008069] R13: ffff8800d9173350 R14: 0000000000000000 R15: ffff8800c0067b98
[ 9839.008069] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[ 9839.008069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9839.008069] CR2: 00007f988c9c8bb0 CR3: 00000000d99b6000 CR4: 00000000000407f0
[ 9839.008069] Stack:
[ 9839.008069]  ffff88003667bbc8 ffffffffa03412c5 00000000c6c55680 ffff880000000003
[ 9839.008069]  0000000000000088 00000010c6c55680 0001000000000002 ffffffff816e87e9
[ 9839.008069]  0000000000000000 00000000477290e2 ffff88003667bab8 ffffffff81327ba3
[ 9839.008069] Call Trace:
[ 9839.008069]  [<ffffffffa03412c5>] encode_attrs+0x435/0x530 [nfsv4]
[ 9839.008069]  [<ffffffff816e87e9>] ? inet_sendmsg+0x69/0xb0
[ 9839.008069]  [<ffffffff81327ba3>] ? selinux_socket_sendmsg+0x23/0x30
[ 9839.008069]  [<ffffffff8164c1df>] ? do_sock_sendmsg+0x9f/0xc0
[ 9839.008069]  [<ffffffff8164c278>] ? kernel_sendmsg+0x58/0x70
[ 9839.008069]  [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc]
[ 9839.008069]  [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc]
[ 9839.008069]  [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
[ 9839.008069]  [<ffffffffa03419a5>] encode_open+0x2d5/0x340 [nfsv4]
[ 9839.008069]  [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
[ 9839.008069]  [<ffffffffa011ab89>] ? xdr_encode_opaque+0x19/0x20 [sunrpc]
[ 9839.008069]  [<ffffffffa0339cfb>] ? encode_string+0x2b/0x40 [nfsv4]
[ 9839.008069]  [<ffffffffa0341bf3>] nfs4_xdr_enc_open+0xb3/0x140 [nfsv4]
[ 9839.008069]  [<ffffffffa0110a4c>] rpcauth_wrap_req+0xac/0xf0 [sunrpc]
[ 9839.008069]  [<ffffffffa01017db>] call_transmit+0x18b/0x2d0 [sunrpc]
[ 9839.008069]  [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc]
[ 9839.008069]  [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc]
[ 9839.008069]  [<ffffffffa010caa0>] __rpc_execute+0x90/0x460 [sunrpc]
[ 9839.008069]  [<ffffffffa010ce85>] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 9839.008069]  [<ffffffff810b452b>] process_one_work+0x1bb/0x410
[ 9839.008069]  [<ffffffff810b47d3>] worker_thread+0x53/0x470
[ 9839.008069]  [<ffffffff810b4780>] ? process_one_work+0x410/0x410
[ 9839.008069]  [<ffffffff810b4780>] ? process_one_work+0x410/0x410
[ 9839.008069]  [<ffffffff810ba7b8>] kthread+0xd8/0xf0
[ 9839.008069]  [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180
[ 9839.008069]  [<ffffffff81786418>] ret_from_fork+0x58/0x90
[ 9839.008069]  [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180
[ 9839.008069] Code: 00 00 48 c7 c7 21 fa 37 a0 e8 94 1c d6 e0 c6 05 d2 17 05 00 01 8b 03 eb d7 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 89 f3
[ 9839.008069] RIP  [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4]
[ 9839.008069]  RSP <ffff88003667ba58>
[ 9839.071114] ---[ end trace cc14c03adb522e94 ]---

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-28 09:07:03 -04:00
Jeff Layton
3471648a75 nfs: plug memory leak when ->prepare_layoutcommit fails
"data" is currently leaked when the prepare_layoutcommit operation
returns an error. Put the cred before taking the spinlock in that
case, take the lock and then goto out_unlock which will drop the
lock and then free "data".

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-28 09:07:02 -04:00
J. Bruce Fields
bdcc2cd14e NFSv4.2: handle NFS-specific llseek errors
Handle NFS-specific llseek errors instead of letting them leak out to
userspace.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 11:16:25 -04:00
Trond Myklebust
d4c30454db NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
Recoalescing does not affect whether or not we've already sent off
I/O, and doing so means that we end up sending a bunch of synchronous
for cases where we actually need to be using unstable writes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 10:33:12 -04:00
Trond Myklebust
03d5eb65b5 NFS: Fix a memory leak in nfs_do_recoalesce
If the function exits early, then we must put those requests that were
not processed back onto the &mirror->pg_list so they can be cleaned up
by nfs_pgio_error().

Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 10:33:08 -04:00
Jaegeuk Kim
6282adbf93 f2fs: call set_page_dirty to attach i_wb for cgroup
The cgroup attaches inode->i_wb via mark_inode_dirty and when set_page_writeback
is called, __inc_wb_stat() updates i_wb's stat.

So, we need to explicitly call set_page_dirty->__mark_inode_dirty in prior to
any writebacking pages.

This patch should resolve the following kernel panic reported by Andreas Reis.

https://bugzilla.kernel.org/show_bug.cgi?id=101801

--- Comment #2 from Andreas Reis <andreas.reis@gmail.com> ---
BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
IP: [<ffffffff8149deea>] __percpu_counter_add+0x1a/0x90
PGD 2951ff067 PUD 2df43f067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 7 PID: 10356 Comm: gcc Tainted: G        W       4.2.0-1-cu #1
Hardware name: Gigabyte Technology Co., Ltd. G1.Sniper M5/G1.Sniper M5, BIOS
T01 02/03/2015
task: ffff880295044f80 ti: ffff880295140000 task.ti: ffff880295140000
RIP: 0010:[<ffffffff8149deea>]  [<ffffffff8149deea>]
__percpu_counter_add+0x1a/0x90
RSP: 0018:ffff880295143ac8  EFLAGS: 00010082
RAX: 0000000000000003 RBX: ffffea000a526d40 RCX: 0000000000000001
RDX: 0000000000000020 RSI: 0000000000000001 RDI: 0000000000000088
RBP: ffff880295143ae8 R08: 0000000000000000 R09: ffff88008f69bb30
R10: 00000000fffffffa R11: 0000000000000000 R12: 0000000000000088
R13: 0000000000000001 R14: ffff88041d099000 R15: ffff880084a205d0
FS:  00007f8549374700(0000) GS:ffff88042f3c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000a8 CR3: 000000033e1d5000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 0000000000000000 ffffea000a526d40 ffff880084a20738 ffff880084a20750
 ffff880295143b48 ffffffff811cc91e ffff880000000000 0000000000000296
 0000000000000000 ffff880417090198 0000000000000000 ffffea000a526d40
Call Trace:
 [<ffffffff811cc91e>] __test_set_page_writeback+0xde/0x1d0
 [<ffffffff813fee87>] do_write_data_page+0xe7/0x3a0
 [<ffffffff813faeea>] gc_data_segment+0x5aa/0x640
 [<ffffffff813fb0b8>] do_garbage_collect+0x138/0x150
 [<ffffffff813fb3fe>] f2fs_gc+0x1be/0x3e0
 [<ffffffff81405541>] f2fs_balance_fs+0x81/0x90
 [<ffffffff813ee357>] f2fs_unlink+0x47/0x1d0
 [<ffffffff81239329>] vfs_unlink+0x109/0x1b0
 [<ffffffff8123e3d7>] do_unlinkat+0x287/0x2c0
 [<ffffffff8123ebc6>] SyS_unlink+0x16/0x20
 [<ffffffff81942e2e>] entry_SYSCALL_64_fastpath+0x12/0x71
Code: 41 5e 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 49
89 f5 41 54 49 89 fc 53 48 83 ec 08 65 ff 05 e6 d9 b6 7e <48> 8b 47 20 48 63 ca
65 8b 18 48 63 db 48 01 f3 48 39 cb 7d 0a
RIP  [<ffffffff8149deea>] __percpu_counter_add+0x1a/0x90
 RSP <ffff880295143ac8>
CR2: 00000000000000a8
---[ end trace 5132449a58ed93a3 ]---
note: gcc[10356] exited with preempt_count 2

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-07-25 08:54:26 -07:00
Jaegeuk Kim
548aedac51 f2fs: handle error cases in move_encrypted_block
This patch fixes some missing error handlers.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-07-25 08:54:25 -07:00
Linus Torvalds
33b40178cb Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Four smaller fixes for the current series.  This contains:

   - A fix for clones of discard bio's, that can cause data corruption.
     From Martin.

   - A fix for null_blk, where in certain queue modes it could access a
     request after it had been freed.  From Mike Krinkin.

   - An error handling leak fix for blkcg, from Tejun.

   - Also from Tejun, export of the functions that a file system needs
     to implement cgroup writeback support"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: Do a full clone when splitting discard bios
  block: export bio_associate_*() and wbc_account_io()
  blkcg: fix gendisk reference leak in blkg_conf_prep()
  null_blk: fix use-after-free problem
2015-07-24 17:00:04 -07:00
Linus Torvalds
45b4b782e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "While reading through the code of detach_mounts I realized the code
  was slightly off.  Testing it revealed two buggy corner cases that can
  send the code of detach_mounts into an infinite loop.

  Fixing the code to do the right thing removes the possibility of these
  user triggered infinite loops in the code"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  mnt: In detach_mounts detach the appropriate unmounted mount
  mnt: Clarify and correct the disconnect logic in umount_tree
2015-07-23 13:16:21 -07:00
Tejun Heo
5aa2a96b34 block: export bio_associate_*() and wbc_account_io()
bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
are used to implement cgroup writeback support for filesystems and
thus need to be exported.  Export them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-23 13:36:44 -06:00
Eric W. Biederman
fe78fcc85a mnt: In detach_mounts detach the appropriate unmounted mount
The handling of in detach_mounts of unmounted but connected mounts is
buggy and can lead to an infinite loop.

Correct the handling of unmounted mounts in detach_mount.  When the
mountpoint of an unmounted but connected mount is connected to a
dentry, and that dentry is deleted we need to disconnect that mount
from the parent mount and the deleted dentry.

Nothing changes for the unmounted and connected children.  They can be
safely ignored.

Cc: stable@vger.kernel.org
Fixes: ce07d891a0 mnt: Honor MNT_LOCKED when detaching mounts
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-23 11:31:15 -05:00
Eric W. Biederman
f2d0a123bc mnt: Clarify and correct the disconnect logic in umount_tree
rmdir mntpoint will result in an infinite loop when there is
a mount locked on the mountpoint in another mount namespace.

This is because the logic to test to see if a mount should
be disconnected in umount_tree is buggy.

Move the logic to decide if a mount should remain connected to
it's mountpoint into it's own function disconnect_mount so that
clarity of expression instead of terseness of expression becomes
a virtue.

When the conditions where it is invalid to leave a mount connected
are first ruled out, the logic for deciding if a mount should
be disconnected becomes much clearer and simpler.

Fixes: e0c9c0afd2 mnt: Update detach_mounts to leave mounts connected
Fixes: ce07d891a0 mnt: Honor MNT_LOCKED when detaching mounts
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-22 20:33:27 -05:00
Filipe Manana
00d80e342c Btrfs: fix quick exhaustion of the system array in the superblock
Omar reported that after commit 4fbcdf6694 ("Btrfs: fix -ENOSPC when
finishing block group creation"), introduced in 4.2-rc1, the following
test was failing due to exhaustion of the system array in the superblock:

  #!/bin/bash

  truncate -s 100T big.img
  mkfs.btrfs big.img
  mount -o loop big.img /mnt/loop

  num=5
  sz=10T
  for ((i = 0; i < $num; i++)); do
      echo fallocate $i $sz
      fallocate -l $sz /mnt/loop/testfile$i
  done
  btrfs filesystem sync /mnt/loop

  for ((i = 0; i < $num; i++)); do
        echo rm $i
        rm /mnt/loop/testfile$i
        btrfs filesystem sync /mnt/loop
  done
  umount /mnt/loop

This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
allocation of system block groups. This happened because the test creates
a large number of data block groups per transaction and when committing
the transaction we start the writeout of the block group caches for all
the new new (dirty) block groups, which results in pre-allocating space
for each block group's free space cache using the same transaction handle.
That in turn often leads to creation of more block groups, and all get
attached to the new_bgs list of the same transaction handle to the point
of getting a list with over 1500 elements, and creation of new block groups
leads to the need of reserving space in the chunk block reserve and often
creating a new system block group too.

So that made us quickly exhaust the chunk block reserve/system space info,
because as of the commit mentioned before, we do reserve space for each
new block group in the chunk block reserve, unlike before where we would
not and would at most allocate one new system block group and therefore
would only ensure that there was enough space in the system space info to
allocate 1 new block group even if we ended up allocating thousands of
new block groups using the same transaction handle. That worked most of
the time because the computed required space at check_system_chunk() is
very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
that all nodes/leafs in a path will be COWed and split) and since the
updates to the chunk tree all happen at btrfs_create_pending_block_groups
it is unlikely that a path needs to be COWed more than once (unless
writepages() for the btree inode is called by mm in between) and that
compensated for the need of creating any new nodes/leads in the chunk
tree.

So fix this by ensuring we don't accumulate a too large list of new block
groups in a transaction's handles new_bgs list, inserting/updating the
chunk tree for all accumulated new block groups and releasing the unused
space from the chunk block reserve whenever the list becomes sufficiently
large. This is a generic solution even though the problem currently can
only happen when starting the writeout of the free space caches for all
dirty block groups (btrfs_start_dirty_block_groups()).

Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:54 -07:00
Anand Jain
3e303ea60d btrfs: its btrfs_err() instead of btrfs_error()
sorry I indented to use btrfs_err() and I have no idea
how btrfs_error() got there.
infact I was thinking about these kind of oversights
since these two func are too closely named.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:53 -07:00
Zhao Lei
95ab1f6490 btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
When read_tree_block() failed, we can see following dmesg:
 [  134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063
 [  134.372236] IP: [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
 [  134.372236] PGD 0
 [  134.372236] Oops: 0000 [#1] SMP
 [  134.372236] Modules linked in:
 [  134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f046843d2455aa231747b5a07a999a9f3d_+ #115
 [  134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
 [  134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000
 [  134.372236] RIP: 0010:[<ffffffff813a4a51>]  [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
 ...
 [  134.372236] Call Trace:
 [  134.372236]  [<ffffffff81379aa1>] free_root_extent_buffers+0x91/0xb0
 [  134.372236]  [<ffffffff81379c3d>] free_root_pointers+0x17d/0x190
 [  134.372236]  [<ffffffff813801b0>] open_ctree+0x1ca0/0x25b0
 [  134.372236]  [<ffffffff8144d017>] ? disk_name+0x97/0xb0
 [  134.372236]  [<ffffffff813558aa>] btrfs_mount+0x8fa/0xab0
 ...

Reason:
 read_tree_block() changed to return error number on fail,
 and this value(not NULL) is set to tree_root->node, then subsequent
 code will run to:
  free_root_pointers()
  ->free_root_extent_buffers()
  ->free_extent_buffer()
  ->atomic_read((extent_buffer *)(-E_XXX)->refs);
 and trigger above error.

Fix:
 Set tree_root->node to NULL on fail to make error_handle code
 happy.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:52 -07:00
Zhao Lei
8a73301304 btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of
delayed_iput_sem in xfstests generic/241:
  [ 2061.345955] =============================================
  [ 2061.346027] [ INFO: possible recursive locking detected ]
  [ 2061.346027] 4.1.0+ #268 Tainted: G        W
  [ 2061.346027] ---------------------------------------------
  [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
  [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
  [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
  [ 2061.346027] but task is already holding lock:
  [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
  [ 2061.346027] other info that might help us debug this:
  [ 2061.346027]  Possible unsafe locking scenario:

  [ 2061.346027]        CPU0
  [ 2061.346027]        ----
  [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
  [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
  [ 2061.346027]
   *** DEADLOCK ***
It is rarely happened, about 1/400 in my test env.

The reason is recursion of btrfs_run_delayed_iputs():
  cleaner_kthread
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock *2
  -> iput()
  -> ...
  -> btrfs_commit_transaction()
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock (dead lock) *2
  *1: recursion of btrfs_run_delayed_iputs()
  *2: warning of lockdep about delayed_iput_sem

When fs is in high stress, new iputs may added into fs_info->delayed_iputs
list when btrfs_run_delayed_iputs() is running, which cause
second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem)
again, and cause above lockdep warning.

Actually, it will not cause real problem because both locks are read lock,
but to avoid lockdep warning, we can do a fix.

Fix:
  Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for
  cleaner_kthread thread to break above recursion path.
  cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code,
  and don't need to call btrfs_run_delayed_iputs() again in
  btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow.

Test:
  No above lockdep warning after patch in 1200 generic/241 tests.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:51 -07:00
Trond Myklebust
cd81259979 NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
Setting the change attribute has been mandatory for all NFS versions, since
commit 3a1556e866 ("NFSv2/v3: Simulate the change attribute"). We should
therefore not have anything be conditional on it being set/unset.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 17:15:54 -04:00
Trond Myklebust
5c675d6420 NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
We can't allow caching of data until the change attribute has been
initialised correctly.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 17:15:53 -04:00
Trond Myklebust
85a23cee3f NFS: Don't revalidate the mapping if both size and change attr are up to date
If we've ensured that the size and the change attribute are both correct,
then there is no point in marking those attributes as needing revalidation
again. Only do so if we know the size is incorrect and was not updated.

Fixes: f2467b6f64 ("NFS: Clear NFS_INO_REVAL_PAGECACHE when...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 17:15:53 -04:00
Trond Myklebust
2b83d3de4c NFSv4/pnfs: Ensure we don't miss a file extension
pNFS writes don't return attributes, however that doesn't mean that we
should ignore the fact that they may be extending the file. This patch
ensures that if a write is seen to extend the file, then we always set
an attribute barrier, and update the cached file size.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 17:15:53 -04:00
Trond Myklebust
3c38cbe2ad NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
Otherwise, nfs4_select_rw_stateid() will always return the zero stateid
instead of the correct open stateid.

Fixes: f95549cf24 ("NFSv4: More CLOSE/OPEN races")
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 17:10:51 -04:00
Linus Torvalds
d725e66c06 Revert "fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()"
This reverts commit a2673b6e04.

Kinglong Mee reports a memory leak with that patch, and Jan Kara confirms:

 "Thanks for report! You are right that my patch introduces a race
  between fsnotify kthread and fsnotify_destroy_group() which can result
  in leaking inotify event on group destruction.

  I haven't yet decided whether the right fix is not to queue events for
  dying notification group (as that is pointless anyway) or whether we
  should just fix the original problem differently...  Whenever I look
  at fsnotify code mark handling I get lost in the maze of locks, lists,
  and subtle differences between how different notification systems
  handle notification marks :( I'll think about it over night"

and after thinking about it, Jan says:

 "OK, I have looked into the code some more and I found another
  relatively simple way of fixing the original oops.  It will be IMHO
  better than trying to fixup this issue which has more potential for
  breakage.  I'll ask Linus to revert the fsnotify fix he already merged
  and send a new fix"

Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Requested-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-21 16:06:53 -07:00
Linus Torvalds
8426fb302c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
 "A fix for UDF corruption when certain disk-format feature is enabled"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Don't corrupt unalloc spacetable when writing it
2015-07-21 15:18:06 -07:00
Linus Torvalds
0e1dbccd8f Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Two families of fixes:

   - Fix an FPU context related boot crash on newer x86 hardware with
     larger context sizes than what most people test.  To fix this
     without ugly kludges or extensive reverts we had to touch core task
     allocator, to allow x86 to determine the task size dynamically, at
     boot time.

     I've tested it on a number of x86 platforms, and I cross-built it
     to a handful of architectures:

                                        (warns)               (warns)
       testing     x86-64:  -git:  pass (    0),  -tip:  pass (    0)
       testing     x86-32:  -git:  pass (    0),  -tip:  pass (    0)
       testing        arm:  -git:  pass ( 1359),  -tip:  pass ( 1359)
       testing       cris:  -git:  pass ( 1031),  -tip:  pass ( 1031)
       testing       m32r:  -git:  pass ( 1135),  -tip:  pass ( 1135)
       testing       m68k:  -git:  pass ( 1471),  -tip:  pass ( 1471)
       testing       mips:  -git:  pass ( 1162),  -tip:  pass ( 1162)
       testing    mn10300:  -git:  pass ( 1058),  -tip:  pass ( 1058)
       testing     parisc:  -git:  pass ( 1846),  -tip:  pass ( 1846)
       testing      sparc:  -git:  pass ( 1185),  -tip:  pass ( 1185)

     ... so I hope the cross-arch impact 'none', as intended.

     (by Dave Hansen)

   - Fix various NMI handling related bugs unearthed by the big asm code
     rewrite and generally make the NMI code more robust and more
     maintainable while at it.  These changes are a bit late in the
     cycle, I hope they are still acceptable.

     (by Andy Lutomirski)"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86
  x86/fpu, sched: Dynamically allocate 'struct fpu'
  x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code
  x86/nmi/64: Make the "NMI executing" variable more consistent
  x86/nmi/64: Minor asm simplification
  x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection
  x86/nmi/64: Reorder nested NMI checks
  x86/nmi/64: Improve nested NMI comments
  x86/nmi/64: Switch stacks on userspace NMI entry
  x86/nmi/64: Remove asm code that saves CR2
  x86/nmi: Enable nested do_nmi() handling for 64-bit kernels
2015-07-18 10:49:57 -07:00
Linus Torvalds
3a26a5b151 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "25 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
  lib/decompress: set the compressor name to NULL on error
  mm/cma_debug: correct size input to bitmap function
  mm/cma_debug: fix debugging alloc/free interface
  mm/page_owner: set correct gfp_mask on page_owner
  mm/page_owner: fix possible access violation
  fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()
  /proc/$PID/cmdline: fixup empty ARGV case
  dma-debug: skip debug_dma_assert_idle() when disabled
  hexdump: fix for non-aligned buffers
  checkpatch: fix long line messages about patch context
  mm: clean up per architecture MM hook header files
  MAINTAINERS: uclinux-h8-devel is moderated for non-subscribers
  mailmap: update Sudeep Holla's email id
  Update Viresh Kumar's email address
  mm, meminit: suppress unused memory variable warning
  configfs: fix kernel infoleak through user-controlled format string
  include, lib: add __printf attributes to several function prototypes
  s390/hugetlb: add hugepages_supported define
  mm: hugetlb: allow hugepages_supported to be architecture specific
  revert "s390/mm: make hugepages_supported a boot time decision"
  ...
2015-07-18 10:01:04 -07:00
Linus Torvalds
8be5701342 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are all from Filipe, and cover a few problems we've had reported
  on the list recently (along with ones he found on his own)"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file corruption after cloning inline extents
  Btrfs: fix order by which delayed references are run
  Btrfs: fix list transaction->pending_ordered corruption
  Btrfs: fix memory leak in the extent_same ioctl
  Btrfs: fix shrinking truncate when the no_holes feature is enabled
2015-07-17 21:46:57 -07:00
Ingo Molnar
5aaeb5c01c x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86
Don't burden architectures without dynamic task_struct sizing
with the overhead of dynamic sizing.

Also optimize the x86 code a bit by caching task_struct_size.

Acked-and-Tested-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437128892-9831-3-git-send-email-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-18 03:42:51 +02:00
Dave Hansen
0c8c0f03e3 x86/fpu, sched: Dynamically allocate 'struct fpu'
The FPU rewrite removed the dynamic allocations of 'struct fpu'.
But, this potentially wastes massive amounts of memory (2k per
task on systems that do not have AVX-512 for instance).

Instead of having a separate slab, this patch just appends the
space that we need to the 'task_struct' which we dynamically
allocate already.  This saves from doing an extra slab
allocation at fork().

The only real downside here is that we have to stick everything
and the end of the task_struct.  But, I think the
BUILD_BUG_ON()s I stuck in there should keep that from being too
fragile.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-18 03:42:35 +02:00
Jan Kara
a2673b6e04 fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()
fsnotify_clear_marks_by_group_flags() can race with
fsnotify_destroy_marks() so when fsnotify_destroy_mark_locked() drops
mark_mutex, a mark from the list iterated by
fsnotify_clear_marks_by_group_flags() can be freed and we dereference free
memory in the loop there.

Fix the problem by keeping mark_mutex held in
fsnotify_destroy_mark_locked().  The reason why we drop that mutex is that
we need to call a ->freeing_mark() callback which may acquire mark_mutex
again.  To avoid this and similar lock inversion issues, we move the call
to ->freeing_mark() callback to the kthread destroying the mark.

Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Ashish Sangwan <a.sangwan@samsung.com>
Suggested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-17 16:39:54 -07:00
Alexey Dobriyan
3581d458c3 /proc/$PID/cmdline: fixup empty ARGV case
/proc/*/cmdline code checks if it should look at ENVP area by checking
last byte of ARGV area:

	rv = access_remote_vm(mm, arg_end - 1, &c, 1, 0);
	if (rv <= 0)
		goto out_free_page;

If ARGV is somehow made empty (by doing execve(..., NULL, ...) or
manually setting ->arg_start and ->arg_end to equal values), the decision
will be based on byte which doesn't even belong to ARGV/ENVP.

So, quickly check if ARGV area is empty and report 0 to match previous
behaviour.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-17 16:39:54 -07:00
Nicolas Iooss
3958b79266 configfs: fix kernel infoleak through user-controlled format string
Some modules call config_item_init_type_name() and config_group_init_type_name()
with parameter "name" directly controlled by userspace.  These two
functions call config_item_set_name() with this name used as a format
string, which can be used to leak information such as content of the
stack to userspace.

For example, make_netconsole_target() in netconsole module calls
config_item_init_type_name() with the name of a newly-created directory.
This means that the following commands give some unexpected output, with
configfs mounted in /sys/kernel/config/ and on a system with a
configured eth0 ethernet interface:

    # modprobe netconsole
    # mkdir /sys/kernel/config/netconsole/target_%lx
    # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name
    # echo 1 > /sys/kernel/config/netconsole/target_%lx/enabled
    # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name
    # dmesg |tail -n1
    [  142.697668] netconsole: target (target_ffffffffc0ae8080) is
    enabled, disable to update parameters

The directory name is correct but %lx has been interpreted in the
internal item name, displayed here in the error message used by
store_dev_name() in drivers/net/netconsole.c.

To fix this, update every caller of config_item_set_name to use "%s"
when operating on untrusted input.

This issue was found using -Wformat-security gcc flag, once a __printf
attribute has been added to config_item_set_name().

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Felipe Balbi <balbi@ti.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-17 16:39:53 -07:00