linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-25 04:22:19 +07:00

Author	SHA1	Message	Date
Jan Kara	23d0127096	fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback sync_file_range(2) is documented to issue writeback only for pages that are not currently being written. After all the system call has been created for userspace to be able to issue background writeout and so waiting for in-flight IO is undesirable there. However commit `ee53a891f4` ("mm: do_sync_mapping_range integrity fix") switched do_sync_mapping_range() and thus sync_file_range() to issue writeback in WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code relying on WB_SYNC_ALL semantics. These days do_sync_mapping_range() went away and we can switch sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should help PostgreSQL avoid large latency spikes when flushing data in the background. Andres measured a 20% increase in transactions per second on an SSD disk. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Andres Freund <andres@anarazel.de> Tested-By: Andres Freund <andres@anarazel.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Michal Hocko	c62d25556b	mm, fs: introduce mapping_gfp_constraint() There are many places which use mapping_gfp_mask to restrict a more generic gfp mask which would be used for allocations which are not directly related to the page cache but they are performed in the same context. Let's introduce a helper function which makes the restriction explicit and easier to track. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Mel Gorman	71baba4b92	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM __GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Mel Gorman	d0164adc89	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Linus Torvalds	2e3078af2c	Merge branch 'akpm' (patches from Andrew) Merge patch-bomb from Andrew Morton: - inotify tweaks - some ocfs2 updates (many more are awaiting review) - various misc bits - kernel/watchdog.c updates - Some of mm. I have a huge number of MM patches this time and quite a lot of it is quite difficult and much will be held over to next time. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) selftests: vm: add tests for lock on fault mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage mm: introduce VM_LOCKONFAULT mm: mlock: add new mlock system call mm: mlock: refactor mlock, munlock, and munlockall code kasan: always taint kernel on report mm, slub, kasan: enable user tracking by default with KASAN=y kasan: use IS_ALIGNED in memory_is_poisoned_8() kasan: Fix a type conversion error lib: test_kasan: add some testcases kasan: update reference to kasan prototype repo kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile kasan: various fixes in documentation kasan: update log messages kasan: accurately determine the type of the bad access kasan: update reported bug types for kernel memory accesses kasan: update reported bug types for not user nor kernel memory accesses mm/kasan: prevent deadlock in kasan reporting mm/kasan: don't use kasan shadow pointer in generic functions mm/kasan: MODULE_VADDR is not available on all archs ...	2015-11-05 23:10:54 -08:00
Eric Biggers	ea5c58e70c	vfs: clear remainder of 'full_fds_bits' in dup_fd() This fixes a bug from commit `f3f86e33dc` ("vfs: Fix pathological performance case for __alloc_fd()"). v2: refactor to share fd bitmap copying code Signed-off-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 23:05:32 -08:00
David Rientjes	b72bdfa736	mm, oom: add comment for why oom_adj exists /proc/pid/oom_adj exists solely to avoid breaking existing userspace binaries that write to the tunable. Add a comment in the only possible location within the kernel tree to describe the situation and motivation for keeping it around. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Laurent Dufour	5d3875a01e	mm: clear_soft_dirty_pmd() requires THP Don't build clear_soft_dirty_pmd() if transparent huge pages are not enabled. Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Laurent Dufour	326c2597a3	mm: clear pte in clear_soft_dirty() As mentioned in the commit `56eecdb912` ("mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit"), architectures like ppc64 don't do tlb flush in set_pte/pmd functions. So when dealing with existing pte in clear_soft_dirty, the pte must be cleared before being modified. Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Junichi Nomura	aa750fd71c	mm/filemap.c: make global sync not clear error status of individual inodes filemap_fdatawait() is a function to wait for on-going writeback to complete but also consume and clear error status of the mapping set during writeback. The latter functionality is critical for applications to detect writeback error with system calls like fsync(2)/fdatasync(2). However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl, which don't check error status of individual mappings. As a result, fsync() may not be able to detect writeback error if events happen in the following order: Application System admin ---------------------------------------------------------- write data on page cache Run sync command writeback completes with error filemap_fdatawait() clears error fsync returns success (but the data is not on disk) This patch adds filemap_fdatawait_keep_errors() for call sites where writeback error is not handled so that they don't clear error status. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Naoya Horiguchi	5d317b2b65	mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Currently there's no easy way to get per-process usage of hugetlb pages, which is inconvenient because userspace applications which use hugetlb typically want to control their processes on the basis of how much memory (including hugetlb) they use. So this patch simply provides easy access to the info via /proc/PID/status. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Joern Engel <joern@logfs.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Naoya Horiguchi	25ee01a2fc	mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which is inconvenient when we want to know per-task or per-vma base hugetlb usage. To solve this, this patch adds new fields for hugetlb usage like below: Size: 20480 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB AnonHugePages: 0 kB Shared_Hugetlb: 18432 kB Private_Hugetlb: 2048 kB Swap: 0 kB KernelPageSize: 2048 kB MMUPageSize: 2048 kB Locked: 0 kB VmFlags: rd wr mr mw me de ht [hughd@google.com: fix Private_Hugetlb alignment ] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Joern Engel <joern@logfs.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Dominique Martinet	b64787401f	9p: do not overwrite return code when locking fails If the remote locking fail, we run a local vfs unlock that should work and return success to userland when we didn't actually lock at all. We need to tell the application that tried to lock that it didn't get it, not that all went well. Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Joseph Qi	262d8a8779	ocfs2: clean up unused variable in ocfs2_duplicate_clusters_by_page() readahead_pages in ocfs2_duplicate_clusters_by_page is defined but not used, so clean it up. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Joseph Qi	5afc44e2e9	ocfs2: add uuid to ocfs2 thread name for problem analysis A node can mount multiple ocfs2 volumes. And if thread names are same for each volume/domain, it will bring inconvenience when analyzing problems because we have to identify which volume/domain the messages belong to. Since thread name will be printed to messages, so add volume uuid or dlm name to thread name can benefit problem analysis. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Gang He <ghe@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
alex chen	b1529a41f7	ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error In ocfs2_mknod_locked if '__ocfs2_mknod_locke d' returns an error, we should reclaim the inode successfully claimed above, otherwise, the inode never be reused. The case is described below: ocfs2_mknod ocfs2_mknod_locked ocfs2_claim_new_inode Successfully claim the inode __ocfs2_mknod_locked ocfs2_journal_access_di Failed because of -ENOMEM or other reasons, the inode lockres has not been initialized yet. iput(inode) ocfs2_evict_inode ocfs2_delete_inode ocfs2_inode_lock ocfs2_inode_lock_full_nested __ocfs2_cluster_lock Return -EINVAL because of the inode lockres has not been initialized. So the following operations are not performed ocfs2_wipe_inode ocfs2_remove_inode ocfs2_free_dinode ocfs2_free_suballoc_bits Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Joseph Qi	0986fe9b50	ocfs2: fix race between mount and delete node/cluster There is a race case between mount and delete node/cluster, which will lead o2hb_thread to malfunctioning dead loop. o2hb_thread { o2nm_depend_this_node(); <<<<<< race window, node may have already been deleted, and then enter the loop, o2hb thread will be malfunctioning because of no configured nodes found. while (!kthread_should_stop() && !reg->hr_unclean_stop && !reg->hr_aborted_start) { } So check the return value of o2nm_depend_this_node() is needed. If node has been deleted, do not enter the loop and let mount fail. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Joseph Qi	93d911fcce	ocfs2: only take lock if dio entry when recover orphans We have no need to take inode mutex, rw and inode lock if it is not dio entry when recover orphans. Optimize it by adding a flag OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to reduce contention. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Joseph Qi	30edc43c7f	ocfs2: do not include dio entry in case of orphan scan dio entry will only do truncate in case of ORPHAN_NEED_TRUNCATE. So do not include it when doing normal orphan scan to reduce contention. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Joseph Qi	1d1aff8cf3	ocfs2: improve performance for localalloc Currently cluster allocation is always trying to find a victim chain (a chian has most space), and this may lead to poor performance because of discontiguous allocation in some scenarios. Our test case is block size 4k, cluster size 1M and mount option with localalloc=2048 (2G), since a gd is 32256M (about 31.5G) and a localalloc window is only 2G, creating 50G file will result in 2G from gd0, 2G from gd1, ... One way to improve performance is enlarge localalloc window size (max 31104M), but this will make end user feel that about 30G is suddenly "missing", and localalloc currently do not support steal, which means one node cannot use another node's localalloc even it is not used in fact. So using the last gd to record the allocation and continues with the gd if it has enough space for a localalloc window can make the allocation as more contiguous as possible. Our test result is below (evaluated in IOPS), which is using iometer running in VM, dynamic vhd virtual disk stored in ocfs2. IO model Original After Improved(%) 16K60%Write100%Random 703 876 24.59% 8K90%Write100%Random 735 827 12.59% 4K100%Write100%Random 859 915 6.52% 4K100%Read100%Random 2092 2600 24.30% Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Tested-by: Norton Zhu <norton.zhu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
jiangyiwen	4e357b932a	ocfs2: fill in the unused portion of the block with zeros by dio_zero_block() A simplified test case is (this case from Ryan): 1) dd if=/dev/zero of=/mnt/hello bs=512 count=1 oflag=direct; 2) truncate /mnt/hello -s 2097152 file 'hello' is not exist before test. After this command, file 'hello' should be all zero. But 512~4096 is some random data. Setting bh state to new when get a new block, if so, direct_io_worker()->dio_zero_block() will fill-in the unused portion of the block with zero. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Norton.Zhu	d162eaad77	ocfs2_direct_IO_write() misses ocfs2_is_overwrite() error code If ocfs2_is_overwrite failed, ocfs2_direct_IO_write mays till return success to the caller. Signed-off-by: Norton.Zhu <norton.zhu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Sudip Mukherjee	ce4f2fd7ea	logfs: fix build warning fs/logfs/dev_bdev.c: In function '__bdev_writeseg': include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default] (void) (&_min1 == &_min2); \ fs/logfs/dev_bdev.c:84:14: note: in expansion of macro 'min' max_pages = min(nr_pages, BIO_MAX_PAGES); fs/logfs/dev_bdev.c: In function 'do_erase': include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default] (void) (&_min1 == &_min2); \ fs/logfs/dev_bdev.c:174:14: note: in expansion of macro 'min' max_pages = min(nr_pages, BIO_MAX_PAGES); Lets use min_t and mention the type. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Dave Hansen	d30e2c05a1	inotify: actually check for invalid bits in sys_inotify_add_watch() The comment here says that it is checking for invalid bits. But, the mask is actually checking to ensure that _any_ valid bit is set, which is quite different. Without this check, an unexpected bit could get set on an inotify object. Since these bits are also interpreted by the fsnotify/dnotify code, there is the potential for an object to be mishandled inside the kernel. For instance, can we be sure that setting the dnotify flag FS_DN_RENAME on an inotify watch is harmless? Add the actual check which was intended. Retain the existing inotify bits are being added to the watch. Plus, this is existing behavior which would be nice to preserve. I did a quick sniff test that inotify functions and that my 'inotify-tools' package passes 'make check'. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Dave Hansen	6933599697	inotify: hide internal kernel bits from fdinfo There was a report that my patch: inotify: actually check for invalid bits in sys_inotify_add_watch() broke CRIU. The reason is that CRIU looks up raw flags in /proc/$pid/fdinfo/* to figure out how to rebuild inotify watches and then passes those flags directly back in to the inotify API. One of those flags (FS_EVENT_ON_CHILD) is set in mark->mask, but is not part of the inotify API. It is used inside the kernel to _implement_ inotify but it is not and has never been part of the API. My patch above ensured that we only allow bits which are part of the API (IN_ALL_EVENTS). This broke CRIU. FS_EVENT_ON_CHILD is really internal to the kernel. It is set _anyway_ on all inotify marks. So, CRIU was really just trying to set a bit that was already set. This patch hides that bit from fdinfo. CRIU will not see the bit, not try to set it, and should work as before. We should not have been exposing this bit in the first place, so this is a good patch independent of the CRIU problem. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reported-by: Andrey Wagin <avagin@gmail.com> Acked-by: Andrey Vagin <avagin@openvz.org> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Eric Paris <eparis@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Linus Torvalds	1873499e13	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem update from James Morris: "This is mostly maintenance updates across the subsystem, with a notable update for TPM 2.0, and addition of Jarkko Sakkinen as a maintainer of that" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits) apparmor: clarify CRYPTO dependency selinux: Use a kmem_cache for allocation struct file_security_struct selinux: ioctl_has_perm should be static selinux: use sprintf return value selinux: use kstrdup() in security_get_bools() selinux: use kmemdup in security_sid_to_context_core() selinux: remove pointless cast in selinux_inode_setsecurity() selinux: introduce security_context_str_to_sid selinux: do not check open perm on ftruncate call selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default KEYS: Merge the type-specific data with the payload data KEYS: Provide a script to extract a module signature KEYS: Provide a script to extract the sys cert list from a vmlinux file keys: Be more consistent in selection of union members used certs: add .gitignore to stop git nagging about x509_certificate_list KEYS: use kvfree() in add_key Smack: limited capability for changing process label TPM: remove unnecessary little endian conversion vTPM: support little endian guests char: Drop owner assignment from i2c_driver ...	2015-11-05 15:32:38 -08:00
Linus Torvalds	6de29ccb50	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns hardlink capability check fix from Eric Biederman: "This round just contains a single patch. There has been a lot of other work this period but it is not quite ready yet, so I am pushing it until 4.5. The remaining change by Dirk Steinmetz wich fixes both Gentoo and Ubuntu containers allows hardlinks if we have the appropriate capabilities in the user namespace. Security wise it is really a gimme as the user namespace root can already call setuid become that user and create the hardlink" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: namei: permit linking with CAP_FOWNER in userns	2015-11-05 15:20:56 -08:00
Linus Torvalds	66339fdacb	Half dozen small cleanups plus change to allow pstore backend drivers to be unloaded. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWN9nyAAoJEKurIx+X31iBlSQQALBEB5pwkCEBcJlXa1SL1uRN WFOUhoNI6Rh1Wlsu6t0P9AhtotCHUeSRtF+Y05EXcytBb1EqsW90fk4m7VFDFuCp drDYPJhNaFcvxDkt1PKKGRysGLXsCjr5szuApCFpYwg3FaxqddXTfFdZ7zdWaRb2 NkUd+aSElNr1avrulgyTqHjWFCP93bWMh6tFhUjCRjwmXDhxvHxLtlRUMtPlsgrC nWuawkyrR31EJJoQ9lnvEQBjP6i5qSMfU+2o6nUm6/5LNe/m9iWDxmWakoa7p8e4 XArywFijO18byvjsvaJhUOLzLV0TT+PoL14m5U7JP0JA9mtpCYNvnb62CWmFulM2 Q75FGdfj2UQOnwMnaBpYPNC6S/ddLtl0iWGivgI3ja47xG9TGzEYmTrObt9LfVzd kv1Nw/dNUY0fTb+n7rPBkpyHKO9ZQPQSebDOU7MJ61uuS/QB/sEbk2gU7HsV/Q68 ivOvy1zui9ggpPOuApqQTVi/OCHtb+TJep2+U5O8NR6DFY/bhQCH6g8mPaMKU5G0 6HF0kwx4h905VHWiP4I0EphmrDWVvpFMUgZoHgi3CCdmbm64+BPXYa6AiJ3I00Ed FstB7pYuwsJ8jsYhxRVVSEW7vKZQ+jS5tzbLBjuTYe+KfBLFm7A5OdBHmhpXug0k zw4kssZrm30mCpWR90ew =liJx -----END PGP SIGNATURE----- Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore updates from Tony Luck: "Half dozen small cleanups plus change to allow pstore backend drivers to be unloaded" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: fix code comment to match code efi-pstore: fix kernel-doc argument name pstore: Fix return type of pstore_is_mounted() pstore: add pstore unregister pstore: add a helper function pstore_register_kmsg pstore: add vmalloc error check	2015-11-05 11:51:18 -08:00
Linus Torvalds	0fcb9d21b4	Merge tag 'for-f2fs-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "Most part of the patches include enhancing the stability and performance of in-memory extent caches feature. In addition, it introduces several new features and configurable points: - F2FS_GOING_DOWN_METAFLUSH ioctl to test power failures - F2FS_IOC_WRITE_CHECKPOINT ioctl to trigger checkpoint by users - background_gc=sync mount option to do gc synchronously - periodic checkpoints - sysfs entry to control readahead blocks for free nids And the following bug fixes have been merged. - fix SSA corruption by collapse/insert_range - correct a couple of gc behaviors - fix the results of f2fs_map_blocks - fix error case handling of volatile/atomic writes" * tag 'for-f2fs-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (54 commits) f2fs: fix to skip shrinking extent nodes f2fs: fix error path of ->symlink f2fs: fix to clear GCed flag for atomic written page f2fs: don't need to submit bio on error case f2fs: fix leakage of inmemory atomic pages f2fs: refactor __find_rev_next_{zero}_bit f2fs: support fiemap for inline_data f2fs: flush dirty data for bmap f2fs: relocate the tracepoint for background_gc f2fs crypto: fix racing of accessing encrypted page among f2fs: export ra_nid_pages to sysfs f2fs: readahead for free nids building f2fs: support lower priority asynchronous readahead in ra_meta_pages f2fs: don't tag REQ_META for temporary non-meta pages f2fs: add a tracepoint for f2fs_read_data_pages f2fs: set GFP_NOFS for grab_cache_page f2fs: fix SSA updates resulting in corruption Revert "f2fs: do not skip dentry block writes" f2fs: add F2FS_GOING_DOWN_METAFLUSH to test power-failure f2fs: merge meta writes as many possible ...	2015-11-05 11:22:07 -08:00
Linus Torvalds	d000f8d67f	dlm for 4.4 This includes one simple fix to make posix locks interruptible by signals in cases where a signal handler is used. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWO3ZiAAoJEDgbc8f8gGmqXc8P/j1GcuiMyHbRjlyGgExYYuFG l9eeoCAzUy4aE9PnN90ur/9FhYGQ3fEte4xaKDU7FMTsQIvNrKYdNAt6qWQdQI6M shogDw6hXwXJ42IPS9fuj2dEPFD5wBoMYjEGowzdAvsMH3cROyN03hIWqSWTL3jI UFSR7NjbnQwT8ZUAYcICEE5VerqsGzxGkaFF+V3fYASsZlALTlQkVHT5DQzCkXgq 0CkNhsMx5H4Ng9y/2dnhPj3y24NqbhdtLX4dkcKevMmHP5FJ/rEI82oizxPgJ8oZ QlcSOUZatNuqLVAStecmsd5sH80/IDspnpMDxnQCKnioNq3x6YXXfhyv5CKB6Ahy atA3SlYDACiZz5tydJ/97DJvvIrF2rUETPXk2Lobc972UU99r8zxCUah8xv4ThD/ DtuSkqNnTmXjMcTssHDqo/Kg16dZxpx+itxsWCEivfZm6EL1j5RAvZO5G04wMmry D/FXDKT/FZR+xYDIg1FLc1uOMldeRbMWhb+zGfTAnYy0aH43oyePVddbC+lSuVfp Pat2avXoovR59+7nhFk+s+xf3c8oKMoSwCZOso4OoVySRZQmE1dH6m6D5RLkgRHw nTGggRRAAOsLoiYFtnCKXHkxUTGZWDJfwtv1OI1IABqdoSj5Px/4JAswWj1e5+dH haCGlyvMDqK8ImaekGcZ =m+FJ -----END PGP SIGNATURE----- Merge tag 'dlm-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm update from David Teigland: "This includes one simple fix to make posix locks interruptible by signals in cases where a signal handler is used" * tag 'dlm-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: make posix locks interruptible	2015-11-05 11:15:25 -08:00
Linus Torvalds	9576c2f293	File locking related changes for v4.4 (pile #1 ) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWNsKlAAoJEAAOaEEZVoIVKNMP+QHb96HMNWnMlBE9jwPbBK/2 yM80sa6wRcbCF519sRFbmOheet4bgNSHixegtUez5kyqyI7Hr0tsRYvIo5/amAWX EIh03fZoM+Bgm+dblYivorSrPmmx2UQ9RG6pUbcOPtxdCpQ79tfzVyYVykG5wcb5 NLSibG9s5USutOXPTatxDqS6P2QwvvWXHR5oX1mkU2W7nQXfHOdQKSuk5CqUeIWx JSGIa+plS9fath1Ndu4pJ7atvU8cR0t+VeOqPmGoqqIDyGVbo45XgXZmk0xCxEs9 XsVSbdGBMAtA63xlZHFROADFNXIosay2zA7mdG0i3IrLRMQr/okQhTqBrFMKmj0m cDMDNOs4j4M8JJPkwrJQ3S/1Tnl+zyAuKKTJwgvVnd1tcyTZjs3g77I9e84pSTsp chL4FmfeR7dhk+YJgcnbzvnnP7tBbQcV0ET/ILVsDU7bNDujWlcDzYkbbWx70WLa KobjmsW/OAGaQugIMA1oGLTexT1u9HtDYOw8JVNBKwlrnPKyFVb8X88gx2Laf34L Qa04TdrFseuxbnBGifLyQTsLxgF9QalUo+51J0I4a7G3WX0U2Zuk+ZTbHc6ChhdW d0oL2SEyToscRADRL0/u2CUR1dEXkdDXi3pxgvDs5PTJVU+lIy4czp/dI5JrjKUA L7O27Kstgoe2GctHn6FI =OYAZ -----END PGP SIGNATURE----- Merge tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linux Pull file locking updates from Jeff Layton: "The largest series of changes is from Ben who offered up a set to add a new helper function for setting locks based on the type set in fl_flags. Dmitry also send in a fix for a potential race that he found with KTSAN" * tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linux: locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait Move locks API users to locks_lock_inode_wait() locks: introduce locks_lock_inode_wait() locks: Use more file_inode and fix a comment fs: fix data races on inode->i_flctx locks: change tracepoint for generic_add_lease	2015-11-05 10:31:29 -08:00
Linus Torvalds	e880e87488	driver core update for 4.4-rc1 Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch of debugfs updates, with a smattering of minor driver core fixes and updates as well. All have been in linux-next for a long time. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEABECAAYFAlY6ePQACgkQMUfUDdst+ymNTgCgpP0CZw57GpwF/Hp2L/lMkVeo Kx8AoKhEi4iqD5fdCQS9qTfomB+2/M6g =g7ZO -----END PGP SIGNATURE----- Merge tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch of debugfs updates, with a smattering of minor driver core fixes and updates as well. All have been in linux-next for a long time" * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Add debugfs_create_ulong() of: to support binding numa node to specified device in devicetree debugfs: Add read-only/write-only bool file ops debugfs: Add read-only/write-only size_t file ops debugfs: Add read-only/write-only x64 file ops debugfs: Consolidate file mode checks in debugfs_create_() Revert "mm: Check if section present during memory block (un)registering" driver-core: platform: Provide helpers for multi-driver modules mm: Check if section present during memory block (un)registering devres: fix a for loop bounds check CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit base/platform: assert that dev_pm_domain callbacks are called unconditionally sysfs: correctly handle short reads on PREALLOC attrs. base: soc: siplify ida usage kobject: move EXPORT_SYMBOL() macros next to corresponding definitions kobject: explain what kobject's sd field is debugfs: document that debugfs_remove() accepts NULL and error values debugfs: Pass bool pointer to debugfs_create_bool() ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'	2015-11-04 21:50:37 -08:00
Linus Torvalds	527d1529e3	Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk	2015-11-04 20:51:48 -08:00
Linus Torvalds	d9734e0d1c	Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This is the core block pull request for 4.4. I've got a few more topic branches this time around, some of them will layer on top of the core+drivers changes and will come in a separate round. So not a huge chunk of changes in this round. This pull request contains: - Enable blk-mq page allocation tracking with kmemleak, from Catalin. - Unused prototype removal in blk-mq from Christoph. - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two xchg()'s, from Davidlohr. - A plug flush fix from Jeff. - Also from Jeff, a fix that means we don't have to update shared tag sets at init time unless we do a state change. This cuts down boot times on thousands of devices a lot with scsi/blk-mq. - blk-mq waitqueue barrier fix from Kosuke. - Various fixes from Ming: - Fixes for segment merging and splitting, and checks, for the old core and blk-mq. - Potential blk-mq speedup by marking ctx pending at the end of a plug insertion batch in blk-mq. - direct-io no page dirty on kernel direct reads. - A WRITE_SYNC fix for mpage from Roman" * 'for-4.4/core' of git://git.kernel.dk/linux-block: blk-mq: avoid excessive boot delays with large lun counts blktrace: re-write setting q->blk_trace blk-mq: mark ctx as pending at batch in flush plug path blk-mq: fix for trace_block_plug() block: check bio_mergeable() early before merging blk-mq: check bio_mergeable() early before merging block: avoid to merge splitted bio block: setup bi_phys_segments after splitting block: fix plug list flushing for nomerge queues blk-mq: remove unused blk_mq_clone_flush_request prototype blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write block: kmemleak: Track the page allocations for struct request	2015-11-04 20:28:10 -08:00
Linus Torvalds	e627078a0c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "There is only one new feature in this pull for the 4.4 merge window, most of it is small enhancements, cleanup and bug fixes: - Add the s390 backend for the software dirty bit tracking. This adds two new pgtable functions pte_clear_soft_dirty and pmd_clear_soft_dirty which is why there is a hit to arch/x86/include/asm/pgtable.h in this pull request. - A series of cleanup patches for the AP bus, this includes the removal of the support for two outdated crypto cards (PCICC and PCICA). - The irq handling / signaling on buffer full in the runtime instrumentation code is dropped. - Some micro optimizations: remove unnecessary memory barriers for a couple of functions: [smb_]rmb, [smb_]wmb, atomics, bitops, and for spin_unlock. Use the builtin bswap if available and make test_and_set_bit_lock more cache friendly. - Statistics and a tracepoint for the diagnose calls to the hypervisor. - The CPU measurement facility support to sample KVM guests is improved. - The vector instructions are now always enabled for user space processes if the hardware has the vector facility. This simplifies the FPU handling code. The fpu-internal.h header is split into fpu internals, api and types just like x86. - Cleanup and improvements for the common I/O layer. - Rework udelay to solve a problem with kprobe. udelay has busy loop semantics but still uses an idle processor state for the wait" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits) s390: remove runtime instrumentation interrupts s390/cio: de-duplicate subchannel validation s390/css: unneeded initialization in for_each_subchannel s390/Kconfig: use builtin bswap s390/dasd: fix disconnected device with valid path mask s390/dasd: fix invalid PAV assignment after suspend/resume s390/dasd: fix double free in dasd_eckd_read_conf s390/kernel: fix ptrace peek/poke for floating point registers s390/cio: move ccw_device_stlck functions s390/cio: move ccw_device_call_handler s390/topology: reduce per_cpu() invocations s390/nmi: reduce size of percpu variable s390/nmi: fix terminology s390/nmi: remove casts s390/nmi: remove pointless error strings s390: don't store registers on disabled wait anymore s390: get rid of __set_psw_mask() s390/fpu: split fpu-internal.h into fpu internals, api, and type headers s390/dasd: fix list_del corruption after lcu changes s390/spinlock: remove unneeded serializations at unlock ...	2015-11-04 11:31:31 -08:00
Linus Torvalds	2814228699	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: "The main changes in this cycle were: - Improvements to expedited grace periods (Paul E McKenney) - Performance improvements to and locktorture tests for percpu-rwsem (Oleg Nesterov, Paul E McKenney) - Torture-test changes (Paul E McKenney, Davidlohr Bueso) - Documentation updates (Paul E McKenney) - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() rcu: Better hotplug handling for synchronize_sched_expedited() rcu: Enable stall warnings for synchronize_rcu_expedited() rcu: Add tasks to expedited stall-warning messages rcu: Add online/offline info to expedited stall warning message rcu: Consolidate expedited CPU selection rcu: Prepare for consolidating expedited CPU selection cpu: Remove try_get_online_cpus() rcu: Stop excluding CPU hotplug in synchronize_sched_expedited() rcu: Stop silencing lockdep false positive for expedited grace periods rcu: Switch synchronize_sched_expedited() to IPI locktorture: Fix module unwind when bad torture_type specified torture: Forgive non-plural arguments rcutorture: Fix unused-function warning for torturing_tasks() rcutorture: Fix module unwind when bad torture_type specified rcu_sync: Cleanup the CONFIG_PROVE_RCU checks locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read() locking/percpu-rwsem: Fix the comments outdated by rcu_sync locking/percpu-rwsem: Make use of the rcu_sync infrastructure locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe ...	2015-11-03 15:40:38 -08:00
Linus Torvalds	7eeef2abe8	Merge branch 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull wchan kernel address hiding from Ingo Molnar: "This fixes a wchan related information leak in /proc/PID/stat. There's a bit of an ABI twist to it: instead of setting the wchan field to 0 (which is our usual technique) we set it conditionally to a 0/1 flag to keep ABI compatibility with older procps versions that only fetches /proc/PID/wchan (symbolic names) if the absolute wchan address is nonzero" * 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/proc, core/debug: Don't expose absolute kernel addresses via wchan	2015-11-03 15:04:04 -08:00
Eric Ren	a6b1533e9a	dlm: make posix locks interruptible Replace wait_event_killable with wait_event_interruptible so that a program waiting for a posix lock can be interrupted by a signal. With the killable version, a program was not interruptible by a signal if it had a signal handler set for it, overriding the default action of terminating the process. Signed-off-by: Eric Ren <zren@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>	2015-11-03 10:38:22 -06:00
Geliang Tang	306e5c2a3c	pstore: fix code comment to match code Fix code comment about kmsg_dump register so it matches the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2015-11-02 13:41:52 -08:00
Linus Torvalds	a5ad88ce8c	mm: get rid of 'vmalloc_info' from /proc/meminfo It turns out that at least some versions of glibc end up reading /proc/meminfo at every single startup, because glibc wants to know the amount of memory the machine has. And while that's arguably insane, it's just how things are. And it turns out that it's not all that expensive most of the time, but the vmalloc information statistics (amount of virtual memory used in the vmalloc space, and the biggest remaining chunk) can be rather expensive to compute. The 'get_vmalloc_info()' function actually showed up on my profiles as 4% of the CPU usage of "make test" in the git source repository, because the git tests are lots of very short-lived shell-scripts etc. It turns out that apparently this same silly vmalloc info gathering shows up on the facebook servers too, according to Dave Jones. So it's not just "make test" for git. We had two patches to just cache the information (one by me, one by Ingo) to mitigate this issue, but the whole vmalloc information of of rather dubious value to begin with, and people who actually want to know what the situation is wrt the vmalloc area should just look at the much more complete /proc/vmallocinfo instead. In fact, according to my testing - and perhaps more importantly, according to that big search engine in the sky: Google - there is nothing out there that actually cares about those two expensive fields: VmallocUsed and VmallocChunk. So let's try to just remove them entirely. Actually, this just removes the computation and reports the numbers as zero for now, just to try to be minimally intrusive. If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-01 17:09:15 -08:00
Linus Torvalds	2e00266297	Merge branch 'fs-file-descriptor-optimization' Merge file descriptor allocation speedup. Eric Dumazet has a test-case for a fairly common network deamon load pattern: openign and closing a lot of sockets that each have very little work done on them. It turns out that in that case, the cost of just finding the correct file descriptor number can be a dominating factor. We've long had a trivial optimization for allocating file descriptors sequentially, but that optimization ends up being not very effective when other file descriptors are being closed concurrently, and the fd patterns are not some simple FIFO pattern. In such cases we ended up spending a lot of time just scanning the bitmap of open file descriptors in order to find the next file descriptor number to open. This trivial patch-series mitigates that by simply introducing a second-level bitmap of which words in the first bitmap are already fully allocated. That cuts down the cost of scanning by an order of magnitude in some pathological (but realistic) cases. The second patch is an even more trivial patch to avoid unnecessarily dirtying the cacheline for the close-on-exec bit array that normally ends up being all empty. * fs-file-descriptor-optimization: vfs: conditionally clear close-on-exec flag vfs: Fix pathological performance case for __alloc_fd()	2015-11-01 16:43:24 -08:00
Linus Torvalds	fc90888d07	vfs: conditionally clear close-on-exec flag We clear the close-on-exec flag when opening and closing files, and the bit was almost always already clear before. Avoid dirtying the cacheline if the clearning isn't necessary. That avoids unnecessary cacheline dirtying and bouncing in multi-socket environments. Eric Dumazet has a file descriptor benchmark that goes 4% faster from this on his two-socket machine. It's probably partly superlinear improvement due to getting slightly less spinlock contention on the file_lock spinlock due to less work in the critical section. Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-10-31 16:14:51 -07:00
Linus Torvalds	f3f86e33dc	vfs: Fix pathological performance case for __alloc_fd() Al Viro points out that: > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > with some use patterns. The cacheline pingpong in the bitmap is probably > > inevitable, unless we accept considerably heavier memory footprint, > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > to trigger - close(3);open(...); will have the next open() after that > > scanning the entire in-use bitmap. And Eric Dumazet has a somewhat realistic multithreaded microbenchmark that opens and closes a lot of sockets with minimal work per socket. This patch largely fixes it. We keep a 2nd-level bitmap of the open file bitmaps, showing which words are already full. So then we can traverse that second-level bitmap to efficiently skip already allocated file descriptors. On his benchmark, this improves performance by up to an order of magnitude, by avoiding the excessive open file bitmap scanning. Tested-and-acked-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-10-31 16:12:10 -07:00
Linus Torvalds	4bb0fb57f3	Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs bug fixes from Miklos Szeredi: "This contains fixes for bugs that appeared in earlier kernels (all are marked for -stable)" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: free lower_mnt array in ovl_put_super ovl: free stack of paths in ovl_fill_super ovl: fix open in stacked overlay ovl: fix dentry reference leak ovl: use O_LARGEFILE in ovl_copy_up()	2015-10-31 14:49:19 -07:00
Tejun Heo	b33e18f61b	fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue() to walk @bdi->wb_list. To set up the initial iteration condition, it uses list_entry_rcu() to calculate the entry pointer corresponding to the list head; however, this isn't an actual RCU dereference and using list_entry_rcu() for it ended up breaking a proposed list_entry_rcu() change because it was feeding an non-lvalue pointer into the macro. Don't use the RCU variant for simple pointer offsetting. Use list_entry() instead. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Patrick Marlier <patrick.marlier@gmail.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pranith kumar <bobby.prani@gmail.com> Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2015-10-28 13:17:30 +01:00
Dirk Steinmetz	f2ca379642	namei: permit linking with CAP_FOWNER in userns Attempting to hardlink to an unsafe file (e.g. a setuid binary) from within an unprivileged user namespace fails, even if CAP_FOWNER is held within the namespace. This may cause various failures, such as a gentoo installation within a lxc container failing to build and install specific packages. This change permits hardlinking of files owned by mapped uids, if CAP_FOWNER is held for that namespace. Furthermore, it improves consistency by using the existing inode_owner_or_capable(), which is aware of namespaced capabilities as of `23adbe12ef` ("fs,userns: Change inode_capable to capable_wrt_inode_uidgid"). Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com> This is hitting us in Ubuntu during some dpkg upgrades in containers. When upgrading a file dpkg creates a hard link to the old file to back it up before overwriting it. When packages upgrade suid files owned by a non-root user the link isn't permitted, and the package upgrade fails. This patch fixes our problem. Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2015-10-27 16:12:35 -05:00
Linus Torvalds	ea1ee5ff1b	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)	2015-10-24 07:20:57 +09:00
Linus Torvalds	37902bc190	Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have two more small fixes this week: Qu's fix avoids unneeded COW during fallocate, and Christian found a memory leak in the error handling of an earlier fix" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix possible leak in btrfs_ioctl_balance() btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size	2015-10-24 07:17:58 +09:00
Joseph Qi	b67de018b3	ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put dlm_lockres_put will call dlm_lockres_release if it is the last reference, and then it may call dlm_print_one_lock_resource and take lockres spinlock. So unlock lockres spinlock before dlm_lockres_put to avoid deadlock. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-10-23 17:55:10 +09:00
Benjamin Coddington	616fb38fa7	locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait All callers use locks_lock_inode_wait() instead. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>	2015-10-22 14:57:42 -04:00

1 2 3 4 5 ...

42150 Commits