In preparation for including the casefold feature within f2fs, elevate
the EXT4_CASEFOLD_FL flag to FS_CASEFOLD_FL.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
vfree() should not be called from interrupt context, so move it
out of spin_lock_irqsave() coverage.
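A minimal user-space sketch of the pattern, not the actual f2fs change: detach the pointer while the lock is held and free it only after the lock is dropped. fake_lock(), fake_unlock(), fake_vfree() and struct cache are hypothetical stand-ins for the kernel primitives and data structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for spin_lock_irqsave()/spin_unlock_irqrestore(). */
static int irqs_disabled_depth;
static void fake_lock(void)   { irqs_disabled_depth++; }
static void fake_unlock(void) { irqs_disabled_depth--; }

/* Stand-in for vfree(): asserts it is never reached with the lock held. */
static void fake_vfree(void *p)
{
	assert(irqs_disabled_depth == 0);
	free(p);
}

struct cache {
	void *buf;
};

/* Detach the buffer under the lock, free it after the lock is dropped. */
static int cache_reset(struct cache *c)
{
	void *old;

	fake_lock();
	old = c->buf;	/* only pointer manipulation inside the critical section */
	c->buf = NULL;
	fake_unlock();

	fake_vfree(old);	/* the free itself runs unlocked */
	return irqs_disabled_depth;
}
```

The critical section shrinks to two pointer assignments, which is usually what one wants under an irqsave lock anyway.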
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In fill_super() and put_super(), f2fs_destroy_stats() is called
before f2fs_destroy_segment_manager(), so if the current sbi can
still be visited in the global stat list, SM_I(sbi) cannot have been
released yet.
For this reason, SM_I(sbi) does not need to be checked in
update_general_status().
Thanks to Chao Yu for the advice.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Atomic write needs the page cache to hold the data of a transaction,
so direct IO should never be allowed in atomic write. Detect
and deny it when opening an atomic write file.
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With the quota_ino feature on, generic/232 reports an inconsistency
on the image.
The root cause is that the testcase tries to:
- use quotactl to shut down journalled quota based on the sysfile;
- and then use quotactl to enable/turn on quota based on a specific file
(aquota.user or aquota.group).
Eventually, the quota sysfile goes out of date because of the subsequent
specific file creation.
Change as below to fix this issue:
- deny enabling quota based on a specific file if the quota sysfile exists.
- set SBI_QUOTA_NEED_REPAIR once sysfile-based quota is shut down via
ioctl.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-EIO must be returned if the filesystem has been shut down; fix the
missed case in f2fs_setxattr().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We failed to call f2fs_is_checkpoint_ready() in several places, which may
allow space allocation even when free space has been exhausted while
checkpoint is disabled. Add the missing calls.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adjust f2fs_fiemap() to support fiemap() on directory inode.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
=============================================================================
BUG discard_cmd (Tainted: G B OE ): Objects remaining in discard_cmd on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0xffffe1ac481d22c0 objects=36 used=2 fp=0xffff936b4748bf50 flags=0x2ffff0000000100
Call Trace:
dump_stack+0x63/0x87
slab_err+0xa1/0xb0
__kmem_cache_shutdown+0x183/0x390
shutdown_cache+0x14/0x110
kmem_cache_destroy+0x195/0x1c0
f2fs_destroy_segment_manager_caches+0x21/0x40 [f2fs]
exit_f2fs_fs+0x35/0x641 [f2fs]
SyS_delete_module+0x155/0x230
? vtime_user_exit+0x29/0x70
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
INFO: Object 0xffff936b4748b000 @offset=0
INFO: Object 0xffff936b4748b070 @offset=112
kmem_cache_destroy discard_cmd: Slab cache still has objects
Call Trace:
dump_stack+0x63/0x87
kmem_cache_destroy+0x1b4/0x1c0
f2fs_destroy_segment_manager_caches+0x21/0x40 [f2fs]
exit_f2fs_fs+0x35/0x641 [f2fs]
SyS_delete_module+0x155/0x230
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
Recovery can cache discard commands, so in the error path of fill_super()
we need to give them a chance to be handled; otherwise the discard_cmd
slab cache will leak objects.
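A rough user-space sketch of the idea, assuming a simple linked list of cached commands; struct sketch_sb, queue_cmd(), drain_cmds() and sketch_fill_super() are hypothetical names and not the f2fs code. The point is only that the error path drains whatever recovery queued before the cache is torn down.

```c
#include <stdlib.h>

struct sketch_cmd {
	struct sketch_cmd *next;
};

struct sketch_sb {
	struct sketch_cmd *pending;	/* commands cached during "recovery" */
	int live_objects;		/* objects still allocated */
};

static void queue_cmd(struct sketch_sb *sb)
{
	struct sketch_cmd *c = malloc(sizeof(*c));

	if (!c)
		return;
	c->next = sb->pending;
	sb->pending = c;
	sb->live_objects++;
}

static void drain_cmds(struct sketch_sb *sb)
{
	while (sb->pending) {
		struct sketch_cmd *c = sb->pending;

		sb->pending = c->next;
		free(c);
		sb->live_objects--;
	}
}

/* Returns 0 on success, -1 on failure; fail_after_recovery forces the error path. */
static int sketch_fill_super(struct sketch_sb *sb, int fail_after_recovery)
{
	sb->pending = NULL;
	sb->live_objects = 0;

	queue_cmd(sb);	/* "recovery" caches some commands */
	queue_cmd(sb);

	if (fail_after_recovery)
		goto err;
	return 0;

err:
	drain_cmds(sb);	/* without this, the queued objects would be leaked */
	return -1;
}
```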
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
On a quota-disabled image, with fault injection, SBI_QUOTA_NEED_REPAIR
is set incorrectly in the error path of f2fs_evict_inode(). Fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=204193
A null pointer dereference bug is triggered in f2fs under kernel-5.1.3.
kasan_report.cold+0x5/0x32
f2fs_write_end_io+0x215/0x650
bio_endio+0x26e/0x320
blk_update_request+0x209/0x5d0
blk_mq_end_request+0x2e/0x230
lo_complete_rq+0x12c/0x190
blk_done_softirq+0x14a/0x1a0
__do_softirq+0x119/0x3e5
irq_exit+0x94/0xe0
call_function_single_interrupt+0xf/0x20
During umount, we will access NULL sbi->node_inode pointer in
f2fs_write_end_io():
f2fs_bug_on(sbi, page->mapping == NODE_MAPPING(sbi) &&
page->index != nid_of_node(page));
The reason is that if the disable_checkpoint mount option is on, dirty
meta pages can remain during umount and then be flushed by iput() of
meta_inode; however, node_inode has already been iput()ed before
meta_inode's iput().
Since checkpoint is disabled, all meta/node data is useless and
should be dropped on the next mount, so in umount, let's adjust
drop_inode() to give a hint to iput_final() to drop all that dirty
data correctly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the IO alignment feature is turned on by remount, its mempool is
never initialized, so we hit a panic during IO submission when
accessing the NULL mempool pointer.
This feature should be set only at mount time, so simply deny
configuring it during remount.
This fixes bug reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=204135
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since 07173c3ec2 ("block: enable multipage bvecs"), one bio vector
can store multiple pages, so we can no longer calculate the max IO size
of a bio as PAGE_SIZE * bio->bi_max_vecs. However, the IO alignment
feature of f2fs always relies on that assumption, so it may cause a
panic during IO submission, as in the stack below.
kernel BUG at fs/f2fs/data.c:317!
RIP: 0010:__submit_merged_bio+0x8b0/0x8c0
Call Trace:
f2fs_submit_page_write+0x3cd/0xdd0
do_write_page+0x15d/0x360
f2fs_outplace_write_data+0xd7/0x210
f2fs_do_write_data_page+0x43b/0xf30
__write_data_page+0xcf6/0x1140
f2fs_write_cache_pages+0x3ba/0xb40
f2fs_write_data_pages+0x3dd/0x8b0
do_writepages+0xbb/0x1e0
__writeback_single_inode+0xb6/0x800
writeback_sb_inodes+0x441/0x910
wb_writeback+0x261/0x650
wb_workfn+0x1f9/0x7a0
process_one_work+0x503/0x970
worker_thread+0x7d/0x820
kthread+0x1ad/0x210
ret_from_fork+0x35/0x40
This patch adds one extra condition to check the space left in the bio
when trying to merge a page into it, to avoid the panic.
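One plausible shape for such a check, sketched under stated assumptions and not taken from the patch: struct sketch_bio and sketch_can_merge() are hypothetical, io_blocks is the aligned IO unit in blocks (assumed non-zero), and the worst case of one vector per block is assumed.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the fields of a bio that matter here. */
struct sketch_bio {
	unsigned int filled_blocks;	/* blocks already queued in the bio */
	unsigned int vcnt;		/* bio vectors in use */
	unsigned int max_vecs;		/* bio vectors available in total */
};

/*
 * Refuse to start a new aligned IO unit when the vectors left in the bio
 * could not hold a whole unit (assuming one vector per block at worst).
 */
static bool sketch_can_merge(const struct sketch_bio *bio, unsigned int io_blocks)
{
	unsigned int left_vecs = bio->max_vecs - bio->vcnt;

	if (bio->filled_blocks % io_blocks == 0 && left_vecs < io_blocks)
		return false;
	return true;
}
```

When the check fails, the caller would submit the current bio and start a new one instead of merging.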
This bug was reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=204043
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Wrap merge condition into function for readability, no logic change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Don't taint the kernel if CPUs have different sets of page sizes
supported (other than the one in use).
- Issue I-cache maintenance for module ftrace trampoline.
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: ftrace: Ensure module ftrace trampoline is coherent with I-side
arm64: cpufeature: Don't treat granule sizes as strict
The initial support for dynamic ftrace trampolines in modules made use
of an indirect branch which loaded its target from the beginning of
a special section (e71a4e1beb ("arm64: ftrace: add support for far
branches to dynamic ftrace")). Since no instructions were being patched,
no cache maintenance was needed. However, later in be0f272bfc ("arm64:
ftrace: emit ftrace-mod.o contents through code") this code was reworked
to output the trampoline instructions directly into the PLT entry but,
unfortunately, the necessary cache maintenance was overlooked.
Add a call to __flush_icache_range() after writing the new trampoline
instructions but before patching in the branch to the trampoline.
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: James Morse <james.morse@arm.com>
Cc: <stable@vger.kernel.org>
Fixes: be0f272bfc ("arm64: ftrace: emit ftrace-mod.o contents through code")
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Merge tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These add a check to avoid recent suspend-to-idle power regression on
systems with NVMe drives where the PCIe ASPM policy is "performance"
(or when the kernel is built without ASPM support), fix an issue
related to frequency limits in the schedutil cpufreq governor and fix
a mistake related to the PM QoS usage in the cpufreq core introduced
recently.
Specifics:
- Disable NVMe power optimization related to suspend-to-idle added
recently on systems where PCIe ASPM is not able to put PCIe links
into low-power states to prevent excess power from being drawn by
the system while suspended (Rafael Wysocki).
- Make the schedutil governor handle frequency limits changes
properly in all cases (Viresh Kumar).
- Prevent the cpufreq core from treating positive values returned by
dev_pm_qos_update_request() as errors (Viresh Kumar)"
* tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
PCI/ASPM: Add pcie_aspm_enabled()
cpufreq: schedutil: Don't skip freq update when limits change
cpufreq: dev_pm_qos_update_request() can return 1 on success
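A small sketch of the return-value handling the last commit title describes. stub_update_request() is a hypothetical stand-in that only mimics a "negative on error, 0 or 1 on success" convention; it is not the kernel function, and set_limit() is an illustrative caller.

```c
#include <errno.h>

/*
 * Hypothetical stub: returns 1 if the stored value changed, 0 if it was
 * already equal, and a negative errno on failure.
 */
static int stub_update_request(int *stored, int new_value)
{
	if (new_value < 0)
		return -EINVAL;
	if (*stored == new_value)
		return 0;
	*stored = new_value;
	return 1;
}

/* Caller that normalizes the result: only negative values are errors. */
static int set_limit(int *stored, int new_value)
{
	int ret = stub_update_request(stored, new_value);

	return ret < 0 ? ret : 0;
}
```

A caller that tested `if (ret)` instead of `if (ret < 0)` would report a successful update as a failure.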
Merge tag 'sound-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"All small fixes targeted for stable:
- Two fixes for USB-audio with malformed descriptor, spotted by
fuzzers
- Two fixes for the Conexant HD-audio codec wrt power management
- Quirks for HD-audio AMD platform and HP laptop
- HD-audio memory leak fix"
* tag 'sound-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Fix a stack buffer overflow bug in check_input_term
ALSA: usb-audio: Fix an OOB bug in parse_audio_mixer_unit
ALSA: hda - Add a generic reboot_notify
ALSA: hda - Let all conexant codec enter D3 when rebooting
ALSA: hda/realtek - Add quirk for HP Envy x360
ALSA: hda - Fix a memory leak bug
ALSA: hda - Apply workaround for another AMD chip 1022:1487
Merge tag 'drm-fixes-2019-08-16' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Nothing too crazy this week, one amdgpu fix to use vmalloc for a
struct that grew in size, and another MST fix for nouveau, and some
other misc fixes:
i915:
- single GVT use after free fix
scheduler:
- entity destruction race fix
amdgpu:
- struct allocation fix
- gfx9 soft recovery fix
nouveau:
- followup MST fix
ast:
- vga register race fix"
* tag 'drm-fixes-2019-08-16' of git://anongit.freedesktop.org/drm/drm:
drm/nouveau: Only recalculate PBN/VCPI on mode/connector changes
drm/ast: Fixed reboot test may cause system hanged
drm/scheduler: use job count instead of peek
drm/amd/display: use kvmalloc for dc_state (v2)
drm/amdgpu: fix gfx9 soft recovery
drm/i915: Use after free in error path in intel_vgpu_create_workload()
`check_input_term` recursively calls itself with input from the
device side (e.g., uac_input_terminal_descriptor.bCSourceID)
as its argument (id). If `check_input_term` is called with the same
`id` argument as its caller, it triggers endless recursion, resulting
in a kernel stack overflow.
This patch fixes the bug by adding a bitmap to `struct mixer_build`
to keep track of the checked ids and stop the execution if an id
has already been checked (similar to how parse_audio_unit handles the
unitid argument).
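The visited-bitmap idea can be sketched in user space as follows. This is an illustration only: struct sketch_state, its source_of table (standing in for device-supplied source ids, with 0 meaning "no further source") and resolve_terminal() are hypothetical and do not mirror the driver's data structures.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_ID 256

struct sketch_state {
	unsigned char visited[MAX_ID / 8];	/* one bit per id */
	unsigned char source_of[MAX_ID];	/* device-supplied: id -> source id */
};

/* Mark id as visited and report whether it had been visited before. */
static bool test_and_set_visited(struct sketch_state *s, unsigned int id)
{
	unsigned char mask = 1u << (id % 8);
	bool was_set = s->visited[id / 8] & mask;

	s->visited[id / 8] |= mask;
	return was_set;
}

/* Follows the source chain; returns the terminating id, or -1 on a loop. */
static int resolve_terminal(struct sketch_state *s, unsigned int id)
{
	if (id >= MAX_ID)
		return -1;
	if (test_and_set_visited(s, id))
		return -1;	/* already checked: the descriptors form a loop */
	if (s->source_of[id] == 0)
		return (int)id;	/* end of the chain in this sketch */
	return resolve_terminal(s, s->source_of[id]);
}
```

Because each id can be marked only once, the recursion depth is bounded by MAX_ID no matter what the device reports.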
Reported-by: Hui Peng <benquike@gmail.com>
Reported-by: Mathias Payer <mathias.payer@nebelwelt.net>
Signed-off-by: Hui Peng <benquike@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Merge tag 'xfs-5.3-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix crashes when the attr fork isn't present due to errors but inode
inactivation tries to zap the attr data anyway.
- Convert more directory corruption debugging asserts to actual
EFSCORRUPTED returns instead of blowing up later on.
- Don't fail writeback just because we ran out of memory allocating
metadata log data.
* tag 'xfs-5.3-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't crash on null attr fork xfs_bmapi_read
xfs: remove more ondisk directory corruption asserts
fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
Merge tag 'iomap-5.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixlet from Darrick Wong:
"A single update to the MAINTAINERS entry for iomap now that we've
removed fs/iomap.c"
* tag 'iomap-5.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
MAINTAINERS: iomap: Remove fs/iomap.c record
Merge tag 'auxdisplay-for-linus-v5.3-rc5' of git://github.com/ojeda/linux
Pull auxdisplay fixes from Miguel Ojeda:
"A few minor auxdisplay improvements:
- A couple of small header cleanups for charlcd (Masahiro Yamada)
- A trivial typo fix for the examples of cfag12864b (Masahiro Yamada)
- A Kconfig help text improvement for charlcd (Mans Rullgard)
- An error path fix for panel (zhengbin)"
* tag 'auxdisplay-for-linus-v5.3-rc5' of git://github.com/ojeda/linux:
auxdisplay: Fix a typo in cfag12864b-example.c
auxdisplay: charlcd: add include guard to charlcd.h
auxdisplay: charlcd: move charlcd.h to drivers/auxdisplay
auxdisplay: charlcd: add help text for backlight initial state
auxdisplay: panel: need to delete scan_timer when misc_register fails in panel_attach
Merge tag 'devicetree-fixes-for-5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Fix building DT binding examples for in tree builds
- Correct some refcounting in adjust_local_phandle_references()
- Update FSL FEC binding with deprecated properties
- Schema fix in stm32 pinctrl
- Fix typo in of_irq_parse_one docbook comment
* tag 'devicetree-fixes-for-5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of: irq: fix a trivial typo in a doc comment
dt-bindings: pinctrl: stm32: Fix 'st,syscfg' schema
dt-bindings: fec: explicitly mark deprecated properties
of: resolver: Add of_node_put() before return and break
dt-bindings: Fix generated example files getting added to schemas
drm-fixes-5.3-2019-08-14:
amdgpu:
- Use kvmalloc for dc_state to avoid allocation
failures in some cases.
- Fix gfx9 soft recovery
scheduler:
- Fix a race condition when destroying entities
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190815024919.3434-1-alexander.deucher@amd.com
I -thought- I had fixed this entirely, but it looks like I didn't
test this thoroughly enough, as we apparently still make one big mistake
with nv50_msto_atomic_check() - we don't handle the following scenario:
* CRTC #1 has n VCPI allocated to it, is attached to connector DP-4
which is attached to encoder #1. enabled=y active=n
* CRTC #1 is changed from DP-4 to DP-5, causing:
* DP-4 crtc=#1→NULL (VCPI n→0)
* DP-5 crtc=NULL→#1
* CRTC #1 steals encoder #1 back from DP-4 and gives it to DP-5
* CRTC #1 maintains the same mode as before, just with a different
connector
* mode_changed=n connectors_changed=y
(we _SHOULD_ do VCPI 0→n here, but don't)
Once the above scenario is repeated once, we'll attempt freeing VCPI
from the connector that we didn't allocate due to the connectors
changing, but the mode staying the same. Sigh.
Since nv50_msto_atomic_check() has broken a few times now, let's rethink
things a bit to be more careful: limit both VCPI/PBN allocations to
mode_changed || connectors_changed, since neither VCPI nor PBN should
ever need to change outside of routing and mode changes.
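The gating condition can be sketched as below. struct sketch_crtc_state, struct sketch_connector and sketch_atomic_check() are hypothetical simplifications, not the DRM or nouveau types; the sketch only shows that a connector change with an unchanged mode still triggers the recalculation.

```c
#include <stdbool.h>

struct sketch_crtc_state {
	bool mode_changed;
	bool connectors_changed;
};

struct sketch_connector {
	int vcpi_slots;
};

/* Recompute the allocation only when routing or mode changed; otherwise keep it. */
static void sketch_atomic_check(const struct sketch_crtc_state *crtc,
				struct sketch_connector *conn, int wanted_slots)
{
	if (!crtc->mode_changed && !crtc->connectors_changed)
		return;
	conn->vcpi_slots = wanted_slots;
}
```

With mode_changed=n and connectors_changed=y, as in the scenario above, the new connector gets its slots instead of being skipped.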
Changes since v1:
* Fix accidental reversal of clock and bpp arguments in
drm_dp_calc_pbn_mode() - William Lewis
Signed-off-by: Lyude Paul <lyude@redhat.com>
Reported-by: Bohdan Milar <bmilar@redhat.com>
Tested-by: Bohdan Milar <bmilar@redhat.com>
Fixes: 232c9eec41 ("drm/nouveau: Use atomic VCPI helpers for MST")
References: 412e85b605 ("drm/nouveau: Only release VCPI slots on mode changes")
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@redhat.com>
Cc: Jerry Zuo <Jerry.Zuo@amd.com>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Juston Li <juston.li@intel.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Karol Herbst <karolherbst@gmail.com>
Cc: Ilia Mirkin <imirkin@alum.mit.edu>
Cc: <stable@vger.kernel.org> # v5.1+
Acked-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190809005307.18391-1-lyude@redhat.com
Diverged from what the code does with commit 530210c781 ("of/irq: Replace
of_irq with of_phandle_args").
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: Rob Herring <robh@kernel.org>
The proper way to add additional constraints to an existing json-schema
is using 'allOf' to reference the base schema. Using just '$ref' doesn't
work. Fix this for the 'st,syscfg' property.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: linux-gpio@vger.kernel.org
Cc: linux-stm32@st-md-mailman.stormreply.com
Cc: linux-arm-kernel@lists.infradead.org
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Merge tag 'Wimplicit-fallthrough-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull fallthrough fixes from Gustavo A. R. Silva:
"Fix sh mainline builds:
- Fix fall-through warning in sh.
- Fix missing break bug in sh (this is a 10-year-old bug)
Currently, mainline builds for sh are broken. These patches fix that"
* tag 'Wimplicit-fallthrough-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
sh: kernel: hw_breakpoint: Fix missing break in switch statement
sh: kernel: disassemble: Mark expected switch fall-throughs
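For illustration, a generic switch showing both kinds of fix named in the two commit titles above: a break that must be present, and an intentional fall-through that is marked as such. enum sketch_op and op_cost() are hypothetical and unrelated to the sh code.

```c
enum sketch_op { OP_READ, OP_SYNC_WRITE, OP_WRITE };

/* Returns the cost of an operation and reports via *sync whether it is synchronous. */
static int op_cost(enum sketch_op op, int *sync)
{
	int cost = 0;

	*sync = 0;
	switch (op) {
	case OP_READ:
		cost = 1;
		/* Required: without this break a read would be handled as a sync write. */
		break;
	case OP_SYNC_WRITE:
		*sync = 1;
		/* fall through */
	case OP_WRITE:
		cost = 2;
		break;
	}
	return cost;
}
```

The fall-through comment documents that a sync write deliberately shares the write path, which is what lets -Wimplicit-fallthrough tell intent from accident.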
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Merge tag 'afs-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs fixes from David Howells:
- Fix the CB.ProbeUuid handler to generate its reply correctly.
- Fix a mix up in indices when parsing a Volume Location entry record.
- Fix a potential NULL-pointer deref when cleaning up a read request.
- Fix the expected data version of the destination directory in
afs_rename().
- Fix afs_d_revalidate() to only update d_fsdata if it's not the same
as the directory data version to reduce the likelihood of overwriting
the result of a competing operation. (d_fsdata carries the directory
DV or the least-significant word thereof).
- Fix the tracking of the data-version on a directory and make sure
that dentry objects get properly initialised, updated and
revalidated.
Also fix rename to update d_fsdata to match the new directory's DV if
the dentry gets moved over and unhash the dentry to stop
afs_d_revalidate() from interfering.
* tag 'afs-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix missing dentry data version updating
afs: Only update d_fsdata if different in afs_d_revalidate()
afs: Fix off-by-one in afs_rename() expected data version calculation
fs: afs: Fix a possible null-pointer dereference in afs_put_read()
afs: Fix loop index mixup in afs_deliver_vl_get_entry_by_name_u()
afs: Fix the CB.ProbeUuid service handler to reply correctly
The spsc_queue_peek function accesses queue->head, which belongs to
the consumer thread and shouldn't be accessed by the producer.
This fixes a rare race condition when destroying entities.
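A simplified sketch of the ownership rule, using a small ring buffer with C11 atomics rather than the kernel's spsc_queue: head is touched only by the consumer, tail only by the producer, and the shared job_count is the only field either side uses to decide whether the queue is idle. All names here are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_CAP 8

struct sketch_queue {
	int slots[SKETCH_CAP];
	unsigned int head;	/* consumer-owned index */
	unsigned int tail;	/* producer-owned index */
	atomic_uint job_count;	/* shared: safe to read from either side */
};

static void sketch_init(struct sketch_queue *q)
{
	q->head = 0;
	q->tail = 0;
	atomic_init(&q->job_count, 0);
}

/* Producer side: never looks at head. */
static bool sketch_push(struct sketch_queue *q, int job)
{
	if (atomic_load(&q->job_count) == SKETCH_CAP)
		return false;
	q->slots[q->tail % SKETCH_CAP] = job;
	q->tail++;
	atomic_fetch_add(&q->job_count, 1);
	return true;
}

/* Consumer side: never looks at tail. */
static bool sketch_pop(struct sketch_queue *q, int *job)
{
	if (atomic_load(&q->job_count) == 0)
		return false;
	*job = q->slots[q->head % SKETCH_CAP];
	q->head++;
	atomic_fetch_sub(&q->job_count, 1);
	return true;
}

/* Idle check usable from any thread: reads the counter, not head. */
static bool sketch_is_idle(struct sketch_queue *q)
{
	return atomic_load(&q->job_count) == 0;
}
```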
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Monk.liu@amd.com
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Doug Ledford:
"Fairly small pull request for -rc3. I'm out of town the rest of this
week, so I made sure to clean out as much as possible from patchworks
in enough time for 0-day to chew through it (Yay! for 0-day being back
online! :-)). Jason might send through any emergency stuff that could
pop up, otherwise I'm back next week.
The only real thing of note is the siw ABI change. Since we just
merged siw *this* release, there are no prior kernel releases to
maintain kernel ABI with. I told Bernard that if there is anything
else about the siw ABI he thinks he might want to change before it
goes set in stone, he should get it in ASAP. The siw module was around
for several years outside the kernel tree, and it had to be revamped
considerably for inclusion upstream, so we are making no attempts to
be backward compatible with the out of tree version. Once 5.3 is
actually released, we will have our baseline ABI to maintain.
Summary:
- Fix a memory registration release flow issue that was causing a
WARN_ON (mlx5)
- If the counters for a port aren't allocated, then we can't do
operations on the non-existent counters (core)
- Check the right variable for error code result (mlx5)
- Fix a use after free issue (mlx5)
- Fix an off by one memory leak (siw)
- Actually return an error code on error (core)
- Allow siw to be built on 32bit arches (siw, ABI change, but OK
since siw was just merged this merge window and there is no prior
released kernel to maintain compatibility with and we also updated
the rdma-core user space package to match)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/siw: Change CQ flags from 64->32 bits
RDMA/core: Fix error code in stat_get_doit_qp()
RDMA/siw: Fix a memory leak in siw_init_cpulist()
IB/mlx5: Fix use-after-free error while accessing ev_file pointer
IB/mlx5: Check the correct variable in error handling code
RDMA/counter: Prevent QP counter binding if counters unsupported
IB/mlx5: Fix implicit MR release flow
The `uac_mixer_unit_descriptor` shown below is read from the
device side. In `parse_audio_mixer_unit`, the `baSourceID` field is
accessed from index 0 to `bNrInPins` - 1. The current implementation
assumes that the descriptor is always valid (that its length
is no shorter than 5 + `bNrInPins`). If a descriptor read from
the device side is invalid, it may trigger an out-of-bounds memory
access.
```
struct uac_mixer_unit_descriptor {
	__u8 bLength;
	__u8 bDescriptorType;
	__u8 bDescriptorSubtype;
	__u8 bUnitID;
	__u8 bNrInPins;
	__u8 baSourceID[];
};
```
This patch fixes the bug by adding a sanity check on the length of
the descriptor.
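A minimal sketch of such a length check, written as a user-space model. The
struct mirrors the descriptor layout with `uint8_t` in place of `__u8`, and
the helper name `mixer_desc_is_valid` is hypothetical; this is not the code
from the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the descriptor layout (uint8_t instead of __u8). */
struct uac_mixer_unit_descriptor {
	uint8_t bLength;
	uint8_t bDescriptorType;
	uint8_t bDescriptorSubtype;
	uint8_t bUnitID;
	uint8_t bNrInPins;
	uint8_t baSourceID[];
};

/*
 * Hypothetical helper: true only if the descriptor claims at least
 * 5 + bNrInPins bytes, so that baSourceID[0 .. bNrInPins - 1] lies
 * inside the descriptor. sizeof(*desc) is the 5-byte fixed header.
 */
static bool mixer_desc_is_valid(const struct uac_mixer_unit_descriptor *desc)
{
	size_t needed = sizeof(*desc) + desc->bNrInPins;

	return desc->bLength >= needed;
}
```

A caller would reject the unit when the helper returns false, before
touching `baSourceID`. A real parser would presumably also check `bLength`
against the size of the buffer the descriptor was read into.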
Reported-by: Hui Peng <benquike@gmail.com>
Reported-by: Mathias Payer <mathias.payer@nebelwelt.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Hui Peng <benquike@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix the handling of the bus_dma_mask in dma_get_required_mask, which
caused a regression in this merge window (Lucas Stach)
- fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)
- fix dma_mmap_coherent to not cause page attribute mismatches on
coherent architectures like x86 (me)
* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: fix page attributes for dma_mmap_*
dma-direct: don't truncate dma_required_mask to bus addressing capabilities
dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
Merge tag 'iommu-fixes-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- A couple more fixes for the Intel VT-d driver for bugs introduced
during the recent conversion of this driver to use IOMMU core default
domains.
- Fix for common dma-iommu code to make sure MSI mappings happen in the
correct domain for a device.
- Fix a corner case in the handling of sg-lists in dma-iommu code that
might cause dma_length to be truncated.
- Mark a switch as fall-through in arm-smmu code.
* tag 'iommu-fixes-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix possible use-after-free of private domain
iommu/vt-d: Detach domain before using a private one
iommu/dma: Handle SG length overflow better
iommu/vt-d: Correctly check format of page table in debugfs
iommu/vt-d: Detach domain when move device out of group
iommu/arm-smmu: Mark expected switch fall-through
iommu/dma: Handle MSI mappings separately
Merge misc VM fixes from Andrew Morton:
"A bunch of hotfixes, all affecting mm/.
The two-patch series from Andrea may be controversial. This restores
patches which were reverted in Dec 2018 due to a regression report [*].
After extensive discussion it is evident that the problems which these
patches solved were significantly more serious than the problems they
introduced. I am told that major distros are already carrying these
two patches for this reason"
[*] See
https://lore.kernel.org/lkml/alpine.DEB.2.21.1812061343240.144733@chino.kir.corp.google.com/
https://lore.kernel.org/lkml/alpine.DEB.2.21.1812031545560.161134@chino.kir.corp.google.com/
for the google-specific issues brought up by David Rientjes. And as
Andrew says:
"I'm unaware of anyone else who will be adversely affected by this,
and google already carries over a thousand kernel patches - another
won't kill them.
There has been sporadic discussion about fixing these things for
real but it's clear that nobody apart from David is particularly
motivated"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS
mm, vmscan: do not special-case slab reclaim when watermarks are boosted
Revert "mm, thp: restore node-local hugepage allocations"
Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used
seq_file: fix problem when seeking mid-record
mm: workingset: fix vmstat counters for shadow nodes
mm/usercopy: use memory range to be accessed for wraparound check
mm: kmemleak: disable early logging in case of error
mm/vmalloc.c: fix percpu free VM area search criteria
mm/memcontrol.c: fix use after free in mem_cgroup_iter()
mm/z3fold.c: fix z3fold_destroy_pool() race condition
mm/z3fold.c: fix z3fold_destroy_pool() ordering
mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified
mm/hmm: fix bad subpage pointer in try_to_unmap_one
mm/hmm: fix ZONE_DEVICE anon page mapping reuse
mm: document zone device struct page field usage
Making the codec enter D3 before rebooting or powering off fixes the
noise issue on some laptops. In theory it is harmless for all codecs
to enter D3 before rebooting or powering off, so add a generic
reboot_notify that the realtek and conexant drivers can call.
Cc: stable@vger.kernel.org
Signed-off-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
We have 3 new Lenovo laptops with the conexant codec 0x14f11f86.
These 3 laptops also have the noise issue when rebooting; after
letting the codec enter D3 before rebooting or powering off, the noise
disappears.
Instead of adding yet another ID in reboot_notify(), make this
function apply to all conexant codecs. In theory, making the codec enter
D3 before rebooting or powering off is harmless, and I tested this change
on a couple of other Lenovo laptops which have different conexant
codecs; there is no side effect so far.
Cc: stable@vger.kernel.org
Signed-off-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
the kernel-v5.2.3 testing. This is caused by a race between hugetlb
page migration and page fault.
If a hugetlb page can not be allocated to satisfy a page fault, the task
is sent SIGBUS. This is normal hugetlbfs behavior. A hugetlb fault
mutex exists to prevent two tasks from trying to instantiate the same
page. This protects against the situation where there is only one
hugetlb page, and both tasks would try to allocate. Without the mutex,
one would fail and SIGBUS even though the other fault would be
successful.
There is a similar race between hugetlb page migration and fault.
Migration code will allocate a page for the target of the migration. It
will then unmap the original page from all page tables. It does this
unmap by first clearing the pte and then writing a migration entry. The
page table lock is held for the duration of this clear and write
operation. However, the beginning of the hugetlb page fault code
optimistically checks the pte without taking the page table lock. If the
pte is clear (as it can be during the migration unmap operation), a hugetlb
page allocation is attempted to satisfy the fault. Note that the page
which will eventually satisfy this fault was already allocated by the
migration code. However, the allocation within the fault path could
fail which would result in the task incorrectly being sent SIGBUS.
Ideally, we could take the hugetlb fault mutex in the migration code
when modifying the page tables. However, locks must be taken in the
order of hugetlb fault mutex, page lock, page table lock. This would
require significant rework of the migration code. Instead, the issue is
addressed in the hugetlb fault code. After failing to allocate a huge
page, take the page table lock and check for huge_pte_none before
returning an error. This is the same check that must be made further in
the code even if page allocation is successful.
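The recheck described above can be sketched as a small user-space model.
Everything here is a stand-in and an assumption for illustration: the
`model_` names are hypothetical, a C11 `atomic_flag` spinlock plays the role
of the page table lock, and a zero value plays the role of huge_pte_none().
It is not the kernel code from the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_FAULT_SIGBUS 1	/* stand-in for reporting SIGBUS */
#define MODEL_FAULT_DONE   0	/* nothing to report; the fault resolves elsewhere */

/* Toy page table entry: val == 0 means "none", anything else is populated. */
struct model_pte {
	atomic_flag lock;	/* stand-in for the page table lock */
	unsigned long val;
};

static void model_pte_lock(struct model_pte *pte)
{
	while (atomic_flag_test_and_set(&pte->lock))
		;	/* spin until the lock is free */
}

static void model_pte_unlock(struct model_pte *pte)
{
	atomic_flag_clear(&pte->lock);
}

/* Called only after the huge page allocation in the fault path has failed. */
static int model_alloc_failed(struct model_pte *pte)
{
	bool none;

	model_pte_lock(pte);
	none = (pte->val == 0);
	model_pte_unlock(pte);

	/*
	 * A populated entry means migration (or another fault) already
	 * handled this address, so the failed allocation is not an error.
	 */
	return none ? MODEL_FAULT_SIGBUS : MODEL_FAULT_DONE;
}
```

The point of the design is that the error is only reported once the entry
has been observed empty under the lock, which closes the window left by the
earlier unlocked check.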
Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Li Wang <liwang@redhat.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Chinner reported a problem pointing a finger at commit 1c30844d2d
("mm: reclaim small amounts of memory when an external fragmentation
event occurs").
The report is extensive:
https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
and it's worth recording the most relevant parts (colorful language and
typos included).
When running a simple, steady state 4kB file creation test to
simulate extracting tarballs larger than memory full of small
files into the filesystem, I noticed that once memory fills up
the cache balance goes to hell.
The workload is creating one dirty cached inode for every dirty
page, both of which should require a single IO each to clean and
reclaim, and creation of inodes is throttled by the rate at which
dirty writeback runs at (via balance dirty pages). Hence the ingest
rate of new cached inodes and page cache pages is identical and
steady. As a result, memory reclaim should quickly find a steady
balance between page cache and inode caches.
The moment memory fills, the page cache is reclaimed at a much
faster rate than the inode cache, and evidence suggests that
the inode cache shrinker is not being called when large batches
of pages are being reclaimed. In roughly the same time period
that it takes to fill memory with 50% pages and 50% slab caches,
memory reclaim reduces the page cache down to just dirty pages
and slab caches fill the entirety of memory.
The LRU is largely full of dirty pages, and we're getting spikes
of random writeback from memory reclaim so it's all going to shit.
Behaviour never recovers, the page cache remains pinned at just
dirty pages, and nothing I could tune would make any difference.
vfs_cache_pressure makes no difference - I would set it so high
it should trim the entire inode caches in a single pass, yet it
didn't do anything. It was clear from tracing and live telemetry
that the shrinkers were pretty much not running except when
there was absolutely no memory free at all, and then they did
the minimum necessary to free memory to make progress.
So I went looking at the code, trying to find places where pages
got reclaimed and the shrinkers weren't called. There's only one
- kswapd doing boosted reclaim as per commit 1c30844d2d ("mm:
reclaim small amounts of memory when an external fragmentation
event occurs").
The watermark boosting introduced by the commit is triggered in response
to an allocation "fragmentation event". The boosting was not intended
to target THP specifically and triggers even if THP is disabled.
However, with Dave's perfectly reasonable workload, fragmentation events
can be very common given the ratio of slab to page cache allocations so
boosting remains active for long periods of time.
As high-order allocations might use compaction and compaction cannot
move slab pages, the decision was made in the commit to special-case
kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
reclaiming slab does not directly help compaction.
As Dave notes, this decision means that slab can be artificially
protected for long periods of time, which messes up the balance between
slab and page caches.
Removing the special casing can still indirectly help avoid
fragmentation by avoiding fragmentation-causing events due to slab
allocation as pages from a slab pageblock will have some slab objects
freed. Furthermore, with the special casing, reclaim behaviour is
unpredictable as kswapd sometimes examines slab and sometimes does not
in a manner that is tricky to tune or analyse.
This patch removes the special casing. The downside is that this is not
a universal performance win. Some benchmarks that depend on the
residency of data when rereading metadata may see a regression when slab
reclaim is restored to its original behaviour. Similarly, some
benchmarks that only read-once or write-once may perform better when
page reclaim is too aggressive. The primary upside is that slab
shrinker is less surprising (arguably more sane but that's a matter of
opinion), behaves consistently regardless of the fragmentation state of
the system and properly obeys VM sysctls.
A fsmark benchmark configuration was constructed similar to what Dave
reported and is codified by the mmtest configuration
config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
machine to avoid dealing with NUMA-related issues and the timing of
reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
filesystem was used for the test data.
This is not an exact replication of Dave's setup. The configuration
scales its parameters depending on the memory size of the SUT to behave
similarly across machines. The parameters mean the first sample
reported by fs_mark is using 50% of RAM which will barely be throttled
and look like a big outlier. Dave used fake NUMA to have multiple
kswapd instances which I didn't replicate. Finally, the number of
iterations differ from Dave's test as the target disk was not large
enough. While not identical, it should be representative.
fsmark
                                 5.3.0-rc3              5.3.0-rc3
                                   vanilla          shrinker-v1r1
Min       1-files/sec   4444.80 (   0.00%)     4765.60 (   7.22%)
1st-qrtle 1-files/sec   5005.10 (   0.00%)     5091.70 (   1.73%)
2nd-qrtle 1-files/sec   4917.80 (   0.00%)     4855.60 (  -1.26%)
3rd-qrtle 1-files/sec   4667.40 (   0.00%)     4831.20 (   3.51%)
Max-1     1-files/sec  11421.50 (   0.00%)     9999.30 ( -12.45%)
Max-5     1-files/sec  11421.50 (   0.00%)     9999.30 ( -12.45%)
Max-10    1-files/sec  11421.50 (   0.00%)     9999.30 ( -12.45%)
Max-90    1-files/sec   4649.60 (   0.00%)     4780.70 (   2.82%)
Max-95    1-files/sec   4491.00 (   0.00%)     4768.20 (   6.17%)
Max-99    1-files/sec   4491.00 (   0.00%)     4768.20 (   6.17%)
Max       1-files/sec  11421.50 (   0.00%)     9999.30 ( -12.45%)
Hmean     1-files/sec   5004.75 (   0.00%)     5075.96 (   1.42%)
Stddev    1-files/sec   1778.70 (   0.00%)     1369.66 (  23.00%)
CoeffVar  1-files/sec     33.70 (   0.00%)       26.05 (  22.71%)
BHmean-99 1-files/sec   5053.72 (   0.00%)     5101.52 (   0.95%)
BHmean-95 1-files/sec   5053.72 (   0.00%)     5101.52 (   0.95%)
BHmean-90 1-files/sec   5107.05 (   0.00%)     5131.41 (   0.48%)
BHmean-75 1-files/sec   5208.45 (   0.00%)     5206.68 (  -0.03%)
BHmean-50 1-files/sec   5405.53 (   0.00%)     5381.62 (  -0.44%)
BHmean-25 1-files/sec   6179.75 (   0.00%)     6095.14 (  -1.37%)

                  5.3.0-rc3     5.3.0-rc3
                    vanilla shrinker-v1r1
Duration User        501.82        497.29
Duration System     4401.44       4424.08
Duration Elapsed    8124.76       8358.05
This shows a slight skew for the max result, which represents a large
outlier. The 1st, 2nd and 3rd quartiles are similar, indicating that
the bulk of the results show little difference. Note that an earlier
version of the fsmark configuration showed a regression but that
included more samples taken while memory was still filling.
Note that the elapsed time is higher. Part of this is that the
configuration included time to delete all the test files when the test
completes -- the test automation handles the possibility of testing
fsmark with multiple thread counts. Without the patch, many of these
objects would be memory resident which is part of what the patch is
addressing.
There are other important observations that justify the patch.
1. With the vanilla kernel, the number of dirty pages in the system is
very low for much of the test. With this patch, dirty pages are
generally kept at 10%, which matches vm.dirty_background_ratio and
is the normal expected historical behaviour.
2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
0.95 for much of the test i.e. Slab is being left alone and
dominating memory consumption. With the patch applied, the ratio
varies between 0.35 and 0.45 with the bulk of the measured ratios
roughly half way between those values. This is a different balance to
what Dave reported but it was at least consistent.
3. Slabs are scanned throughout the entire test with the patch applied.
The vanilla kernel has periods with no scan activity and then
relatively massive spikes.
4. Without the patch, kswapd scan rates are very variable. With the
patch, the scan rates remain quite steady.
5. Overall vmstats are closer to normal expectations
                                5.3.0-rc3     5.3.0-rc3
                                  vanilla shrinker-v1r1
Ops Direct pages scanned         99388.00     328410.00
Ops Kswapd pages scanned      45382917.00   33451026.00
Ops Kswapd pages reclaimed    30869570.00   25239655.00
Ops Direct pages reclaimed       74131.00       5830.00
Ops Kswapd efficiency %             68.02         75.45
Ops Kswapd velocity               5585.75       4002.25
Ops Page reclaim immediate     1179721.00     430927.00
Ops Slabs scanned             62367361.00   73581394.00
Ops Direct inode steals           2103.00       1002.00
Ops Kswapd inode steals         570180.00    5183206.00
o Vanilla kernel is hitting direct reclaim more frequently,
not very much in absolute terms but the fact the patch
reduces it is interesting
o "Page reclaim immediate" in the vanilla kernel indicates
dirty pages are being encountered at the tail of the LRU.
This is generally bad and means in this case that the LRU
is not long enough for dirty pages to be cleaned by the
background flush in time. This is much reduced by the
patch.
o With the patch, kswapd is reclaiming 10 times more slab
pages than with the vanilla kernel. This is indicative
of the watermark boosting over-protecting slab
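As a cross-check of the vmstat table, the two derived rows can be recomputed
from the raw counters. The formulas below are an assumption (they are not
spelled out in this changelog), but they reproduce the reported efficiency
and velocity values when fed the "Kswapd pages" rows and the "Duration
Elapsed" times. The function names are hypothetical.

```c
/* Percentage of pages scanned by kswapd that were actually reclaimed. */
static double kswapd_efficiency(double reclaimed, double scanned)
{
	return 100.0 * reclaimed / scanned;
}

/* Pages scanned by kswapd per second of elapsed test time. */
static double kswapd_velocity(double scanned, double elapsed_sec)
{
	return scanned / elapsed_sec;
}
```

For example, kswapd_efficiency(30869570, 45382917) is about 68.02 for the
vanilla kernel and kswapd_velocity(33451026, 8358.05) is about 4002.25 with
the patch, matching the table.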
A more complete set of tests was run; these were part of the basis for
introducing boosting and, while there are some differences, they are well
within tolerances.
Bottom line, with kswapd special-cased to avoid slab, reclaim behaviour is
unpredictable and can lead to abnormal results for normal workloads.
This patch restores the expected behaviour that slab and page cache are
balanced consistently for a workload with a steady allocation ratio of
slab/pagecache pages. It also means that workloads that favour the
preservation of slab over pagecache can be tuned via
vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
the parameter when boosting is active.
Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
Fixes: 1c30844d2d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 2f0799a0ff ("mm, thp: restore node-local
hugepage allocations").
commit 2f0799a0ff was rightfully applied to avoid the risk of a
severe regression that was reported by the kernel test robot at the end
of the merge window. We now understand that the regression was a false
positive caused by a significant increase in fairness during a
swap thrashing benchmark. So it's safe to re-apply the fix and continue
improving the code from there. The benchmark that reported the
regression is very useful, but it provides a meaningful result only when
there is no significant alteration in fairness during the workload. The
removal of __GFP_THISNODE increased fairness.
__GFP_THISNODE cannot be used in the generic page faults path for new
memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
behavior significantly deviates from what the MPOL_DEFAULT semantics are
supposed to be for THP and 4k allocations alike.
Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
set to "madvise") was never meant to provide an implicit MPOL_BIND on
the "current" node the task is running on, causing swap storms and
providing a much more aggressive behavior than even zone_reclaim_mode =
3.
Any workload that could have benefited from __GFP_THISNODE now has to
enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
the zone_reclaim_mode behavior, but it only did so if THP was enabled:
if THP was disabled, there would have been no chance to get any 4k page
from the current node if the current node was full of pagecache, which
further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
semantics. In fact the two are orthogonal: zone_reclaim_mode = 1|2|3
must work exactly the same with MADV_HUGEPAGE set or not.
The performance characteristic of memory depends on the hardware
details. The numbers below are obtained on Naples/EPYC architecture and
the N/A projection extends them to show what we should aim for in the
future as a good THP NUMA locality default. The benchmark used
exercises random memory seeks (note: the cost of the page faults is not
part of the measurement).
D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
D0 means distance zero (i.e. local memory), D1 means distance one (i.e.
intra socket memory), D2 means distance two (i.e. inter socket memory),
etc...
For the guest physical memory allocated by qemu and for guest mode
kernel the performance characteristic of RAM is more complex and an
ideal default could be:
D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
NOTE: the N/A are projections and haven't been measured yet, the
measurement in this case is done on a 1950x with only two NUMA nodes.
The THP case here means THP was used both in the host and in the guest.
After applying this commit the THP NUMA locality order that we'll get
out of MADV_HUGEPAGE is this:
D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
Before this commit it was:
D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
Even if we ignore the breakage of large workloads that can't fit in a
single node that the __GFP_THISNODE implicit "current node" mbind
caused, the THP NUMA locality order provided by __GFP_THISNODE was still
not the one we should aim for in the long term (i.e. the first one at
the top).
After this commit is applied, we can introduce a new multi-order
allocator API and replace those two alloc_pages_vma calls in the page
fault path with a single multi-order call:
	unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
	page = alloc_pages_multi_order(..., &order);
	if (!page)
		goto out;
	if (!(order & (1 << 0))) {
		VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
		/* THP fault */
	} else {
		VM_WARN_ON(order != 1 << 0);
		/* 4k fallback */
	}
The page allocator logic has to be altered so that when it fails on any
zone with order 9, it tries again with order 0 before falling
back to the next zone in the zonelist.
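The per-zone fallback described above can be illustrated with a toy
user-space model. All names here (`model_zone`, `model_alloc_multi_order`)
are hypothetical, zones are reduced to two free counters, and the function
reports the satisfied order as a plain number rather than the bitmask used
in the proposed API. It is a sketch of the ordering only, not an allocator.

```c
#include <stddef.h>

#define MODEL_THP_ORDER 9	/* the THP order mentioned in the text */

/* Toy zone: free counters for the two orders of interest. */
struct model_zone {
	unsigned int free_thp;	/* free order-9 blocks */
	unsigned int free_4k;	/* free order-0 pages */
};

/*
 * Walk the zonelist nearest-first. In each zone try order 9, then
 * order 0, before moving on to the next zone. Returns the index of the
 * zone used and stores the satisfied order in *order, or returns -1 if
 * every zone is exhausted.
 */
static int model_alloc_multi_order(struct model_zone *zones, size_t nr_zones,
				   unsigned int *order)
{
	for (size_t i = 0; i < nr_zones; i++) {
		if (zones[i].free_thp > 0) {
			zones[i].free_thp--;
			*order = MODEL_THP_ORDER;
			return (int)i;
		}
		if (zones[i].free_4k > 0) {
			zones[i].free_4k--;
			*order = 0;
			return (int)i;
		}
	}
	return -1;
}
```

With zones sorted by NUMA distance, this walk yields the locality order
"D0 THP | D0 4k | D1 THP | D1 4k | ...".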
After that we need to do more measurements and evaluate if adding an
opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
with "DN+1 THP | DN 4k" at every NUMA distance crossing.
Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
The fixes for what was originally reported as "pathological THP
behavior" were rightfully reverted to be sure not to introduce
regressions at the end of a merge window after a severe regression report
from the kernel bot. We can safely re-apply them now that we have had time
to analyze the problem.
The mm process worked fine, because the good fixes were eventually
committed upstream without excessive delay.
The regression reported by the kernel bot however forced us to revert
the good fixes to be sure not to introduce regressions and to give us
the time to analyze the issue further. The silver lining is that this
extra time allowed us to think more about this issue and also plan for a
future direction to improve things further in terms of THP NUMA
locality.
This patch (of 2):
This reverts commit 356ff8a9a7 ("Revert "mm, thp: consolidate THP
gfp handling into alloc_hugepage_direct_gfpmask""). So it reapplies
89c83fb539 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask").
Consolidation of the THP allocation flags in one place was meant to
be a cleanup that makes otherwise scattered code, which imposes a
maintenance burden, easier to handle. There were no real problems observed
with the gfp mask consolidation, but the reversion was rushed through
without a larger consensus regardless.
This patch brings the consolidation back because it should make
long term maintenance easier and future changes less error prone.
[mhocko@kernel.org: changelog additions]
Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A compiler throws a warning on an arm64 system since commit 9849a5697d
("arch, mm: convert all architectures to use 5level-fixup.h"),
mm/kasan/init.c: In function 'kasan_free_p4d':
mm/kasan/init.c:344:9: warning: variable 'p4d' set but not used [-Wunused-but-set-variable]
p4d_t *p4d;
^~~
because p4d_none() in "5level-fixup.h" is compiled away while it is a
static inline function in "pgtable-nopud.h".
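The difference between the two definitions can be reproduced in a small
stand-alone sketch. The `model_` names and the `int *` stand-in for `p4d_t *`
are assumptions for illustration; the macro form mirrors a definition that
ignores its argument, and the inline form mirrors a function that takes it.
Compiling with gcc and -Wunused-but-set-variable (part of -Wall) should warn
for the first function only.

```c
/* Macro variant: the argument is dropped by the preprocessor, so passing
 * a variable to it does not count as a use of that variable. */
#define model_p4d_none_macro(p4d) 0

/* Inline variant: passing a variable as an argument is a use. */
static inline int model_p4d_none_inline(const int *p4d)
{
	(void)p4d;
	return 0;
}

static int walk_with_macro(int *entry)
{
	int *p4d;

	p4d = entry;				/* set ... */
	return model_p4d_none_macro(p4d);	/* ... but compiled away */
}

static int walk_with_inline(int *entry)
{
	int *p4d;

	p4d = entry;
	return model_p4d_none_inline(p4d);	/* used as an argument */
}
```

Both functions return 0 at run time; only the diagnostics differ.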
However, if p4d_none() were converted to a static inline there, powerpc
would be unhappy as it reads those in assembler language in
"arch/powerpc/include/asm/book3s/64/pgtable.h", so the static inline C
function has to be skipped when the header is included from assembly.
While at it, convert a few similar functions to be consistent with the
ones in "pgtable-nopud.h".
Link: http://lkml.kernel.org/r/20190806232917.881-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If you use lseek or similar (e.g. pread) to access a location in a
seq_file file that is within a record, rather than at a record boundary,
then the first read will return the remainder of the record, and the
second read will return the whole of that same record (instead of the
next record). When seeking to a record boundary, the next record is
correctly returned.
This bug was introduced by a recent patch (identified below). Before
that patch, seq_read() would increment m->index when the last of the
buffer was returned (m->count == 0). After that patch, we rely on
->next to increment m->index after filling the buffer - but there was
one place where that didn't happen.
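The symptom can be illustrated with a toy model of a record iterator. This
is not the kernel's seq_file code: the `model_` names are hypothetical, the
records are fixed strings, and the `fixed` flag simply selects whether the
index is advanced after a mid-record seek.

```c
#include <stddef.h>
#include <string.h>

static const char *const model_records[] = { "aaaa\n", "bbbb\n", "cccc\n" };
#define MODEL_NR_RECORDS (sizeof(model_records) / sizeof(model_records[0]))

struct model_seq {
	size_t index;	/* next record a sequential read will emit */
};

/*
 * Position the reader at byte `offset` and return the rest of the record
 * containing it (NULL past the end). With fixed == 0 the index is not
 * advanced when the offset falls strictly inside a record, which models
 * the reported bug; seeking to a record boundary always advances it.
 */
static const char *model_seek(struct model_seq *m, size_t offset, int fixed)
{
	size_t pos = 0;

	for (size_t i = 0; i < MODEL_NR_RECORDS; i++) {
		size_t len = strlen(model_records[i]);

		if (pos + len > offset) {
			m->index = (fixed || offset == pos) ? i + 1 : i;
			return model_records[i] + (offset - pos);
		}
		pos += len;
	}
	m->index = MODEL_NR_RECORDS;
	return NULL;
}

/* Sequential read: emit the record at index and advance. */
static const char *model_next(struct model_seq *m)
{
	if (m->index >= MODEL_NR_RECORDS)
		return NULL;
	return model_records[m->index++];
}
```

In the buggy mode, seeking to offset 2 returns the tail of the first record
and the following read returns the whole first record again; in the fixed
mode the following read returns the second record.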
Link: https://lkml.kernel.org/lkml/877e7xl029.fsf@notabene.neil.brown.name/
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Sergei Turchanov <turchanov@farpost.com>
Tested-by: Sergei Turchanov <turchanov@farpost.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Markus Elfring <Markus.Elfring@web.de>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memcg counters for shadow nodes are broken because the memcg pointer is
obtained in the wrong way. The following approach is used:
virt_to_page(xa_node)->mem_cgroup
Since commit 4d96ba3530 ("mm: memcg/slab: stop setting
page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
set for slab pages, so memcg_from_slab_page() should be used instead.
Also, I doubt that it ever worked correctly: virt_to_head_page() should
be used instead of virt_to_page(). Otherwise objects residing on tail
pages are not accounted, because only the head page contains a valid
mem_cgroup pointer. That has been the case since the introduction of these
counters by commit 68d48e6a2d ("mm: workingset: add vmstat counter
for shadow nodes").
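The head-versus-tail point can be shown with a toy model of a compound
page. The `model_` types and helpers are hypothetical stand-ins for
struct page, virt_to_page() and virt_to_head_page(); the sketch covers only
this second issue, not the memcg_from_slab_page() change.

```c
#include <stddef.h>

struct model_memcg {
	int id;
};

/* Toy page: tail pages point at their head; only the head carries memcg. */
struct model_page {
	struct model_page *head;	/* points to itself for a head page */
	struct model_memcg *mem_cgroup;	/* valid on the head page only */
};

/* Analogue of virt_to_page(): the page the object physically sits in. */
static struct model_page *model_to_page(struct model_page *pages, size_t pfn)
{
	return &pages[pfn];
}

/* Analogue of virt_to_head_page(): always resolve to the compound head. */
static struct model_page *model_to_head_page(struct model_page *pages,
					     size_t pfn)
{
	return pages[pfn].head;
}
```

For an object on a tail page, the first lookup finds no memcg pointer while
the second finds the one stored on the head page, which is why tail-page
objects went unaccounted.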
Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>