linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-18 03:46:46 +07:00

Author	SHA1	Message	Date
Baolin Wang	39232ed5a1	time: Introduce one suspend clocksource to compensate the suspend time On some hardware with multiple clocksources, we have coarse grained clocksources that support the CLOCK_SOURCE_SUSPEND_NONSTOP flag, but which are less than ideal for timekeeping whereas other clocksources can be better candidates but halt on suspend. Currently, the timekeeping core only supports timing suspend using CLOCK_SOURCE_SUSPEND_NONSTOP clocksources if that clocksource is the current clocksource for timekeeping. As a result, some architectures try to implement read_persistent_clock64() using those non-stop clocksources, but isn't really ideal, which will introduce more duplicate code. To fix this, provide logic to allow a registered SUSPEND_NONSTOP clocksource, which isn't the current clocksource, to be used to calculate the suspend time. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Baolin Wang <baolin.wang@linaro.org> [jstultz: minor tweaks to merge with previous resume changes] Signed-off-by: John Stultz <john.stultz@linaro.org>	2018-07-19 17:08:52 -07:00
Mukesh Ojha	f473e5f467	time: Fix extra sleeptime injection when suspend fails Currently, there exists a corner case assuming when there is only one clocksource e.g RTC, and system failed to go to suspend mode. While resume rtc_resume() injects the sleeptime as timekeeping_rtc_skipresume() returned 'false' (default value of sleeptime_injected) due to which we can see mismatch in timestamps. This issue can also come in a system where more than one clocksource are present and very first suspend fails. Success case: ------------ {sleeptime_injected=false} rtc_suspend() => timekeeping_suspend() => timekeeping_resume() => (sleeptime injected) rtc_resume() Failure case: ------------ {failure in sleep path} {sleeptime_injected=false} rtc_suspend() => rtc_resume() {sleeptime injected again which was not required as the suspend failed} Fix this by handling the boolean logic properly. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <sboyd@kernel.org> Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: John Stultz <john.stultz@linaro.org>	2018-07-19 17:08:51 -07:00
Ondrej Mosnacek	985e695074	timekeeping/ntp: Constify some function arguments Add 'const' to some function arguments and variables to make it easier to read the code. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> [jstultz: Also fixup pre-existing checkpatch warnings for prototype arguments with no variable name] Signed-off-by: John Stultz <john.stultz@linaro.org>	2018-07-19 17:08:05 -07:00
Pavel Tatashin	46457ea464	sched/clock: Use static key for sched_clock_running sched_clock_running may be read every time sched_clock_cpu() is called. Yet, this variable is updated only twice during boot, and never changes again, therefore it is better to make it a static key. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-25-pasha.tatashin@oracle.com	2018-07-20 00:02:43 +02:00
Pavel Tatashin	857baa87b6	sched/clock: Enable sched clock early Allow sched_clock() to be used before schec_clock_init() is called. This provides a way to get early boot timestamps on machines with unstable clocks. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-24-pasha.tatashin@oracle.com	2018-07-20 00:02:43 +02:00
Pavel Tatashin	5d2a4e91a5	sched/clock: Move sched clock initialization and merge with generic clock sched_clock_postinit() initializes a generic clock on systems where no other clock is provided. This function may be called only after timekeeping_init(). Rename sched_clock_postinit to generic_clock_inti() and call it from sched_clock_init(). Move the call for sched_clock_init() until after time_init(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-23-pasha.tatashin@oracle.com	2018-07-20 00:02:43 +02:00
Pavel Tatashin	4b1b7f8054	timekeeping: Default boot time offset to local_clock() read_persistent_wall_and_boot_offset() is called during boot to read both the persistent clock and also return the offset between the boot time and the value of persistent clock. Change the default boot_offset from zero to local_clock() so architectures, that do not have a dedicated boot_clock but have early sched_clock(), such as SPARCv9, x86, and possibly more will benefit from this change by getting a better and more consistent estimate of the boot time without need for an arch specific implementation. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-17-pasha.tatashin@oracle.com	2018-07-20 00:02:41 +02:00
Pavel Tatashin	3eca993740	timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset() If architecture does not support exact boot time, it is challenging to estimate boot time without having a reference to the current persistent clock value. Yet, it cannot read the persistent clock time again, because this may lead to math discrepancies with the caller of read_boot_clock64() who have read the persistent clock at a different time. This is why it is better to provide two values simultaneously: the persistent clock value, and the boot time. Replace read_boot_clock64() with: read_persistent_wall_and_boot_offset(wall_time, boot_offset) Where wall_time is returned by read_persistent_clock() And boot_offset is wall_time - boot time, which defaults to 0. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-16-pasha.tatashin@oracle.com	2018-07-20 00:02:40 +02:00
Ondrej Mosnacek	86b2dcd4f0	ntp: Use kstrtos64 for s64 variable ...instead of kstrtol with a dirty cast. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2018-07-19 14:58:37 -07:00
Ondrej Mosnacek	0f9987b63d	ntp: Remove redundant arguments The 'ts' argument of process_adj_status() and process_adjtimex_modes() is unused and can be safely removed. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2018-07-19 14:58:29 -07:00
Yi Wang	3058758925	timer: Fix coding style The call to wake_up_nohz_cpu() is incorrectly indented. Remove the surplus TAB. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: john.stultz@linaro.org Cc: sboyd@kernel.org Cc: zhong.weidong@zte.com.cn CC: Anna-Maria Gleixner <anna-maria@linutronix.de> Link: https://lkml.kernel.org/r/1531721337-30284-1-git-send-email-wang.yi59@zte.com.cn	2018-07-19 16:52:40 +02:00
Linus Torvalds	024ddc0ce1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Lots of fixes, here goes: 1) NULL deref in qtnfmac, from Gustavo A. R. Silva. 2) Kernel oops when fw download fails in rtlwifi, from Ping-Ke Shih. 3) Lost completion messages in AF_XDP, from Magnus Karlsson. 4) Correct bogus self-assignment in rhashtable, from Rishabh Bhatnagar. 5) Fix regression in ipv6 route append handling, from David Ahern. 6) Fix masking in __set_phy_supported(), from Heiner Kallweit. 7) Missing module owner set in x_tables icmp, from Florian Westphal. 8) liquidio's timeouts are HZ dependent, fix from Nicholas Mc Guire. 9) Link setting fixes for sh_eth and ravb, from Vladimir Zapolskiy. 10) Fix NULL deref when using chains in act_csum, from Davide Caratti. 11) XDP_REDIRECT needs to check if the interface is up and whether the MTU is sufficient. From Toshiaki Makita. 12) Net diag can do a double free when killing TCP_NEW_SYN_RECV connections, from Lorenzo Colitti. 13) nf_defrag in ipv6 can unnecessarily hold onto dst entries for a full minute, delaying device unregister. From Eric Dumazet. 14) Update MAC entries in the correct order in ixgbe, from Alexander Duyck. 15) Don't leave partial mangles bpf program in jit_subprogs, from Daniel Borkmann. 16) Fix pfmemalloc SKB state propagation, from Stefano Brivio. 17) Fix ACK handling in DCTCP congestion control, from Yuchung Cheng. 18) Use after free in tun XDP_TX, from Toshiaki Makita. 19) Stale ipv6 header pointer in ipv6 gre code, from Prashant Bhole. 20) Don't reuse remainder of RX page when XDP is set in mlx4, from Saeed Mahameed. 21) Fix window probe handling of TCP rapair sockets, from Stefan Baranoff. 22) Missing socket locking in smc_ioctl(), from Ursula Braun. 23) IPV6_ILA needs DST_CACHE, from Arnd Bergmann. 24) Spectre v1 fix in cxgb3, from Gustavo A. R. Silva. 25) Two spots in ipv6 do a rol32() on a hash value but ignore the result. Fixes from Colin Ian King" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (176 commits) tcp: identify cryptic messages as TCP seq # bugs ptp: fix missing break in switch hv_netvsc: Fix napi reschedule while receive completion is busy MAINTAINERS: Drop inactive Vitaly Bordug's email net: cavium: Add fine-granular dependencies on PCI net: qca_spi: Fix log level if probe fails net: qca_spi: Make sure the QCA7000 reset is triggered net: qca_spi: Avoid packet drop during initial sync ipv6: fix useless rol32 call on hash ipv6: sr: fix useless rol32 call on hash net: sched: Using NULL instead of plain integer net: usb: asix: replace mii_nway_restart in resume path net: cxgb3_main: fix potential Spectre v1 lib/rhashtable: consider param->min_size when setting initial table size net/smc: reset recv timeout after clc handshake net/smc: add error handling for get_user() net/smc: optimize consumer cursor updates net/nfc: Avoid stalls when nfc_alloc_send_skb() returned NULL. ipv6: ila: select CONFIG_DST_CACHE net: usb: rtl8150: demote allmulti message to dev_dbg() ...	2018-07-18 19:32:54 -07:00
Ronny Chevalier	baa2a4fdd5	audit: fix use-after-free in audit_add_watch audit_add_watch stores locally krule->watch without taking a reference on watch. Then, it calls audit_add_to_parent, and uses the watch stored locally. Unfortunately, it is possible that audit_add_to_parent updates krule->watch. When it happens, it also drops a reference of watch which could free the watch. How to reproduce (with KASAN enabled): auditctl -w /etc/passwd -F success=0 -k test_passwd auditctl -w /etc/passwd -F success=1 -k test_passwd2 The second call to auditctl triggers the use-after-free, because audit_to_parent updates krule->watch to use a previous existing watch and drops the reference to the newly created watch. To fix the issue, we grab a reference of watch and we release it at the end of the function. Signed-off-by: Ronny Chevalier <ronny.chevalier@hp.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-07-18 11:43:36 -04:00
Jakub Kicinski	fd4f227dea	bpf: offload: allow program and map sharing per-ASIC Allow programs and maps to be re-used across different netdevs, as long as they belong to the same struct bpf_offload_dev. Update the bpf_offload_prog_map_match() helper for the verifier and export a new helper for the drivers to use when checking programs at attachment time. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-18 15:10:34 +02:00
Jakub Kicinski	602144c224	bpf: offload: keep the offload state per-ASIC Create a higher-level entity to represent a device/ASIC to allow programs and maps to be shared between device ports. The extra work is required to make sure we don't destroy BPF objects as soon as the netdev for which they were loaded gets destroyed, as other ports may still be using them. When netdev goes away all of its BPF objects will be moved to other netdevs of the device, and only destroyed when last netdev is unregistered. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-18 15:10:34 +02:00
Jakub Kicinski	9fd7c55591	bpf: offload: aggregate offloads per-device Currently we have two lists of offloaded objects - programs and maps. Netdevice unregister notifier scans those lists to orphan objects associated with device being unregistered. This puts unnecessary (even if negligible) burden on all netdev unregister calls in BPF- -enabled kernel. The lists of objects may potentially get long making the linear scan even more problematic. There haven't been complaints about this mechanisms so far, but it is suboptimal. Instead of relying on notifiers, make the few BPF-capable drivers register explicitly for BPF offloads. The programs and maps will now be collected per-device not on a global list, and only scanned for removal when driver unregisters from BPF offloads. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-18 15:10:34 +02:00
Jakub Kicinski	09728266b6	bpf: offload: rename bpf_offload_dev_match() to bpf_offload_prog_map_match() A set of new API functions exported for the drivers will soon use 'bpf_offload_dev_' as a prefix. Rename the bpf_offload_dev_match() which is internal to the core (used by the verifier) to avoid any confusion. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-18 15:10:34 +02:00
Colin Ian King	c23e014a4b	bpf: sockmap: remove redundant pointer sg Pointer sg is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'sg' set but not used [-Wunused-but-set-variable] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-18 15:06:24 +02:00
Roman Gushchin	3960f4fd65	bpf: fix rcu annotations in compute_effective_progs() The progs local variable in compute_effective_progs() is marked as __rcu, which is not correct. This is a local pointer, which is initialized by bpf_prog_array_alloc(), which also now returns a generic non-rcu pointer. The real rcu-protected pointer is *array (array is a pointer to an RCU-protected pointer), so the assignment should be performed using rcu_assign_pointer(). Fixes: `324bda9e6c` ("bpf: multi program support for cgroup+bpf") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-18 15:01:54 +02:00
Roman Gushchin	d29ab6e1fa	bpf: bpf_prog_array_alloc() should return a generic non-rcu pointer Currently the return type of the bpf_prog_array_alloc() is struct bpf_prog_array __rcu , which is not quite correct. Obviously, the returned pointer is a generic pointer, which is valid for an indefinite amount of time and it's not shared with anyone else, so there is no sense in marking it as __rcu. This change eliminate the following sparse warnings: kernel/bpf/core.c:1544:31: warning: incorrect type in return expression (different address spaces) kernel/bpf/core.c:1544:31: expected struct bpf_prog_array [noderef] <asn:4> kernel/bpf/core.c:1544:31: got void * kernel/bpf/core.c:1548:17: warning: incorrect type in return expression (different address spaces) kernel/bpf/core.c:1548:17: expected struct bpf_prog_array [noderef] <asn:4>* kernel/bpf/core.c:1548:17: got struct bpf_prog_array <noident> kernel/bpf/core.c:1681:15: warning: incorrect type in assignment (different address spaces) kernel/bpf/core.c:1681:15: expected struct bpf_prog_array array kernel/bpf/core.c:1681:15: got struct bpf_prog_array [noderef] <asn:4>* Fixes: `324bda9e6c` ("bpf: multi program support for cgroup+bpf") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-18 15:01:20 +02:00
Paul Moore	290e44b7dd	audit: use ktime_get_coarse_real_ts64() for timestamps Commit `c72051d577` ("audit: use ktime_get_coarse_ts64() for time access") converted audit's use of current_kernel_time64() to the new ktime_get_coarse_ts64() function. Unfortunately this resulted in incorrect timestamps, e.g. events stamped with the year 1969 despite it being 2018. This patch corrects this by using ktime_get_coarse_real_ts64() just like the current_kernel_time64() wrapper. Fixes: `c72051d577` ("audit: use ktime_get_coarse_ts64() for time access") Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-07-17 14:45:08 -04:00
Linus Torvalds	3c53776e29	Mark HI and TASKLET softirq synchronous Way back in 4.9, we committed `4cd13c21b2` ("softirq: Let ksoftirqd do its job"), and ever since we've had small nagging issues with it. For example, we've had: `1ff688209e` ("watchdog: core: make sure the watchdog_worker is not deferred") `8d5755b3f7` ("watchdog: softdog: fire watchdog even if softirqs do not get to run") `217f697436` ("net: busy-poll: allow preemption in sk_busy_loop()") all of which worked around some of the effects of that commit. The DVB people have also complained that the commit causes excessive USB URB latencies, which seems to be due to the USB code using tasklets to schedule USB traffic. This seems to be an issue mainly when already living on the edge, but waiting for ksoftirqd to handle it really does seem to cause excessive latencies. Now Hanna Hawa reports that this issue isn't just limited to USB URB and DVB, but also causes timeout problems for the Marvell SoC team: "I'm facing kernel panic issue while running raid 5 on sata disks connected to Macchiatobin (Marvell community board with Armada-8040 SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2) uses a tasklet to clean the done descriptors from the queue" The latency problem causes a panic: mv_xor_v2 f0400000.xor: dma_sync_wait: timeout! Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction We've discussed simply just reverting the original commit entirely, and also much more involved solutions (with per-softirq threads etc). This patch is intentionally stupid and fairly limited, because the issue still remains, and the other solutions either got sidetracked or had other issues. We should probably also consider the timer softirqs to be synchronous and not be delayed to ksoftirqd (since they were the issue with the earlier watchdog problems), but that should be done as a separate patch. This does only the tasklet cases. Reported-and-tested-by: Hanna Hawa <hannah@marvell.com> Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at> Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-07-17 11:12:43 -07:00
Masahiro Yamada	c417fbce98	kbuild: move bin2c back to scripts/ from scripts/basic/ Commit `8370edea81` ("bin2c: move bin2c in scripts/basic") moved bin2c to the scripts/basic/ directory, incorrectly stating "Kexec wants to use bin2c and it wants to use it really early in the build process. See arch/x86/purgatory/ code in later patches." Commit `bdab125c93` ("Revert "kexec/purgatory: Add clean-up for purgatory directory"") and commit `d6605b6bbe` ("x86/build: Remove unnecessary preparation for purgatory") removed the redundant purgatory build magic entirely. That means that the move of bin2c was unnecessary in the first place. fixdep is the only host program that deserves to sit in the scripts/basic/ directory. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>	2018-07-18 01:18:05 +09:00
RAGHU Halharvi	d91cfeb0aa	genirq: Remove redundant NULL pointer check in __free_irq() The NULL pointer check in __free_irq() triggers a 'dereference before NULL pointer check' warning in static code analysis. It turns out that the check is redundant because all callers have a NULL pointer check already. Remove it. Signed-off-by: RAGHU Halharvi <raghuhack78@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20180717102009.7708-1-raghuhack78@gmail.com	2018-07-17 13:35:44 +02:00
Rik van Riel	c1a2f7f0c0	mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids The mm_struct always contains a cpumask bitmap, regardless of CONFIG_CPUMASK_OFFSTACK. That means the first step can be to simplify things, and simply have one bitmask at the end of the mm_struct for the mm_cpumask. This does necessitate moving everything else in mm_struct into an anonymous sub-structure, which can be randomized when struct randomization is enabled. The second step is to determine the correct size for the mm_struct slab object from the size of the mm_struct (excluding the CPU bitmap) and the size the cpumask. For init_mm we can simply allocate the maximum size this kernel is compiled for, since we only have one init_mm in the system, anyway. Pointer magic by Mike Galbraith, to evade -Wstringop-overflow getting confused by the dynamically sized array. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-17 09:35:30 +02:00
Andrea Parri	7696f9910a	sched/Documentation: Update wake_up() & co. memory-barrier guarantees Both the implementation and the users' expectation [1] for the various wakeup primitives have evolved over time, but the documentation has not kept up with these changes: brings it into 2018. [1] http://lkml.kernel.org/r/20180424091510.GB4064@hirez.programming.kicks-ass.net Also applied feedback from Alan Stern. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Akira Yokosawa <akiyks@gmail.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Daniel Lustig <dlustig@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Jade Alglave <j.alglave@ucl.ac.uk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luc Maranget <luc.maranget@inria.fr> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arch@vger.kernel.org Cc: parri.andrea@gmail.com Link: http://lkml.kernel.org/r/20180716180605.16115-12-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-17 09:30:34 +02:00
Andrea Parri	3d85b27037	locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock() There are 11 interpretations of the requirements described in the header comment for smp_mb__after_spinlock(): one for each LKMM maintainer, and one currently encoded in the Cat file. Stick to the latter (until a more satisfactory solution is available). This also reworks some snippets related to the barrier to illustrate the requirements and to link them to the idioms which are relied upon at its call sites. Suggested-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: akiyks@gmail.com Cc: dhowells@redhat.com Cc: j.alglave@ucl.ac.uk Cc: linux-arch@vger.kernel.org Cc: luc.maranget@inria.fr Cc: npiggin@gmail.com Cc: parri.andrea@gmail.com Cc: stern@rowland.harvard.edu Link: http://lkml.kernel.org/r/20180716180605.16115-11-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-17 09:30:33 +02:00
Andrea Parri	76e079fefc	sched/core: Use smp_mb() in wake_woken_function() wake_woken_function() synchronizes with wait_woken() as follows: [wait_woken] [wake_woken_function] entry->flags &= ~wq_flag_woken; condition = true; smp_mb(); smp_wmb(); if (condition) wq_entry->flags \|= wq_flag_woken; break; This commit replaces the above smp_wmb() with an smp_mb() in order to guarantee that either wait_woken() sees the wait condition being true or the store to wq_entry->flags in woken_wake_function() follows the store in wait_woken() in the coherence order (so that the former can eventually be observed by wait_woken()). The commit also fixes a comment associated to set_current_state() in wait_woken(): the comment pairs the barrier in set_current_state() to the above smp_wmb(), while the actual pairing involves the barrier in set_current_state() and the barrier executed by the try_to_wake_up() in wake_woken_function(). Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akiyks@gmail.com Cc: boqun.feng@gmail.com Cc: dhowells@redhat.com Cc: j.alglave@ucl.ac.uk Cc: linux-arch@vger.kernel.org Cc: luc.maranget@inria.fr Cc: npiggin@gmail.com Cc: parri.andrea@gmail.com Cc: stern@rowland.harvard.edu Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-17 09:30:33 +02:00
Ingo Molnar	52b544bd38	Linux 4.18-rc5 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltLpVUeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWisH/ikONMwV7OrSk36Y 5rxzTFUoBk0Qffct88gtSNuRVCxaVb1ofCndvFJE6A6HfJkWpbBzH6eq90aakmJi f7uFcu4YmsQpeQaf9lpftWmY2vDf2fIadVTV0RnSMXks57wMax1cpBe7LJGpz13e f+g5XRVs1MdlZVtr6tG2SU3Y5AqVVVsYe/0DBPonEqeh9/JJbPFCuNkFOxxzAqPu VTnjyoOqG8qtZzjklNtR5rZn0Gv592tWX36eiWTQdThNmVFkGEAJwsHCQlY4OQYK 61QN4UhOHiu8e1ZuGDNEDhNVRnKtaaYUPFeWL1wLRW73ul4P3ZkpvpS8QTMwcFJI JjzNOkI= =ckcO -----END PGP SIGNATURE----- Merge tag 'v4.18-rc5' into locking/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-17 09:27:43 +02:00
Ingo Molnar	ea73a5c692	Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - An optimization and a fix for RCU expedited grace periods, with the fix being from Boqun Feng. - Miscellaneous fixes, including a lockdep-annotation fix from Boqun Feng. - SRCU updates. - Updates to rcutorture and associated scripting. - Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt, and RCU-sched flavors, replacing the old ->gpnum and ->completed pair of fields. This change allows lockless code to obtain the complete grace-period state with a single READ_ONCE(), which is needed to maintain tolerable lock contention during the upcoming consolidation of the three RCU flavors. Note that grace-period sequence numbers are already used by rcu_barrier(), expedited RCU grace periods, and SRCU, and are thus already heavily used and well-tested. Joel Fernandes contributed a number of excellent fixes and improvements. - Clean up some grace-period-reporting loose ends, including improving the handling of quiescent states from offline CPUs and fixing some false-positive WARN_ON_ONCE() invocations. (Strictly speaking, the WARN_ON_ONCE() invocations were quite correct, but their invariants were (harmlessly) violated by the earlier sloppy handling of quiescent states from offline CPUs.) In addition, improve grace-period forward-progress guarantees so as to allow removal of fail-safe checks that required otherwise needless lock acquisitions. Finally, add more diagnostics to help debug the upcoming consolidation of the RCU-bh, RCU-preempt, and RCU-sched flavors. - Additional miscellaneous fixes, including those contributed by Byungchul Park, Mauro Carvalho Chehab, Joe Perches, Joel Fernandes, Steven Rostedt, Andrea Parri, and Neil Brown. - Additional torture-test changes, including several contributed by Arnd Bergmann and Joel Fernandes. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-17 09:16:02 +02:00
Mimi Zohar	c77b8cdf74	module: replace the existing LSM hook in init_module Both the init_module and finit_module syscalls call either directly or indirectly the security_kernel_read_file LSM hook. This patch replaces the direct call in init_module with a call to the new security_kernel_load_data hook and makes the corresponding changes in SELinux, LoadPin, and IMA. Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Jessica Yu <jeyu@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.morris@microsoft.com>	2018-07-16 12:31:57 -07:00
Mimi Zohar	a210fd32a4	kexec: add call to LSM hook in original kexec_load syscall In order for LSMs and IMA-appraisal to differentiate between kexec_load and kexec_file_load syscalls, both the original and new syscalls must call an LSM hook. This patch adds a call to security_kernel_load_data() in the original kexec_load syscall. Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.morris@microsoft.com>	2018-07-16 12:31:57 -07:00
Kamalesh Babulal	1d98a69e5c	livepatch: Remove reliable stacktrace check in klp_try_switch_task() Support for immediate flag was removed by commit `d0807da78e` ("livepatch: Remove immediate feature"). We bail out during patch registration for architectures, those don't support reliable stack trace. Remove the check in klp_try_switch_task(), as its not required. Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2018-07-16 17:50:33 +02:00
Tobias Tefke	788faab70d	perf, tools: Use correct articles in comments Some of the comments in the perf events code use articles incorrectly, using 'a' for words beginning with a vowel sound, where 'an' should be used. Signed-off-by: Tobias Tefke <tobias.tefke@tutanota.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: jolsa@redhat.com Cc: namhyung@kernel.org Link: http://lkml.kernel.org/r/20180709105715.22938-1-tobias.tefke@tutanota.com [ Fix a few more perf related 'a event' typo fixes from all around the kernel and tooling tree. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-16 00:21:03 +02:00
Sebastian Andrzej Siewior	af0fffd930	sched/core: Remove get_cpu() from sched_fork() get_cpu() disables preemption for the entire sched_fork() function. This get_cpu() was introduced in commit: `dd41f596cd` ("sched: cfs core code") ... which also invoked sched_balance_self() and this function required preemption do be off. Today, sched_balance_self() seems to be moved to ->task_fork callback which is invoked while the ->pi_lock is held. set_load_weight() could invoke reweight_task() which then via $callchain might end up in smp_processor_id() but since `update_load' is false this won't happen. I didn't find any this_cpu*() or similar usage during the initialisation of the task_struct. The `cpu' value (from get_cpu()) is only used later in __set_task_cpu() while the ->pi_lock lock is held. Based on this it is possible to remove get_cpu() and use smp_processor_id() for the `cpu' variable without breaking anything. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180706130615.g2ex2kmfu5kcvlq6@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-16 00:16:29 +02:00
Peter Zijlstra	45f5519ec5	sched/cpufreq: Clarify sugov_get_util() Add a few comments to (hopefully) clarifying some of the magic in sugov_get_util(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/20180705123617.GM2458@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-16 00:16:29 +02:00
Vincent Guittot	5fd778915a	sched/sysctl: Remove unused sched_time_avg_ms sysctl /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere, remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-16 00:16:29 +02:00
Vincent Guittot	bbb62c0b02	sched/core: Remove the rt_avg code rt_avg is not used anywhere anymore, so we can remove all related code. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-16 00:16:29 +02:00
Vincent Guittot	523e979d31	sched/core: Use PELT for scale_rt_capacity() The utilization of the CPU by RT, DL and IRQs are now tracked with PELT so we can use these metrics instead of rt_avg to evaluate the remaining capacity available for CFS class. scale_rt_capacity() behavior has been changed and now returns the remaining capacity available for CFS instead of a scaling factor because RT, DL and IRQ provide now absolute utilization value. The same formula as schedutil is used: IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg but the implementation is different because it doesn't return the same value and doesn't benefit of the same optimization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-16 00:16:25 +02:00
Vincent Guittot	dfa444dc2f	sched/cpufreq: Remove sugov_aggregate_util() There is no reason why sugov_get_util() and sugov_aggregate_util() were in fact separate functions. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [ Rebased after adding irq tracking and fixed some compilation errors. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1530200714-4504-9-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:21 +02:00
Vincent Guittot	9033ea1188	cpufreq/schedutil: Take time spent in interrupts into account The time spent executing IRQ handlers can be significant but it is not reflected in the utilization of CPU when deciding to choose an OPP. Now that we have access to this metric, schedutil can take it into account when selecting the OPP for a CPU. RQS utilization don't see the time spend under interrupt context and report their value in the normal context time window. We need to compensate this when adding interrupt utilization The CPU utilization is: IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg A test with iperf on hikey (octo arm64) gives the following speedup: iperf -c server_address -r -t 5 w/o patch w/ patch Tx 276 Mbits/sec 304 Mbits/sec +10% Rx 299 Mbits/sec 328 Mbits/sec +9% 8 iterations stdev is lower than 1% Only WFI idle state is enabled (shallowest idle state). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:21 +02:00
Vincent Guittot	91c27493e7	sched/irq: Add IRQ utilization tracking interrupt and steal time are the only remaining activities tracked by rt_avg. Like for sched classes, we can use PELT to track their average utilization of the CPU. But unlike sched class, we don't track when entering/leaving interrupt; Instead, we take into account the time spent under interrupt context when we update rqs' clock (rq_clock_task). This also means that we have to decay the normal context time and account for interrupt time during the update. That's also important to note that because: rq_clock == rq_clock_task + interrupt time and rq_clock_task is used by a sched class to compute its utilization, the util_avg of a sched class only reflects the utilization of the time spent in normal context and not of the whole time of the CPU. The utilization of interrupt gives an more accurate level of utilization of CPU. The CPU utilization is: avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq Most of the time, avg_irq is small and neglictible so the use of the approximation CPU utilization = /Sum avg_rq was enough. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:21 +02:00
Vincent Guittot	8cc90515a4	cpufreq/schedutil: Use DL utilization tracking Now that we have both the DL class bandwidth requirement and the DL class utilization, we can detect when CPU is fully used so we should run at max. Otherwise, we keep using the DL bandwidth requirement to define the utilization of the CPU. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1530200714-4504-6-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:21 +02:00
Vincent Guittot	3727e0e163	sched/dl: Add dl_rq utilization tracking Similarly to what happens with RT tasks, CFS tasks can be preempted by DL tasks and the CFS's utilization might no longer describes the real utilization level. Current DL bandwidth reflects the requirements to meet deadline when tasks are enqueued but not the current utilization of the DL sched class. We track DL class utilization to estimate the system utilization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:20 +02:00
Vincent Guittot	3ae117c6cd	cpufreq/schedutil: Use RT utilization tracking Add both CFS and RT utilization when selecting an OPP for CFS tasks as RT can preempt and steal CFS's running time. RT util_avg is used to take into account the utilization of RT tasks on the CPU when selecting OPP. If a RT task migrate, the RT utilization will not migrate but will decay over time. On an overloaded CPU, CFS utilization reflects the remaining utilization avialable on CPU. When RT task migrates, the CFS utilization will increase when tasks will start to use the newly available capacity. At the same pace, RT utilization will decay and both variations will compensate each other to keep unchanged overall utilization and will prevent any OPP drop. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1530200714-4504-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:20 +02:00
Vincent Guittot	371bf42732	sched/rt: Add rt_rq utilization tracking schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks are preempted by RT tasks and in this case util_avg reflects the remaining capacity but not what CFS want to use. In such case, schedutil can select a lower OPP whereas the CPU is overloaded. In order to have a more accurate view of the utilization of the CPU, we track the utilization of RT tasks. Only util_avg is correctly tracked but not load_avg and runnable_load_avg which are useless for rt_rq. rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are the same at the root group level, so the PELT windows of the util_sum are aligned. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:20 +02:00
Vincent Guittot	c079629862	sched/pelt: Move PELT related code in a dedicated file We want to track rt_rq's utilization as a part of the estimation of the whole rq's utilization. This is necessary because rt tasks can steal utilization to cfs tasks and make them lighter than they are. As we want to use the same load tracking mecanism for both and prevent useless dependency between cfs and rt code, PELT code is moved in a dedicated file. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:20 +02:00
Quentin Perret	8fe5c5a937	sched/fair: Fix util_avg of new tasks for asymmetric systems When a new task wakes-up for the first time, its initial utilization is set to half of the spare capacity of its CPU. The current implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE directly as a capacity reference. As a result, on a big.LITTLE system, a new task waking up on an idle little CPU will be given ~512 of util_avg, even if the CPU's capacity is significantly less than that. Fix this by computing the spare capacity with arch_scale_cpu_capacity(). Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:20 +02:00
Peter Zijlstra	be45bf5395	watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug When scheduling is delayed for longer than the softlockup interrupt period it is possible to double-queue the cpu_stop_work, causing list corruption. Cure this by adding a completion to track the cpu_stop_work's progress. Reported-by: kernel test robot <lkp@intel.com> Tested-by: Rong Chen <rong.a.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `9cf57731b6` ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work") Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:51:19 +02:00
Juri Lelli	e117cb52bd	sched/deadline: Fix switched_from_dl() warning Mark noticed that syzkaller is able to reliably trigger the following warning: dl_rq->running_bw > dl_rq->this_bw WARNING: CPU: 1 PID: 153 at kernel/sched/deadline.c:124 switched_from_dl+0x454/0x608 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 153 Comm: syz-executor253 Not tainted 4.18.0-rc3+ #29 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x458 show_stack+0x20/0x30 dump_stack+0x180/0x250 panic+0x2dc/0x4ec __warn_printk+0x0/0x150 report_bug+0x228/0x2d8 bug_handler+0xa0/0x1a0 brk_handler+0x2f0/0x568 do_debug_exception+0x1bc/0x5d0 el1_dbg+0x18/0x78 switched_from_dl+0x454/0x608 __sched_setscheduler+0x8cc/0x2018 sys_sched_setattr+0x340/0x758 el0_svc_naked+0x30/0x34 syzkaller reproducer runs a bunch of threads that constantly switch between DEADLINE and NORMAL classes while interacting through futexes. The splat above is caused by the fact that if a DEADLINE task is setattr back to NORMAL while in non_contending state (blocked on a futex - inactive timer armed), its contribution to running_bw is not removed before sub_rq_bw() gets called (!task_on_rq_queued() branch) and the latter sees running_bw > this_bw. Fix it by removing a task contribution from running_bw if the task is not queued and in non_contending state while switched to a different class. Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180711072948.27061-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:47:33 +02:00
Ingo Molnar	fdf2ceb7f5	Linux 4.18-rc5 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltLpVUeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWisH/ikONMwV7OrSk36Y 5rxzTFUoBk0Qffct88gtSNuRVCxaVb1ofCndvFJE6A6HfJkWpbBzH6eq90aakmJi f7uFcu4YmsQpeQaf9lpftWmY2vDf2fIadVTV0RnSMXks57wMax1cpBe7LJGpz13e f+g5XRVs1MdlZVtr6tG2SU3Y5AqVVVsYe/0DBPonEqeh9/JJbPFCuNkFOxxzAqPu VTnjyoOqG8qtZzjklNtR5rZn0Gv592tWX36eiWTQdThNmVFkGEAJwsHCQlY4OQYK 61QN4UhOHiu8e1ZuGDNEDhNVRnKtaaYUPFeWL1wLRW73ul4P3ZkpvpS8QTMwcFJI JjzNOkI= =ckcO -----END PGP SIGNATURE----- Merge tag 'v4.18-rc5' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-15 23:33:26 +02:00
Isaac J. Manjarres	9fb8d5dc4b	stop_machine: Disable preemption when waking two stopper threads When cpu_stop_queue_two_works() begins to wake the stopper threads, it does so without preemption disabled, which leads to the following race condition: The source CPU calls cpu_stop_queue_two_works(), with cpu1 as the source CPU, and cpu2 as the destination CPU. When adding the stopper threads to the wake queue used in this function, the source CPU stopper thread is added first, and the destination CPU stopper thread is added last. When wake_up_q() is invoked to wake the stopper threads, the threads are woken up in the order that they are queued in, so the source CPU's stopper thread is woken up first, and it preempts the thread running on the source CPU. The stopper thread will then execute on the source CPU, disable preemption, and begin executing multi_cpu_stop(), and wait for an ack from the destination CPU's stopper thread, with preemption still disabled. Since the worker thread that woke up the stopper thread on the source CPU is affine to the source CPU, and preemption is disabled on the source CPU, that thread will never run to dequeue the destination CPU's stopper thread from the wake queue, and thus, the destination CPU's stopper thread will never run, causing the source CPU's stopper thread to wait forever, and stall. Disable preemption when waking the stopper threads in cpu_stop_queue_two_works(). Fixes: `0b26351b91` ("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock") Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org> Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org> Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: peterz@infradead.org Cc: matt@codeblueprint.co.uk Cc: bigeasy@linutronix.de Cc: gregkh@linuxfoundation.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1530655334-4601-1-git-send-email-isaacm@codeaurora.org	2018-07-15 12:12:45 +02:00
Linus Torvalds	3951dbf232	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "A clocksource driver fix and a revert" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: arm_arch_timer: Set arch_mem_timer cpumask to cpu_possible_mask Revert "tick: Prefer a lower rating device only if it's CPU local device"	2018-07-13 13:36:36 -07:00
Linus Torvalds	ae4ea3975d	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq fixes from Ingo Molnar: "Various rseq ABI fixes and cleanups: use get_user()/put_user(), validate parameters and use proper uapi types, etc" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: cleanup: Update comment above rseq_prepare_unload rseq: Remove unused types_32_64.h uapi header rseq: uapi: Declare rseq_cs field as union, update includes rseq: uapi: Update uapi comments rseq: Use get_user/put_user rather than __get_user/__put_user rseq: Use __u64 for rseq_cs fields, validate user inputs	2018-07-13 12:50:42 -07:00
Linus Torvalds	35a84f34cf	Joel Fernandes asked to add a feature in tracing that Android had its own patch internally for. I took it back in 4.13. Now he realizes that he had a mistake, and swapped the values from what Android had. This means that the old Android tools will break when using a new kernel that has the new feature on it. The options are: 1. To swap it back to what Android wants. 2. Add a command line option or something to do the swap 3. Just let Android carry a patch that swaps it back Since it requires setting a tracing option to enable this anyway, I doubt there are other users of this than Android. Thus, I've decided to take option 1. If someone else is actually depending on the order that is in the kernel, then we will have to revert this change and go to option 2 or 3. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW0ib3BQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qpH2APwJb1C72w6/QF9QK8I7HWzK3BN+9KuK xfJA+58HXzu7SgD+IXzhXW9tODU+sWbYr9cVOyj2ad6p8CYNDkPlVAJulwM= =XJE5 -----END PGP SIGNATURE----- Merge tag 'trace-v4.18-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixlet from Steven Rostedt: "Joel Fernandes asked to add a feature in tracing that Android had its own patch internally for. I took it back in 4.13. Now he realizes that he had a mistake, and swapped the values from what Android had. This means that the old Android tools will break when using a new kernel that has the new feature on it. The options are: 1. To swap it back to what Android wants. 2. Add a command line option or something to do the swap 3. Just let Android carry a patch that swaps it back Since it requires setting a tracing option to enable this anyway, I doubt there are other users of this than Android. Thus, I've decided to take option 1. If someone else is actually depending on the order that is in the kernel, then we will have to revert this change and go to option 2 or 3" * tag 'trace-v4.18-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Reorder display of TGID to be after PID	2018-07-13 11:40:11 -07:00
Thomas Gleixner	fee0aede6f	cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support SMT) when the sysfs SMT control file is initialized. That was fine so far as this was only required to make the output of the control file correct and to prevent writes in that case. With the upcoming l1tf command line parameter, this needs to be set up before the L1TF mitigation selection and command line parsing happens. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de	2018-07-13 16:29:56 +02:00
Jiri Kosina	8e1b706b6e	cpu/hotplug: Expose SMT control init function The L1TF mitigation will gain a commend line parameter which allows to set a combination of hypervisor mitigation and SMT control. Expose cpu_smt_disable() so the command line parser can tweak SMT settings. [ tglx: Split out of larger patch and made it preserve an already existing force off state ] Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.039715135@linutronix.de	2018-07-13 16:29:55 +02:00
Joel Fernandes (Google)	f8494fa3dd	tracing: Reorder display of TGID to be after PID Currently ftrace displays data in trace output like so: _-----=> irqs-off / _----=> need-resched \| / _---=> hardirq/softirq \|\| / _--=> preempt-depth \|\|\| / delay TASK-PID CPU TGID \|\|\|\| TIMESTAMP FUNCTION \| \| \| \| \|\|\|\| \| \| bash-1091 [000] ( 1091) d..2 28.313544: sched_switch: However Android's trace visualization tools expect a slightly different format due to an out-of-tree patch patch that was been carried for a decade, notice that the TGID and CPU fields are reversed: _-----=> irqs-off / _----=> need-resched \| / _---=> hardirq/softirq \|\| / _--=> preempt-depth \|\|\| / delay TASK-PID TGID CPU \|\|\|\| TIMESTAMP FUNCTION \| \| \| \| \|\|\|\| \| \| bash-1091 ( 1091) [002] d..2 64.965177: sched_switch: From kernel v4.13 onwards, during which TGID was introduced, tracing with systrace on all Android kernels will break (most Android kernels have been on 4.9 with Android patches, so this issues hasn't been seen yet). From v4.13 onwards things will break. The chrome browser's tracing tools also embed the systrace viewer which uses the legacy TGID format and updates to that are known to be difficult to make. Considering this, I suggest we make this change to the upstream kernel and backport it to all Android kernels. I believe this feature is merged recently enough into the upstream kernel that it shouldn't be a problem. Also logically, IMO it makes more sense to group the TGID with the TASK-PID and the CPU after these. Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org Cc: jreck@google.com Cc: tkjos@google.com Cc: stable@vger.kernel.org Fixes: `441dae8f2f` ("tracing: Add support for display of tgid in trace output") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-12 19:56:25 -04:00
Paul E. McKenney	18952651da	Merge branches 'fixes1.2018.07.12b' and 'torture1.2018.07.12b' into HEAD fixes1.2018.07.12b: Post-gp_seq miscellaneous fixes torture1.2018.07.12b: Post-gp_seq torture-test updates	2018-07-12 15:42:41 -07:00
Joel Fernandes (Google)	bf5b64355a	rcutorture: Fix rcu_barrier successes counter The rcutorture test module currently increments both successes and error for the barrier test upon error, which results in misleading statistics being printed. This commit therefore changes the code to increment the success counter only when the test actually passes. This change was tested by by returning from the barrier callback without incrementing the callback counter, thus introducing what appeared to rcutorture to be rcu_barrier() failures. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:08 -07:00
Joel Fernandes (Google)	4babd855fd	rcutorture: Add support to detect if boost kthread prio is too low When rcutorture is built in to the kernel, an earlier patch detects that and raises the priority of RCU's kthreads to allow rcutorture's RCU priority boosting tests to succeed. However, if rcutorture is built as a module, those priorities must be raised manually via the rcutree.kthread_prio kernel boot parameter. If this manual step is not taken, rcutorture's RCU priority boosting tests will fail due to kthread starvation. One approach would be to raise the default priority, but that risks breaking existing users. Another approach would be to allow runtime adjustment of RCU's kthread priorities, but that introduces numerous "interesting" race conditions. This patch therefore instead detects too-low priorities, and prints a message and disables the RCU priority boosting tests in that case. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:08 -07:00
Arnd Bergmann	622be33fcb	rcutorture: Use monotonic timestamp for stall detection The get_seconds() call is deprecated because it overflows on 32-bit architectures. The algorithm in rcu_torture_stall() can deal with the overflow, but another problem here is that using a CLOCK_REALTIME stamp can lead to a false-positive stall warning when a settimeofday() happens concurrently. Using ktime_get_seconds() instead avoids those issues and will never overflow. The added cast to 'unsigned long' however is necessary to make ULONG_CMP_LT() work correctly. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:07 -07:00
Joel Fernandes (Google)	3b745c8969	rcutorture: Make boost test more robust Currently, with RCU_BOOST disabled, I get no failures when forcing rcutorture to test RCU boost priority inversion. The reason seems to be that we don't check for failures if the callback never ran at all for the duration of the boost-test loop. Further, the 'rtb' and 'rtbf' counters seem to be used inconsistently. 'rtb' is incremented at the start of each test and 'rtbf' is incremented per-cpu on each failure of call_rcu. So its possible 'rtbf' > 'rtb'. To test the boost with rcutorture, I did following on a 4-CPU x86 machine: modprobe rcutorture test_boost=2 sleep 20 rmmod rcutorture With patch: rtbf: 8 rtb: 12 Without patch: rtbf: 0 rtb: 2 In summary this patch: - Increments failed and total test counters once per boost-test. - Checks for failure cases correctly. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:06 -07:00
Joel Fernandes (Google)	450efca718	rcutorture: Disable RT throttling for boost tests Currently rcutorture is not able to torture RCU boosting properly. This is because the rcutorture's boost threads which are doing the torturing may be throttled due to RT throttling. This patch makes rcutorture use the right torture technique (unthrottled rcutorture boost tasks) for torturing RCU so that the test fails correctly when no boost is available. Currently this requires accessing sysctl_sched_rt_runtime directly, but that should be Ok since rcutorture is test code. Such direct access is also only possible if rcutorture is used as a built-in so make it conditional on that. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:06 -07:00
Paul E. McKenney	bf1bef50be	rcutorture: Emphasize testing of single reader protection type For RCU implementations supporting multiple types of reader protection, rcutorture currently randomly selects the combinations of types of protection for each phase of each reader. The problem with this, for example, given the four kinds of protection for RCU-sched (local_irq_disable(), local_bh_disable(), preempt_disable(), and rcu_read_lock_sched()), the reader will be protected by a single mechanism only 25% of the time. We really heavier testing of single read-side mechanisms. This commit therefore uses only a single mechanism about 60% of the time, half of the time explicitly and one-eighth of the time by chance. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:05 -07:00
Paul E. McKenney	2397d072f7	rcutorture: Handle extended read-side critical sections This commit enables rcutorture to test whether RCU properly aggregates different types of read-side critical sections into a larger section covering the set. It does this by extending an initial read-side critical section randomly for a random number of extensions. There is a new rcu_torture_ops field ->extendable that specifies what extensions are permitted for a given flavor of RCU (for example, SRCU does not permit any extensions, while RCU-sched permits all types). Note that if a given operation (for example, local_bh_disable()) extends an RCU read-side critical section, then rcutorture feels free to also start and end the critical section with that operation's type of disabling. Disabling operations include local_bh_disable(), local_irq_disable(), and preempt_disable(). This commit also adds a new "busted_srcud" torture type, which verifies rcutorture's ability to detect extensions of RCU read-side critical sections that are not handled. Gotta test the test, after all! Note that it is not legal to invoke local_bh_disable() with interrupts disabled, and this transition is avoided by overriding the random-number generator when it wants to call local_bh_disable() while interrupts are disabled. The code instead leaves both interrupts and bh/softirq disabled in this case. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:05 -07:00
Paul E. McKenney	241b42522a	rcutorture: Make rcu_torture_timer() use rcu_torture_one_read() This commit saves a few lines of code by making rcu_torture_timer() invoke rcu_torture_one_read(), thus completing the consolidation of code between rcu_torture_timer() and rcu_torture_reader(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:04 -07:00
Paul E. McKenney	3025520ec4	rcutorture: Use per-CPU random state for rcu_torture_timer() Currently, the rcu_torture_timer() function uses a single global torture_random_state structure protected by a single global lock. This conflicts to some extent with performance and scalability, but even more with the goal of consolidating read-side testing with rcu_torture_reader(). This commit therefore creates a per-CPU torture_random_state structure for use by rcu_torture_timer() and eliminates the lock. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Make rcu_torture_timer_rand static, per 0day Test Robot report. ]	2018-07-12 15:42:04 -07:00
Paul E. McKenney	8da9a59523	rcutorture: Use atomic increment for n_rcu_torture_timers Currently, rcu_torture_timer() relies on a lock to guard updates to n_rcu_torture_timers. Unfortunately, consolidating code with rcu_torture_reader() will dispense with this lock. This commit therefore makes n_rcu_torture_timers be an atomic_long_t and uses atomic_long_inc() to carry out the update. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:03 -07:00
Paul E. McKenney	6b06aa723e	rcutorture: Extract common code from rcu_torture_reader() This commit extracts the code executed on each pass through the loop in rcu_torture_reader() into a new rcu_torture_one_read() function. This new function will also be used by rcu_torture_timer(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:02 -07:00
Paul E. McKenney	2d3625841d	rcuperf: Remove unused torturing_tasks() function The torturing_tasks() function in rcuperf.c is not used, so this commit removes it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:02 -07:00
Paul E. McKenney	6bea2cc5a9	rcu: Remove rcutorture test version and sequence number Back when RCU had a debugfs interface, there was a test version and sequence number that allowed associating debugfs data with a particular test run, where the test run started with modprobe and ended with rmmod, which was how tests were run back on the old ABAT system within IBM. But rcutorture testing no longer runs on ABAT, and there is no longer an RCU debugfs interface, so there is no longer any need for test versions and sequence numbers. This commit therefore removes the rcutorture_record_test_transition() and rcutorture_record_progress() functions, and along with them the rcutorture_testseq and rcutorture_vernum variables that they update. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:01 -07:00
Paul E. McKenney	028be12b29	rcutorture: Change units of onoff_interval to jiffies Some RCU bugs have been sensitive to the frequency of CPU-hotplug operations, which have been gradually increased over time. But this frequency is now at the one-second lower limit that can be specified using the rcutorture.onoff_interval kernel parameter. This commit therefore changes the units of rcutorture.onoff_interval from seconds to jiffies, and also sets the value specified for this kernel parameter in the TREE03 rcutorture scenario to 200, which is 200 milliseconds for HZ=1000. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:42:01 -07:00
Joel Fernandes (Google)	c7cd161ecb	rcu: Assign higher prio to RCU threads if rcutorture is built-in The rcutorture RCU priority boosting tests fail even with CONFIG_RCU_BOOST set because rcutorture's threads run at the same priority as the default RCU kthreads (RT class with priority of 1). This patch checks if RCU torture is built into the kernel and if so, assigns RT priority 1 to the RCU threads, allowing the rcutorture boost tests to pass. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:26 -07:00
Paul E. McKenney	52e17ba1d0	srcu: Add grace-period number to rcutorture statistics printout This commit adds the SRCU grace-period number to the rcutorture statistics printout, which allows it to be compared to the rcutorture "Writer stall state" message. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:25 -07:00
Paul E. McKenney	89b4cd4b9e	rcu: Print stall-warning NMI dyntick state in hexadecimal The ->dynticks_nmi_nesting field records the nesting depth of both interrupt and NMI handlers. Because the kernel can enter interrupts and never leave them (and vice versa) and because NMIs can interrupt manipulation of the ->dynticks_nmi_nesting field, the values in this field must be both chosen and maniupated very carefully. As a result, although the value is zero when the corresponding CPU is executing neither an interrupt nor an NMI handler, it is 4,611,686,018,427,387,906 on 64-bit systems when there is a single level of interrupt/NMI handling in progress. This number is difficult to remember and interpret, so this commit switches the output to hexadecimal, resulting in the much nicer 0x4000000000000002. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:24 -07:00
Paul E. McKenney	2ee5aca546	rcu: Make rcu_seq_diff() more exact The current implementatation of rcu_seq_diff() follows tradition in providing a rough-and-ready approximation of the number of elapsed grace periods between the two rcu_seq values. However, this difference is used to flag RCU-failure "near misses", which can be a valuable debugging aid, so more exactitude would be an improvement. This commit therefore improves the accuracy of rcu_seq_diff(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:23 -07:00
Byungchul Park	67abb96cbf	rcu: Check the range of jiffies_till_{first,next}_fqs when setting them Currently, the range of jiffies_till_{first,next}_fqs are checked and adjusted on and on in the loop of rcu_gp_kthread on runtime. However, it's enough to check them only when setting them, not every time in the loop. So make them handled on a setting time via sysfs. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:22 -07:00
Paul E. McKenney	47199a0812	rcu: Add diagnostics for rcutorture writer stall warning This commit adds any in-the-future ->gp_seq_needed fields to the diagnostics for an rcutorture writer stall warning message. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:22 -07:00
Steven Rostedt (VMware)	cd23ac8ddb	rcu: Add comment to the last sleep in the rcu tasks loop At the end of rcu_tasks_kthread() there's a lonely schedule_timeout_uninterruptible() call with no apparent rationale for its existence. But there is. It is to keep the thread from going into a tight loop if there's some anomaly. That really needs a comment. Link: http://lkml.kernel.org/r/20180524223839.GU3803@linux.vnet.ibm.com Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:21 -07:00
Steven Rostedt (VMware)	c03be752d3	rcu: Speed up calling of RCU tasks callbacks Joel Fernandes found that the synchronize_rcu_tasks() was taking a significant amount of time. He demonstrated it with the following test: # cd /sys/kernel/tracing # while [ 1 ]; do x=1; done & # echo '__schedule_bug:traceon' > set_ftrace_filter # time echo '!__schedule_bug:traceon' > set_ftrace_filter; real 0m1.064s user 0m0.000s sys 0m0.004s Where it takes a little over a second to perform the synchronize, because there's a loop that waits 1 second at a time for tasks to get through their quiescent points when there's a task that must be waited for. After discussion we came up with a simple way to wait for holdouts but increase the time for each iteration of the loop but no more than a full second. With the new patch we have: # time echo '!__schedule_bug:traceon' > set_ftrace_filter; real 0m0.131s user 0m0.000s sys 0m0.004s Which drops it down to 13% of what the original wait time was. Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:21 -07:00
Joel Fernandes (Google)	0d805a70a6	rcu: Add comment documenting how rcu_seq_snap works rcu_seq_snap may be tricky to decipher. Lets document how it works with an example to make it easier. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Shrink comment as suggested by Peter Zijlstra. ]	2018-07-12 15:39:20 -07:00
Paul E. McKenney	b06ae25a1e	rcu: Use RCU CPU stall timeout for rcu_check_gp_start_stall() Currently, rcu_check_gp_start_stall() waits for one second after the first request before complaining that a grace period has not yet started. This was desirable while testing the conversion from ->future_gp_needed[] to ->gp_seq_needed, but it is a bit on the hair-trigger side for production use under heavy load. This commit therefore makes this wait time be exactly that of the RCU CPU stall warning, allowing easy adjustment of both timeouts to suit the distribution or installation at hand. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:20 -07:00
Paul E. McKenney	51fbb910f5	rcu: Remove __maybe_unused from rcu_cpu_has_callbacks() The rcu_cpu_has_callbacks() function is now used in all configurations, so this commit removes the __maybe_unused. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:19 -07:00
Paul E. McKenney	9622179519	rcu: Remove "inline" from rcu_perf_print_module_parms() This function is in rcuperf.c, which is not an include file, so there is no problem dropping the "inline", especially given that this function is invoked only twice per rcuperf run. This commit therefore delegates the inlining decision to the compiler by dropping the "inline". Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:19 -07:00
Paul E. McKenney	eac45e586c	rcu: Remove "inline" from rcu_torture_print_module_parms() This function is in rcutorture.c, which is not an include file, so there is no problem dropping the "inline", especially given that this function is invoked only twice per rcutorture run. This commit therefore delegates the inlining decision to the compiler by dropping the "inline". Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:18 -07:00
Paul E. McKenney	95394e69c4	rcu: Remove "inline" from panic_on_rcu_stall() and rcu_blocking_is_gp() These functions are in kernel/rcu/tree.c, which is not an include file, so there is no problem dropping the "inline", especially given that these functions are nowhere near a fastpath. This commit therefore delegates the inlining decision to the compiler by dropping the "inline". Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:18 -07:00
Paul E. McKenney	ab6b82147f	rcu: Remove unused local variable "cpu" One danger of using __maybe_unused is that the compiler doesn't yell at you when you remove the last reference, witness rcu_bind_gp_kthread() and its local variable "cpu". This commit removes this local variable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:17 -07:00
Paul E. McKenney	164ba3fc48	rcu: Remove unused rcu_kick_nohz_cpu() function The rcu_kick_nohz_cpu() function is no longer used, and the functionality it used to provide is now provided by a call to resched_cpu() in the force-quiescent-state function rcu_implicit_dynticks_qs(). This commit therefore removes rcu_kick_nohz_cpu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:17 -07:00
Paul E. McKenney	c7037ff524	rcu: Clarify and correct the rcu_preempt_qs() header comment The rcu_preempt_qs() function only applies to the CPU, not the task. A task really is allowed to invoke this function while in an RCU-preempt read-side critical section, but only if it has first added itself to some leaf rcu_node structure's ->blkd_tasks list. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:16 -07:00
Paul E. McKenney	3b57a3994f	rcu: Inline rcu_dynticks_momentary_idle() into its sole caller The rcu_dynticks_momentary_idle() function is invoked only from rcu_momentary_dyntick_idle(), and neither function is particularly large. This commit therefore saves a few lines by inlining rcu_dynticks_momentary_idle() into rcu_momentary_dyntick_idle(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:16 -07:00
Paul E. McKenney	15651201fa	rcu: Mark task as .need_qs less aggressively If any scheduling-clock interrupt interrupts an RCU-preempt read-side critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs field is set. This causes the outermost rcu_read_unlock() to incur the extra overhead of calling into rcu_read_unlock_special(). This commit reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only if the grace period has been in effect for more than one second. Why one second? Because this is comfortably smaller than the minimum RCU CPU stall-warning timeout of three seconds, but long enough that the .need_qs marking should happen quite rarely. And if your RCU read-side critical section has run on-CPU for a full second, it is not unreasonable to invest some CPU time in ending the grace period quickly. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:15 -07:00
Paul E. McKenney	6f56f714db	rcu: Improve RCU-tasks naming and comments The naming and comments associated with some RCU-tasks code make the faulty assumption that context switches due to cond_resched() are voluntary. As several people pointed out, this is not the case. This commit therefore updates function names and comments to better reflect current reality. Reported-by: Byungchul Park <byungchul.park@lge.com> Reported-by: Joel Fernandes <joel@joelfernandes.org> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:15 -07:00
Joe Perches	a7538352da	rcu: Use pr_fmt to prefix "rcu: " to logging output This commit also adjusts some whitespace while in the area. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Revert string-breaking %s as requested by Andy Shevchenko. ]	2018-07-12 15:39:13 -07:00
Byungchul Park	07f27570dc	rcu: Improve rcu_note_voluntary_context_switch() reporting We expect a quiescent state of TASKS_RCU when cond_resched_tasks_rcu_qs() is called, no matter whether it actually be scheduled or not. However, it currently doesn't report the quiescent state when the task enters into __schedule() as it's called with preempt = true. So make it report the quiescent state unconditionally when cond_resched_tasks_rcu_qs() is called. And in TINY_RCU, even though the quiescent state of rcu_bh also should be reported when the tick interrupt comes from user, it doesn't. So make it reported. Lastly in TREE_RCU, rcu_note_voluntary_context_switch() should be reported when the tick interrupt comes from not only user but also idle, as an extended quiescent state. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Simplify rcutiny portion given no RCU-tasks for !PREEMPT. ]	2018-07-12 15:39:12 -07:00
Paul E. McKenney	3949fa9bac	rcu: Make rcu_read_unlock_special() static Because rcu_read_unlock_special() is no longer used outside of kernel/rcu/tree_plugin.h, this commit makes it static. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:11 -07:00
Paul E. McKenney	f2e2df5978	rcu: Add diagnostics for offline CPUs failing to report QS CPUs are expected to report quiescent states when coming online and when going offline, and grace-period initialization is supposed to handle any race conditions where a CPU's ->qsmask bit is set just after it goes offline. This commit adds diagnostics for the case where an offline CPU nevertheless has a grace period waiting on it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:10 -07:00
Paul E. McKenney	fea3f222d3	rcu: Record ->gp_state for both phases of grace-period initialization Grace-period initialization first processes any recent CPU-hotplug operations, and then initializes state for the new grace period. These two phases of initialization are currently not distinguished in debug prints, but the distinction is valuable in a number of debug situations. This commit therefore introduces two new values for ->gp_state, RCU_GP_ONOFF and RCU_GP_INIT, in order to make this distinction. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:09 -07:00
Paul E. McKenney	5773894231	rcu: Add CPU online/offline state to dump_blkd_tasks() Interactions between CPU-hotplug operations and grace-period initialization can result in dump_blkd_tasks(). One of the first debugging actions in this case is to search back in dmesg to work out which of the affected rcu_node structure's CPUs are online and to determine the last CPU-hotplug operation affecting any of those CPUs. This can be laborious and error-prone, especially when console output is lost. This commit therefore causes dump_blkd_tasks() to dump the state of the affected rcu_node structure's CPUs and the last grace period during which the last offline and online operation affected each of these CPUs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:09 -07:00
Paul E. McKenney	ff3cee3908	rcu: Add up-tree information to dump_blkd_tasks() diagnostics This commit updates dump_blkd_tasks() to print out quiescent-state bitmasks for the rcu_node structures further up the tree. This information helps debugging of interactions between CPU-hotplug operations and RCU grace-period initialization. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:08 -07:00
Paul E. McKenney	e05121ba5b	rcu: Remove CPU-hotplug failsafe from force-quiescent-state code path Now that quiescent states for newly offlined CPUs are reported either when that CPU goes offline or at the end of grace-period initialization, the CPU-hotplug failsafe in the force-quiescent-state code path is no longer needed. This commit therefore removes this failsafe. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:07 -07:00
Paul E. McKenney	17a8212b8d	rcu: Remove failsafe check for lost quiescent state Now that quiescent-state reporting is fully event-driven, this commit removes the check for a lost quiescent state from force_qs_rnp(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:06 -07:00
Paul E. McKenney	f34f2f5852	rcu: Move grace-period pre-init delay after pre-init The main race with the early part of grace-period initialization appears to be with CPU hotplug. To more fully open this race window, this commit moves the rcu_gp_slow() from the beginning of the early initialization loop to follow that loop, thus widening the race window, especially for the rcu_node structures that are initialized last. This commit also expands rcutree.gp_preinit_delay from 3 to 12, giving the same overall delay in the grace period, but concentrated in the spot where it will do the most good. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:06 -07:00
Paul E. McKenney	1f3e5f51b9	rcu: Add RCU-preempt check for waiting on newly onlined CPU RCU should only be waiting on CPUs that were online at the time that the current grace period started. Failure to abide by this rule can result in confusing splats during grace-period cleanup and initialization. This commit therefore adds a check to RCU-preempt's preempted-task queuing that checks for waiting on newly onlined CPUs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:05 -07:00
Paul E. McKenney	1e64b15a4b	rcu: Fix grace-period hangs due to race with CPU offline Without special fail-safe quiescent-state-propagation checks, grace-period hangs can result from the following scenario: 1. CPU 1 goes offline. 2. Because CPU 1 is the only CPU in the system blocking the current grace period, the grace period ends as soon as rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp() returns. 3. At this point, the leaf rcu_node structure's ->lock is no longer held: rcu_report_qs_rnp() has released it, as it must in order to awaken the RCU grace-period kthread. 4. At this point, that same leaf rcu_node structure's ->qsmaskinitnext field still records CPU 1 as being online. This is absolutely necessary because the scheduler uses RCU (in this case on the wake-up path while awakening RCU's grace-period kthread), and ->qsmaskinitnext contains RCU's idea as to which CPUs are online. Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's bit from ->qsmaskinitnext would result in a lockdep-RCU splat due to RCU being used from an offline CPU. 5. RCU's grace-period kthread awakens, sees that the old grace period has completed and that a new one is needed. It therefore starts a new grace period, but because CPU 1's leaf rcu_node structure's ->qsmaskinitnext field still shows CPU 1 as being online, this new grace period is initialized to wait for a quiescent state from the now-offline CPU 1. 6. Without the fail-safe force-quiescent-state checks, there would be no quiescent state from the now-offline CPU 1, which would eventually result in RCU CPU stall warnings and memory exhaustion. It would be good to get rid of the special fail-safe quiescent-state propagation checks, and thus it would be good to fix things so that the above scenario cannot happen. This commit therefore adds a new ->ofl_lock to the rcu_state structure. This lock is held by rcu_gp_init() across the applying of buffered online and offline operations to the rcu_node tree, and it is also held by rcu_cleanup_dying_idle_cpu() when buffering a new offline operation. This prevents rcu_gp_init() from acquiring the leaf rcu_node structure's lock during the interval between when rcu_cleanup_dying_idle_cpu() invokes rcu_report_qs_rnp(), which releases ->lock and the re-acquisition of that same lock. This in turn prevents the failure scenario outlined above, and will hopefully eventually allow removal of the offline-CPU checks from the force-quiescent-state code path. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:04 -07:00
Paul E. McKenney	ec2c29765a	rcu: Fix grace-period hangs from mid-init task resume Without special fail-safe quiescent-state-propagation checks, grace-period hangs can result from the following scenario: 1. A task running on a given CPU is preempted in its RCU read-side critical section. 2. That CPU goes offline, and there are now no online CPUs corresponding to that CPU's leaf rcu_node structure. 3. The rcu_gp_init() function does the first phase of grace-period initialization, and sets the aforementioned leaf rcu_node structure's ->qsmaskinit field to all zeroes. Because there is a blocked task, it does not propagate the zeroing of either ->qsmaskinit or ->qsmaskinitnext up the rcu_node tree. 4. The task resumes on some other CPU and exits its critical section. There is no grace period in progress, so the resulting quiescent state is not reported up the tree. 5. The rcu_gp_init() function does the second phase of grace-period initialization, which results in the leaf rcu_node structure being initialized to expect no further quiescent states, but with that structure's parent expecting a quiescent-state report. The parent will never receive a quiescent state from this leaf rcu_node structure, so the grace period will hang, resulting in RCU CPU stall warnings. It would be good to get rid of the special fail-safe quiescent-state propagation checks. This commit therefore checks the leaf rcu_node structure's ->wait_blkd_tasks field during grace-period initialization. If this flag is set, the rcu_report_qs_rnp() is invoked to immediately report the possible quiescent state. While in the neighborhood, this commit also report quiescent states for any CPUs that went offline between the two phases of grace-period initialization, thus reducing grace-period delays and hopefully eventually allowing removal of offline-CPU checks from the force-quiescent-state code path. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:04 -07:00
Paul E. McKenney	0b107d24d9	rcu: Suppress false-positive splats from mid-init task resume Consider the following sequence of events in a PREEMPT=y kernel: 1. All CPUs corresponding to a given leaf rcu_node structure are offline. 2. The first phase of the rcu_gp_init() function's grace-period initialization runs, and sets that rcu_node structure's ->qsmaskinit to zero, as it should. 3. One of the CPUs corresponding to that rcu_node structure comes back online. Note that because this CPU came online after the grace period started, this grace period can safely ignore this newly onlined CPU. 4. A task running on the newly onlined CPU enters an RCU-preempt read-side critical section, and is then preempted. Because the corresponding rcu_node structure's ->qsmask is zero, rcu_preempt_ctxt_queue() leaves the rcu_node structure's ->gp_tasks field NULL, as it should. 5. The rcu_gp_init() function continues running the second phase of grace-period initialization. The ->qsmask field of the parent of the aforementioned leaf rcu_node structure is set to not expect a quiescent state from the leaf, as is only right and proper. However, when rcu_gp_init() reaches the leaf, it invokes rcu_preempt_check_blocked_tasks(), which sees that the leaf's ->blkd_tasks list is non-empty, and therefore sets the leaf's ->gp_tasks field to reference the first task on that list. 6. The grace period ends before the preempted task resumes, which is perfectly fine, given that this grace period was under no obligation to wait for that task to exit its late-starting RCU-preempt read-side critical section. Unfortunately, the leaf's ->gp_tasks field is non-NULL, so rcu_gp_cleanup() splats. After all, it appears to rcu_gp_cleanup() that the grace period failed to wait for a task that was supposed to be blocking that grace period. This commit avoids this false-positive splat by adding a check of both ->qsmaskinit and ->wait_blkd_tasks to rcu_preempt_check_blocked_tasks(). If both ->qsmaskinit and ->wait_blkd_tasks are zero, then the task must have entered its RCU-preempt read-side critical section late (after all, the CPU that it is running on was not online at that time), which means that the upper-level rcu_node structure won't be waiting for anything on the leaf anyway. If ->wait_blkd_tasks is non-zero, then there is at least one task on ths rcu_node structure's ->blkd_tasks list whose RCU read-side critical section predates the current grace period. If ->qsmaskinit is non-zero, there is at least one CPU that was online at the start of the current grace period. Thus, if both are zero, there is nothing to wait for. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:03 -07:00
Paul E. McKenney	99990da1b3	rcu: Suppress more involved false-positive preempted-task splats Consider the following sequence of events in a PREEMPT=y kernel: 1. All but one of the CPUs corresponding to a given leaf rcu_node structure go offline. Each of these CPUs clears its bit in that structure's ->qsmaskinitnext field. 2. A new grace period starts, and rcu_gp_init() scans the leaf rcu_node structures, applying CPU-hotplug changes since the start of the previous grace period, including those changes in #1 above. This copies each leaf structure's ->qsmaskinitnext to its ->qsmask field, which represents the CPUs that this new grace period will wait on. Each copy operation is done holding the corresponding leaf rcu_node structure's ->lock, and at the end of this scan, rcu_gp_init() holds no locks. 3. The last CPU corresponding to #1's leaf rcu_node structure goes offline, clearing its bit in that structure's ->qsmaskinitnext field, but not touching the ->qsmaskinit field. Note that rcu_gp_init() is not currently holding any locks! This CPU does -not- report a quiescent state because the grace period has not yet initialized itself sufficiently to have set any bits in any of the leaf rcu_node structures' ->qsmask fields. 4. The rcu_gp_init() function continues initializing the new grace period, copying each leaf rcu_node structure's ->qsmaskinit field to its ->qsmask field while holding the corresponding ->lock. This sets the ->qsmask bit corresponding to #3's CPU. 5. Before the grace period ends, #3's CPU comes back online. Because te grace period has not yet done any force-quiescent-state scans (which would report a quiescent state on behalf of any offline CPUs), this CPU's ->qsmask bit is still set. 6. A task running on the newly onlined CPU is preempted while in an RCU read-side critical section. Because this CPU's ->qsmask bit is net, not only does this task queue itself on the leaf rcu_node structure's ->blkd_tasks list, it also sets that structure's ->gp_tasks pointer to reference it. 7. The grace period started in #1 above comes to an end. This results in rcu_gp_cleanup() being invoked, which, among other things, checks to make sure that there are no tasks blocking the just-ended grace period, that is, that all ->gp_tasks pointers are NULL. The ->gp_tasks pointer corresponding to the task preempted in #3 above is non-NULL, which results in a splat. This splat is a false positive. The task's RCU read-side critical section cannot have begun before the just-ended grace period because this would mean either: (1) The CPU came online before the grace period started, which cannot have happened because the grace period started before that CPU went offline, or (2) The task started its RCU read-side critical section on some other CPU, but then it would have had to have been preempted before migrating to this CPU, which would mean that it would have instead queued itself on that other CPU's rcu_node structure. RCU's grace periods thus are working correctly. Or, more accurately, that remaining bugs in RCU's grace periods are elsewhere. This commit eliminates this false positive by adding code to the end of rcu_cpu_starting() that reports a quiescent state to RCU, which has the side-effect of clearing that CPU's ->qsmask bit, preventing the above scenario. This approach has the added benefit of more promptly reporting quiescent states corresponding to offline CPUs. Nevertheless, this commit does -not- remove the need for the force-quiescent-state scans to check for offline CPUs, given that a CPU might remain offline indefinitely. And without the checks in the force-quiescent-state scans, the grace period would also persist indefinitely, which could result in hangs or memory exhaustion. Note well that the call to rcu_report_qs_rnp() reporting the quiescent state must come -after- the setting of this CPU's bit in the leaf rcu_node structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU will complain bitterly about quiescent states coming from an offline CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:03 -07:00
Paul E. McKenney	fece27760f	rcu: Suppress false-positive preempted-task splats Consider the following sequence of events in a PREEMPT=y kernel: 1. All CPUs corresponding to a given rcu_node structure go offline. A new grace period starts just after the CPU-hotplug code path does its synchronize_rcu() for the last CPU, so at least this CPU is present in that structure's ->qsmask. 2. Before the grace period ends, a CPU comes back online, and not just any CPU, but the one corresponding to a non-zero bit in the leaf rcu_node structure's ->qsmask. 3. A task running on the newly onlined CPU is preempted while in an RCU read-side critical section. Because this CPU's ->qsmask bit is net, not only does this task queue itself on the leaf rcu_node structure's ->blkd_tasks list, it also sets that structure's ->gp_tasks pointer to reference it. 4. The grace period started in #1 above comes to an end. This results in rcu_gp_cleanup() being invoked, which, among other things, checks to make sure that there are no tasks blocking the just-ended grace period, that is, that all ->gp_tasks pointers are NULL. The ->gp_tasks pointer corresponding to the task preempted in #3 above is non-NULL, which results in a splat. This splat is a false positive. The task's RCU read-side critical section cannot have begun before the just-ended grace period because this would mean either: (1) The CPU came online before the grace period started, which cannot have happened because the grace period started before that CPU was all the way offline, or (2) The task started its RCU read-side critical section on some other CPU, but then it would have had to have been preempted before migrating to this CPU, which would mean that it would have instead queued itself on that other CPU's rcu_node structure. This commit eliminates this false positive by adding code to the end of rcu_cleanup_dying_idle_cpu() that reports a quiescent state to RCU, which has the side-effect of clearing that CPU's ->qsmask bit, preventing the above scenario. This approach has the added benefit of more promptly reporting quiescent states corresponding to offline CPUs. Note well that the call to rcu_report_qs_rnp() reporting the quiescent state must come -before- the clearing of this CPU's bit in the leaf rcu_node structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU will complain bitterly about quiescent states coming from an offline CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:02 -07:00
Paul E. McKenney	5554788e1d	rcu: Suppress false-positive offline-CPU lockdep-RCU splat The rcu_lockdep_current_cpu_online() function currently checks only the RCU-sched data structures to determine whether or not RCU believes that a given CPU is offline. Unfortunately, there are multiple flavors of RCU, which means that there is a short window of time during which the various flavors disagree as to whether or not a given CPU is offline. This can result in false-positive lockdep-RCU splats in which some other flavor of RCU tries to do something based on its view that the CPU is online, only to get hit with a lockdep-RCU splat because RCU-sched instead believes that the CPU is offline. This commit therefore changes rcu_lockdep_current_cpu_online() to scan all RCU flavors and to consider a given CPU to be online if any of the RCU flavors believe it to be online, thus preventing these false-positive splats. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:02 -07:00
Paul E. McKenney	928164351e	rcu: Prevent useless FQS scan after all CPUs have checked in The force_qs_rnp() function checks for ->qsmask being all zero, that is, all CPUs for the current rcu_node structure having already passed through quiescent states. But with RCU-preempt, this is not sufficient to report quiescent states further up the tree, so there are further checks that can initiate RCU priority boosting and also for races with CPU-hotplug operations. However, if neither of these further checks apply, the code proceeds to carry out a useless scan of an all-zero ->qsmask. This commit therefore adds code to release the current rcu_node structure's lock and continue on to the next rcu_node structure, thereby avoiding this useless scan. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:01 -07:00
Paul E. McKenney	91f63ced7d	rcu: Replace smp_wmb() with smp_store_release() for stall check This commit gets rid of the smp_wmb() in record_gp_stall_check_time() in favor of an smp_store_release(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:01 -07:00
Paul E. McKenney	77cfc7bf24	rcu: Fix typo and add additional debug This commit fixes a typo and adds some additional debugging to the message emitted when a task blocking the current grace period is listed as blocking it when either that grace period ends or the next grace period begins. This commit also reformats the console message for readability. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:39:00 -07:00
Paul E. McKenney	c74859d1eb	rcu: Make rcu_report_unblock_qs_rnp() warn on violated preconditions If rcu_report_unblock_qs_rnp() is invoked on something other than preemptible RCU or if there are still preempted tasks blocking the current grace period, something went badly wrong in the caller. This commit therefore adds WARN_ON_ONCE() to these conditions, but leaving the legitimate reason for early exit (rnp->qsmask != 0) unwarned. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:59 -07:00
Paul E. McKenney	8d672fa6bf	rcu: Make rcu_init_new_rnp() stop upon already-set bit Currently, rcu_init_new_rnp() walks up the rcu_node combining tree, setting bits in the ->qsmaskinit fields on the way up. It walks up unconditionally, regardless of the initial state of these bits. This is OK because only the corresponding RCU grace-period kthread ever tests or sets these bits during runtime. However, it is also pointless, and it increases both memory and lock contention (albeit only slightly), so this commit stops the walk as soon as an already-set bit is encountered. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:59 -07:00
Paul E. McKenney	c50cbe535c	rcu: Fix an obsolete ->qsmaskinit comment Back in the old days, when grace-period initialization blocked CPU hotplug, the ->qsmaskinit mask was indeed updated at the time that a given CPU went offline. However, with the deferral of these updates until the beginning of the next grace period in commit `0aa04b055e` ("rcu: Process offlining and onlining only at grace-period start"), it is instead ->qsmaskinitnext that gets updated at that time. This commit therefore updates the obsolete comment. It also fixes punctuation while on the topic of comments mentioning ->qsmaskinit. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:58 -07:00
Paul E. McKenney	962aff03c3	rcu: Clean up handling of tasks blocked across full-rcu_node offline Commit `0aa04b055e` ("rcu: Process offlining and onlining only at grace-period start") deferred handling of CPU-hotplug events until the start of the next grace period, but consider the following sequence of events: 1. A task is preempted within an RCU-preempt read-side critical section. 2. The CPU that this task was running on goes offline, along with all other CPUs sharing the corresponding leaf rcu_node structure. 3. The task resumes execution. 4. One of those CPUs comes back online before a new grace period starts. In step 2, the code in the next rcu_gp_init() invocation will (correctly) defer removing the leaf rcu_node structure from the upper-level bitmasks, and will (correctly) set that structure's ->wait_blkd_tasks field. During the ensuing interval, RCU will (correctly) track the tasks preempted on that structure because they must block any subsequent grace period. In step 3, the code in rcu_read_unlock_special() will (correctly) remove the task from the leaf rcu_node structure. From this point forward, RCU need not pay attention to this structure, at least not until one of the corresponding CPUs comes back online. In step 4, the code in the next rcu_gp_init() invocation will (incorrectly) invoke rcu_init_new_rnp(). This is incorrect because the corresponding rcu_cleanup_dead_rnp() was never invoked. This is nevertheless harmless because the upper-level bits are still set. So, no harm, no foul, right? At least, all is well until a little further into rcu_gp_init() invocation, which will notice that there are no longer any tasks blocked on the leaf rcu_node structure, conclude that there is no longer anything left over from step 2's offline operation, and will therefore invoke rcu_cleanup_dead_rnp(). But this invocation of rcu_cleanup_dead_rnp() is for the beginning of the earlier offline interval, and the previous invocation of rcu_init_new_rnp() is for the end of that same interval. That is right, they are invoked out of order. That cannot be good, can it? It turns out that this is not a (correctness!) problem because rcu_cleanup_dead_rnp() checks to see if any of the corresponding CPUs are online, and refuses to do anything if so. In other words, in the case where rcu_init_new_rnp() and rcu_cleanup_dead_rnp() execute out of order, they both have no effect. But this is at best an accident waiting to happen. This commit therefore adds logic to rcu_gp_init() so that rcu_init_new_rnp() and rcu_cleanup_dead_rnp() are always invoked in order, and so that neither are invoked at all in cases where RCU had to pay attention to the leaf rcu_node structure during the entire time that all corresponding CPUs were offline. And, while in the area, this commit reduces confusion by using formal parameters rather than local variables that just happen to have the same value at that particular point in the code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:58 -07:00
Joel Fernandes (Google)	226ca5e766	rcu: Identify grace period is in progress as we advance up the tree There's no need to keep checking the same starting node for whether a grace period is in progress as we advance up the funnel lock loop. Its sufficient if we just checked it in the start, and then subsequently checked the internal nodes as we advanced up the combining tree. This also makes sense because the grace-period updates propogate from the root to the leaf, so there's a chance we may find a grace period has started as we advance up, lets check for the same. Reported-by: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:57 -07:00
Joel Fernandes (Google)	df2bf8f7f7	rcu: Use better variable names in funnel locking loop The funnel locking loop in rcu_start_this_gp uses rcu_root as a temporary variable while walking the combining tree. This causes a tiresome exercise of a code reader reminding themselves that rcu_root may not be root. Lets just call it rnp, and rename other variables as well to be more appropriate. Original patch: https://patchwork.kernel.org/patch/10396577/ Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Fix name in comment as well. ]	2018-07-12 15:38:57 -07:00
Joel Fernandes	b73de91d6a	rcu: Rename the grace-period-request variables and parameters The name 'c' is used for variables and parameters holding the requested grace-period sequence number. However it is no longer very meaningful given the conversions from ->gpnum and (especially) ->completed to ->gp_seq. This commit therefore renames 'c' to 'gp_seq_req'. Previous patch discussion is at: https://patchwork.kernel.org/patch/10396579/ Signed-off-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:56 -07:00
Paul E. McKenney	3d18469a2b	rcu: Regularize resetting of rcu_data wrap indicator The rcu_data structure's ->gpwrap indicator is currently reset only when the CPU in question detects a new grace period. This is in theory sufficient because any CPU that has been out of action for long enough that its ->gpwrap indicator is set is guaranteed to see both the end of an old grace period and the start of a new one. However, the current code leaves a short window during which the ->gpwrap indicator has been reset but the corresponding ->gp_seq counter has not yet been brought up to date. This is harmless because interrupts are disabled, but it is likely to (at the very least) cause confusion. This commit therefore moves the resetting of ->gpwrap to follow the updating of ->gp_seq. While in the area, it also resets ->gp_seq_needed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:56 -07:00
Paul E. McKenney	d72193123c	rcutorture: Correctly handle grace-period sequence wrap The new ->gq_seq grace-period sequence numbers must be shifted down, which give artifacts when these numbers wrap. This commit therefore enables rcutorture and rcuperf to handle grace-period sequence numbers even if they do wrap. It does this by allowing a special subtraction function to be specified, and this function subtracts before shifting. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:55 -07:00
Paul E. McKenney	2e3e5e5501	rcu: Make rcu_start_this_gp() check for grace period already started In the old days of ->gpnum and ->completed, the code requesting a new grace period checked to see if that grace period had already started, bailing early if so. The new-age ->gp_seq approach instead checks whether the grace period has already finished. A compensating change pushed the requested grace period down to the bottom of the tree, thus reducing lock contention and even eliminating it in some cases. But why not further reduce contention, especially on large systems, by doing both, especially given that the cost of doing both is extremely small? This commit therefore adds a new rcu_seq_started() function that checks whether a specified grace period has already started. It then uses this new function in place of rcu_seq_done() in the rcu_start_this_gp() function's funnel locking code. Reported-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:54 -07:00
Joel Fernandes (Google)	5ca0905f67	rcu: Fix cpustart tracepoint gp_seq number The "cpustart" trace event shows a stale gp_seq. This is because it uses rdp->gp_seq, which is updated only at the end of the __note_gp_changes() function. This commit therefore instead uses rnp->gp_seq. An alternative fix would be to update rdp->gp_seq earlier, but this would break RCU's detection of the beginning of a new-to-this-CPU grace period. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:53 -07:00
Joel Fernandes (Google)	5b55072f22	rcu: Produce last "CleanupMore" trace only if late-breaking request Currently Tree RCU's clean-up code emits a "CleanupMore" trace event in response to late-arriving grace-period requests even if the grace period was already requested. This makes "CleanupMore" show up an extra time (in addition to once for each rcu_node structure that was previously marked with the request), and for no good reason. This commit therefore avoids emitting this trace message unless the the only request for this next grace period arrived during or after the cleanup scan of the rcu_node structures. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:53 -07:00
Paul E. McKenney	a2165e4168	rcu: Don't funnel-lock above leaf node if GP in progress The old grace-period start code would acquire only the leaf's rcu_node structure's ->lock if that structure believed that a grace period was in progress. The new code advances to the leaf's parent in this case, needlessly acquiring then leaf's parent's ->lock. This commit therefore checks the grace-period state after marking the leaf with the need for the specified grace period, and if the leaf believes that a grace period is in progress, takes an early exit. Reported-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Add "Startedleaf" tracing as suggested by Joel Fernandes. ]	2018-07-12 15:38:52 -07:00
Paul E. McKenney	e44e73ca47	rcu: Make simple callback acceleration refer to rdp->gp_seq_needed Now that the rcu_data structure contains ->gp_seq_needed, create an rcu_accelerate_cbs_unlocked() helper function that locklessly checks to see if new callbacks' required grace period has already been requested. If so, update the callback list locally and again locklessly. (Though interrupts must be and are disabled to avoid racing with conflicting updates in interrupt handlers.) Otherwise, call rcu_accelerate_cbs() as before. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:49 -07:00
Paul E. McKenney	ff3bb6f4d0	rcu: Remove ->gpnum and ->completed Now that everything has been converted to use ->gp_seq instead of ->gpnum and ->completed, this commit removes ->gpnum and ->completed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:48 -07:00
Paul E. McKenney	fee5997c17	rcu: Convert rcu_fqs tracepoint to ->gp_seq This commit makes the rcu_fqs tracepoint use ->gp_seq instead of ->gpnum. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:47 -07:00
Paul E. McKenney	db023296f0	rcu: Convert rcu_quiescent_state_report tracepoint to ->gp_seq This commit makes the rcu_quiescent_state_report tracepoint use ->gp_seq instead of ->gpnum. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:47 -07:00
Paul E. McKenney	865aa1e08d	rcu: Convert rcu_unlock_preempted_task tracepoint to ->gp_seq This commit makes the rcu_unlock_preempted_task tracepoint use ->gp_seq instead of ->gpnum. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:46 -07:00
Paul E. McKenney	598ce09480	rcu: Convert rcu_preempt_task tracepoint to ->gp_seq This commit makes the rcu_preempt_task tracepoint use ->gp_seq instead of ->gpnum. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:45 -07:00
Paul E. McKenney	abd13fdd95	rcu: Convert rcu_future_grace_period tracepoint to gp_seq This commit makes the rcu_future_grace_period tracepoint use gp_seq instead of ->gpnum and ->completed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:44 -07:00
Paul E. McKenney	477351f782	rcu: Convert rcu_grace_period tracepoint to gp_seq This commit makes the rcu_grace_period tracepoint use gp_seq instead of ->gpnum or ->completed. It also introduces a "cpuofl-bgp" string to less obscurely indicate when a CPU has gone offline while a grace period is waiting on it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:38:43 -07:00
Paul E. McKenney	ab5e869c1f	rcu: Make rcu_nocb_wait_gp() check if GP already requested This commit makes rcu_nocb_wait_gp() check rdp->gp_seq_needed to see if the current CPU already knows about the needed grace period having already been requested. If so, it avoids acquiring the corresponding leaf rcu_node structure's ->lock, thus decreasing contention. This optimization is intended for cases where either multiple leader rcuo kthreads are running on the same CPU or these kthreads are running on a non-offloaded (e.g., housekeeping) CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Move lock release past "if" as suggested by Joel Fernandes. ] [ paulmck: Fix caching of furthest-future requested grace period. ]	2018-07-12 15:38:42 -07:00
Paul E. McKenney	7a1d0f23ad	rcu: Move from ->need_future_gp[] to ->gp_seq_needed One problem with the ->need_future_gp[] array is that the grace-period assignment of each element changes as the grace periods complete. This means that it is necessary to hold a lock when checking this array to learn if a given grace period has already been requested. This increase lock contention, which is the opposite of helpful. This commit therefore replaces the ->need_future_gp[] with a single ->gp_seq_needed value and keeps it updated in the rcu_data structure. This will enable reliable lockless checking of whether or not a given grace period has already been requested. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 15:37:48 -07:00
Paul E. McKenney	aebc82644b	rcutorture: Convert rcutorture_get_gp_data() to ->gp_seq SRCU has long used ->srcu_gp_seq, and now RCU uses ->gp_seq. This commit therefore moves the rcutorture_get_gp_data() function from a ->gpnum / ->completed pair to ->gp_seq. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:57 -07:00
Paul E. McKenney	471f87c3d9	rcu: Make RCU CPU stall warnings use ->gp_seq This commit makes the RCU CPU stall-warning code in print_other_cpu_stall(), print_cpu_stall(), and check_cpu_stall() use ->gp_seq instead of ->gpnum and ->completed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:56 -07:00
Paul E. McKenney	29365e563b	rcu: Convert grace-period requests to ->gp_seq This commit converts the grace-period request code paths from ->completed and ->gpnum to ->gp_seq. The need_future_gp_element() macro encapsulates the shift operation required to use ->gp_seq as an index to the ->need_future_gp[] array. The rcu_cbs_completed() function is removed in favor of the rcu_seq_snap() function. The rcu_start_this_gp() gets some temporary consistency checks and uses rcu_seq_done(), rcu_seq_current(), rcu_seq_state(), and rcu_gp_in_progress() in place of the earlier open-coded comparisons of ->gpnum and ->completed. The rcu_future_gp_cleanup() function replaces use of ->completed with ->gp_seq. The rcu_accelerate_cbs() function replaces a call to rcu_cbs_completed() with one to rcu_seq_snap(). The rcu_advance_cbs() function replaces an access to >completed with one to ->gp_seq and adds some temporary warnings. The rcu_nocb_wait_gp() function replaces a call to rcu_cbs_completed() with one to rcu_seq_snap() and an open-coded comparison with rcu_seq_done(). The temporary warnings will be removed when the various ->gpnum and ->completed fields are removed. Their purpose is to locate code who might still be using ->gpnum and ->completed. (Much easier that way than trying to trace down the causes of too-short grace periods and grace-period hangs!) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:55 -07:00
Paul E. McKenney	d43a5d32e1	rcu: Convert ->completedqs to ->gp_seq This commit switches the quiescent-state no-backtracking checks from ->gpnum and ->completed to ->gp_seq. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:54 -07:00
Paul E. McKenney	8aa670cdac	rcu: Convert ->rcu_iw_gpnum to ->gp_seq This commit switches the interrupt-disabled detection mechanism to ->gp_seq. This mechanism is used as part of RCU CPU stall warnings, and detects cases where the stall is due to a CPU having interrupts disabled. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:53 -07:00
Paul E. McKenney	ba04107fc9	rcu: Move rcu_gp_in_progress() to ->gp_seq This commit makes rcu_gp_in_progress() use ->gp_seq instead of ->completed and ->gpnum. The READ_ONCE() invocations are buried in rcu_seq_current(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:52 -07:00
Paul E. McKenney	e0da2374c3	rcu: Move rcu_nocb_gp_get() to ->gp_seq This commit makes rcu_try_advance_all_cbs() use ->gp_seq. It uses rcu_seq_ctr() in order to shift away the state bits, so that the low-order bits of the result may safely be used to index ->nocb_gp_wq[]. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:52 -07:00
Paul E. McKenney	03c8cb765a	rcu: Move rcu_try_advance_all_cbs() to ->gp_seq This commit makes rcu_try_advance_all_cbs() use ->gp_seq, with the exception of tracing, which will be converted later. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:51 -07:00
Paul E. McKenney	e05720b097	rcu: Move rcu_implicit_dynticks_qs() to ->gp_seq This commit makes rcu_implicit_dynticks_qs() use ->gp_seq, with the exception of tracing, which will be converted later. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:51 -07:00
Paul E. McKenney	a66ae8ae35	rcu: Convert rcu_gpnum_ovf() to ->gp_seq This commit converts rcu_gpnum_ovf() to use ->gp_seq instead of ->gpnum. Same size unsigned long, so same approach. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:50 -07:00
Paul E. McKenney	67e14c1e39	rcu: Move RCU's grace-period-change code to ->gp_seq This commit moves __note_gp_changes(), note_gp_changes(), and __rcu_pending() to ->gp_seq, creating new rcu_seq_completed_gp() and rcu_seq_new_gp() functions for this purpose. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Reinstate "cpuend: trace as suggested by Joel Fernandes. ]	2018-07-12 14:27:50 -07:00
Paul E. McKenney	e4be81a2ed	rcu: Convert conditional grace-period primitives to ->gp_seq This commit converts get_state_synchronize_rcu(), cond_synchronize_rcu(), get_state_synchronize_sched(), and cond_synchronize_sched() from ->gpnum and ->completed to ->gp_seq. Note that this also introduces a full memory barrier in the already-done paths off cond_synchronize_rcu() and cond_synchronize_sched(), as work with LKMM indicates that the earlier smp_load_acquire() were insufficiently strong in some situations where these two functions were called just as the grace period ended. In such cases, these two functions would not gain the benefit of memory ordering at the end of the grace period. Please note that the performance impact is negligible, as you shouldn't be using either function anywhere near a fastpath in any case. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:49 -07:00
Paul E. McKenney	c9a24e2d0c	rcu: Make quiescent-state reporting use ->gp_seq This commit switches the functions reporting quiescent states from use of ->gpnum to ->gp_seq. In either case, the point is to handle races where a given grace period ends before a quiescent state can be reported. Failing to catch these races would result in too-short grace periods, hence the checking. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:48 -07:00
Paul E. McKenney	78c5a67f17	rcu: Convert rcu_check_gp_kthread_starvation() to GP sequence number This commit switches rcu_check_gp_kthread_starvation() from printing ->gpnum and ->completed to printing ->gp_seq upon detecting a starving RCU grace-period kthread during an RCU CPU stall warning. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:48 -07:00
Paul E. McKenney	17ef2fe97c	rcu: Make rcutorture's batches-completed API use ->gp_seq The rcutorture test invokes rcu_batches_started(), rcu_batches_completed(), rcu_batches_started_bh(), rcu_batches_completed_bh(), rcu_batches_started_sched(), and rcu_batches_completed_sched() to do grace-period consistency checks, and rcuperf uses the _completed variants for statistics. These functions use ->gpnum and ->completed. This commit therefore replaces them with rcu_get_gp_seq(), rcu_bh_get_gp_seq(), and rcu_sched_get_gp_seq(), adjusting rcutorture and rcuperf to make use of them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:47 -07:00
Paul E. McKenney	dee4f42298	rcu: Move rcu_gp_slow() to ->gp_seq This commit moves rcu_gp_slow() to ->gp_seq. This function only uses the grace-period number to modulate delay, so rcu_seq_ctr(rsp->gp_seq) gets the same effect, at least in cases where the delay is to happen more than four times per wrap of an unsigned long. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:46 -07:00
Paul E. McKenney	de30ad512a	rcu: Introduce grace-period sequence numbers This commit adds grace-period sequence numbers (->gp_seq) to the rcu_state, rcu_node, and rcu_data structures, and updates them. It also checks for consistency between rsp->gpnum and rsp->gp_seq. These ->gp_seq counters will eventually replace the existing ->gpnum and ->completed counters, allowing a single memory access to determine whether or not a grace period is in progress and if so, which one. This in turn will enable changes that will reduce ->lock contention on the leaf rcu_node structures. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:27:46 -07:00
Paul E. McKenney	609af1cdf0	Merge branches 'expedited.2018.07.12a', 'fixes.2018.07.12a', 'srcu.2018.06.25b' and 'torture.2018.06.25b' into HEAD expedited.2018.07.12a: Expedited grace-period updates. fixes.2018.07.12a: Pre-gp_seq miscellaneous fixes. srcu.2018.06.25b: SRCU updates. torture.2018.06.25b: Pre-gp_seq torture-test updates.	2018-07-12 14:26:14 -07:00
Paul E. McKenney	18390aeae7	rcu: Make rcu_gp_cleanup() write only once to ->gp_flags At the end of rcu_gp_cleanup(), if another grace period is needed, but not via rcu_accelerate_cbs(), the ->gp_flags field is written twice, once when making the new grace-period request, and once when clearing all other types of requests. This commit therefore adds an else-clause to avoid this double write. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-07-12 14:25:17 -07:00
Paul E. McKenney	26d950a945	rcu: Diagnostics for grace-period startup hangs This commit causes a splat if RCU is idle and a request for a new grace period is ignored for more than one second. This splat normally indicates that some code path asked for a new grace period, but failed to wake up the RCU grace-period kthread. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Fix bug located by Dan Carpenter and his static checker. ] [ paulmck: Fix self-deadlock bug located 0day test robot. ] [ paulmck: Disable unless CONFIG_PROVE_RCU=y. ]	2018-07-12 14:24:42 -07:00
Daniel Borkmann	c7a8978432	bpf: don't leave partial mangled prog in jit_subprogs error path syzkaller managed to trigger the following bug through fault injection: [...] [ 141.043668] verifier bug. No program starts at insn 3 [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [ 141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51 [ 141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014 [ 141.049877] Call Trace: [ 141.050324] __dump_stack lib/dump_stack.c:77 [inline] [ 141.050324] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 [ 141.050950] ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60 [ 141.051837] panic+0x238/0x4e7 kernel/panic.c:184 [ 141.052386] ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385 [ 141.053101] ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537 [ 141.053814] ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530 [ 141.054506] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.054506] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.054506] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [ 141.055163] __warn.cold.8+0x163/0x1ba kernel/panic.c:538 [ 141.055820] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.055820] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.055820] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [...] What happens in jit_subprogs() is that kcalloc() for the subprog func buffer is failing with NULL where we then bail out. Latter is a plain return -ENOMEM, and this is definitely not okay since earlier in the loop we are walking all subprogs and temporarily rewrite insn->off to remember the subprog id as well as insn->imm to temporarily point the call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing out in such state and handing this over to the interpreter is troublesome since later/subsequent e.g. find_subprog() lookups are based on wrong insn->imm. Therefore, once we hit this point, we need to jump to out_free path where we undo all changes from earlier loop, so that interpreter can work on unmodified insn->{off,imm}. Another point is that should find_subprog() fail in jit_subprogs() due to a verifier bug, then we also should not simply defer the program to the interpreter since also here we did partial modifications. Instead we should just bail out entirely and return an error to the user who is trying to load the program. Fixes: `1c2a088a66` ("bpf: x64: add JIT support for multi-function programs") Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-07-12 14:00:54 -07:00
Thomas Gleixner	c6bb11147e	Merge branch 'fortglx/4.19/time' of https://git.linaro.org/people/john.stultz/linux into timers/core Pull timekeeping updates from John Stultz: - Make the timekeeping update more precise when NTP frequency is set directly by updating the multiplier. - Adjust selftests	2018-07-12 22:19:58 +02:00
Boqun Feng	fcc6354365	rcu: Make expedited GPs handle CPU 0 being offline Currently, the parallelized initialization of expedited grace periods uses the workqueue associated with each rcu_node structure's ->grplo field. This works fine unless that CPU is offline. This commit therefore uses the CPU corresponding to the lowest-numbered online CPU, or just queues the work on WORK_CPU_UNBOUND if there are no online CPUs corresponding to this rcu_node structure. Note that this patch uses cpu_is_offline() instead of the usual approach of checking bits in the rcu_node structure's ->qsmaskinitnext field. This is safe because preemption is disabled across both the cpu_is_offline() check and the call to queue_work_on(). Signed-off-by: Boqun Feng <boqun.feng@gmail.com> [ paulmck: Disable preemption to close offline race window. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Apply Peter Zijlstra feedback on CPU selection. ] Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2018-07-12 12:36:06 -07:00
Geert Uytterhoeven	7a6e55375d	hrtimer: Improve kernel message printing - Join split message for easier grepping, - Use pr_() instead of printk(), - Use %u to format unsigned cpu numbers. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20180712144118.8819-1-geert+renesas@glider.be	2018-07-12 21:29:30 +02:00
Okash Khawaja	b65f370d06	bpf: btf: Fix bitfield extraction for big endian When extracting bitfield from a number, btf_int_bits_seq_show() builds a mask and accesses least significant byte of the number in a way specific to little-endian. This patch fixes that by checking endianness of the machine and then shifting left and right the unneeded bits. Thanks to Martin Lau for the help in navigating potential pitfalls when dealing with endianess and for the final solution. Fixes: `b00b8daec8` ("bpf: btf: Add pretty print capability for data with BTF type info") Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-11 22:36:08 +02:00
Linus Torvalds	c25c74b747	This fixes a memory leak in the kprobe code. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW0ZhphQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qlxwAP4jKOp86ljtdqHSlh9mOHChnvwdMprQ kaQ82WiamGikIwD/Vo/6ynsVl5eYMDO2VQSv6W1DHmhV9I/KgdqcQBl8jAk= =RF9o -----END PGP SIGNATURE----- Merge tag 'trace-v4.18-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull kprobe fix from Steven Rostedt: "This fixes a memory leak in the kprobe code" * tag 'trace-v4.18-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/kprobe: Release kprobe print_fmt properly	2018-07-11 13:03:51 -07:00
Jiri Olsa	0fc8c3581d	tracing/kprobe: Release kprobe print_fmt properly We don't release tk->tp.call.print_fmt when destroying local uprobe. Also there's missing print_fmt kfree in create_local_trace_kprobe error path. Link: http://lkml.kernel.org/r/20180709141906.2390-1-jolsa@kernel.org Cc: stable@vger.kernel.org Fixes: `e12f03d703` ("perf/core: Implement the 'perf_kprobe' PMU") Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-11 15:50:52 -04:00
Steven Rostedt (VMware)	e4f8d81c73	cgroup/tracing: Move taking of spin lock out of trace event handlers It is unwise to take spin locks from the handlers of trace events. Mainly, because they can introduce lockups, because it introduces locks in places that are normally not tested. Worse yet, because trace events are tucked away in the include/trace/events/ directory, locks that are taken there are forgotten about. As a general rule, I tell people never to take any locks in a trace event handler. Several cgroup trace event handlers call cgroup_path() which eventually takes the kernfs_rename_lock spinlock. This injects the spinlock in the code without people realizing it. It also can cause issues for the PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event handlers are called with preemption disabled. By moving the calculation of the cgroup_path() out of the trace event handlers and into a macro (surrounded by a trace_cgroup_##type##_enabled()), then we could place the cgroup_path into a string, and pass that to the trace event. Not only does this remove the taking of the spinlock out of the trace event handler, but it also means that the cgroup_path() only needs to be called once (it is currently called twice, once to get the length to reserver the buffer for, and once again to get the path itself. Now it only needs to be done once. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2018-07-11 10:48:47 -07:00
Petr Mladek	ffaa619af1	printk: Fix warning about unused suppress_message_printing suppress_message_printing() is not longer called in console_unlock(). Therefore it is not longer needed with disabled CONFIG_PRINTK. This fixes the warning: kernel/printk/printk.c:2033:13: warning: ‘suppress_message_printing’ defined but not used [-Wunused-function] static bool suppress_message_printing(int level) { return false; } Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2018-07-11 10:38:23 +02:00
Mathieu Desnoyers	ec9c82e03a	rseq: uapi: Declare rseq_cs field as union, update includes Declaring the rseq_cs field as a union between __u64 and two __u32 allows both 32-bit and 64-bit kernels to read the full __u64, and therefore validate that a 32-bit user-space cleared the upper 32 bits, thus ensuring a consistent behavior between native 32-bit kernels and 32-bit compat tasks on 64-bit kernels. Check that the rseq_cs value read is < TASK_SIZE. The asm/byteorder.h header needs to be included by rseq.h, now that it is not using linux/types_32_64.h anymore. Considering that only __32 and __u64 types are declared in linux/rseq.h, the linux/types.h header should always be included for both kernel and user-space code: including stdint.h is just for u64 and u32, which are not used in this header at all. Use copy_from_user()/clear_user() to interact with a 64-bit field, because arm32 does not implement 64-bit __get_user, and ppc32 does not 64-bit get_user. Considering that the rseq_cs pointer does not need to be loaded/stored with single-copy atomicity from the kernel anymore, we can simply use copy_from_user()/clear_user(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com	2018-07-10 22:18:52 +02:00
Mathieu Desnoyers	0fb9a1abc8	rseq: uapi: Update uapi comments Update rseq uapi header comments to reflect that user-space need to do thread-local loads/stores from/to the struct rseq fields. As a consequence of this added requirement, the kernel does not need to perform loads/stores with single-copy atomicity. Update the comment associated to the "flags" fields to describe more accurately that it's only useful to facilitate single-stepping through rseq critical sections with debuggers. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com	2018-07-10 22:18:52 +02:00
Mathieu Desnoyers	8f28177014	rseq: Use get_user/put_user rather than __get_user/__put_user __get_user()/__put_user() is used to read values for address ranges that were already checked with access_ok() on rseq registration. It has been recognized that __get_user/__put_user are optimizing the wrong thing. Replace them by get_user/put_user across rseq instead. If those end up showing up in benchmarks, the proper approach would be to use user_access_begin() / unsafe_{get,put}_user() / user_access_end() anyway. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: linux-arm-kernel@lists.infradead.org Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com	2018-07-10 22:18:52 +02:00
Mathieu Desnoyers	e96d71359e	rseq: Use __u64 for rseq_cs fields, validate user inputs Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather that ignoring the 32 upper bits on 32-bit kernels. This ensures we have a consistent behavior for a 32-bit binary executed on 32-bit kernels and in compat mode on 64-bit kernels. Validating the value of abort_ip field to be below TASK_SIZE ensures the kernel don't return to an invalid address when returning to userspace after an abort. I don't fully trust each architecture code to consistently deal with invalid return addresses. Validating the value of the start_ip and post_commit_offset fields prevents overflow on arithmetic performed on those values, used to check whether abort_ip is within the rseq critical section. If validation fails, the process is killed with a segmentation fault. When the signature encountered before abort_ip does not match the expected signature, return -EINVAL rather than -EPERM to be consistent with other input validation return codes from rseq_get_rseq_cs(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com	2018-07-10 22:18:51 +02:00
Sudeep Holla	5b5ccbc2b0	Revert "tick: Prefer a lower rating device only if it's CPU local device" This reverts commit `1332a90558`. The original issue was not because of incorrect checking of cpumask for both new and old tick device. It was incorrectly analysed was due to the misunderstanding of the comment and misinterpretation of the return value from tick_check_preferred. The main issue is with the clockevent driver that sets the cpumask to cpu_all_mask instead of cpu_possible_mask. Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kevin Hilman <khilman@baylibre.com> Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/1531151136-18297-1-git-send-email-sudeep.holla@arm.com	2018-07-10 22:12:47 +02:00
Miroslav Lichvar	b061c7a513	timekeeping: Update multiplier when NTP frequency is set directly When the NTP frequency is set directly from userspace using the ADJ_FREQUENCY or ADJ_TICK timex mode, immediately update the timekeeper's multiplier instead of waiting for the next tick. This removes a hidden non-deterministic delay in setting of the frequency and allows an extremely tight control of the system clock with update rates close to or even exceeding the kernel HZ. The update is limited to archs using modern timekeeping (!ARCH_USES_GETTIMEOFFSET). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2018-07-10 12:44:25 -07:00
Liu Bo	e1a413245a	Blktrace: bail out early if block debugfs is not configured Since @blk_debugfs_root couldn't be configured dynamically, we can save a few memory allocation if it's not there. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:52 -06:00
Petr Mladek	03fc7f9c99	printk/nmi: Prevent deadlock when accessing the main log buffer in NMI The commit `719f6a7040` ("printk: Use the main logbuf in NMI when logbuf_lock is available") brought back the possible deadlocks in printk() and NMI. The check of logbuf_lock is done only in printk_nmi_enter() to prevent mixed output. But another CPU might take the lock later, enter NMI, and: + Both NMIs might be serialized by yet another lock, for example, the one in nmi_cpu_backtrace(). + The other CPU might get stopped in NMI, see smp_send_stop() in panic(). The only safe solution is to use trylock when storing the message into the main log-buffer. It might cause reordering when some lines go to the main lock buffer directly and others are delayed via the per-CPU buffer. It means that it is not useful in general. This patch replaces the problematic NMI deferred context with NMI direct context. It can be used to mark a code that might produce many messages in NMI and the risk of losing them is more critical than problems with eventual reordering. The context is then used when dumping trace buffers on oops. It was the primary motivation for the original fix. Also the reordering is even smaller issue there because some traces have their own time stamps. Finally, nmi_cpu_backtrace() need not longer be serialized because it will always us the per-CPU buffers again. Fixes: `719f6a7040` ("printk: Use the main logbuf in NMI when logbuf_lock is available") Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180627142028.11259-1-pmladek@suse.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2018-07-09 14:10:40 +02:00
Petr Mladek	a338f84dc1	printk: Create helper function to queue deferred console handling It is just a preparation step. The patch does not change the existing behavior. Link: http://lkml.kernel.org/r/20180627140817.27764-3-pmladek@suse.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2018-07-09 14:08:13 +02:00
Petr Mladek	ba55239995	printk: Split the code for storing a message into the log buffer It is just a preparation step. The patch does not change the existing behavior. Link: http://lkml.kernel.org/r/20180627140817.27764-2-pmladek@suse.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2018-07-09 13:53:31 +02:00
Petr Mladek	8599dc7dec	printk: Clean up syslog_print_all() syslog_print_all() is called twice. Once with a valid buffer and once just to set the indexes. Both variants are already handled separately. This patch just makes it more obvious. It does not change the existing behavior. Link: http://lkml.kernel.org/r/20180627150641.p56xyy6mdzvnfpig@pathway.suse.cz Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Namit Gupta <gupta.namit@samsung.com> Cc: linux-kernel@vger.kernel.org Cc: pankaj.m@samsung.com Cc: a.sahrawat@samsung.com Cc: himanshu.m@samsung.com Signed-off-by: Petr Mladek <pmladek@suse.com>	2018-07-09 13:38:03 +02:00
Thomas Gleixner	215af5499d	cpu/hotplug: Online siblings when SMT control is turned on Writing 'off' to /sys/devices/system/cpu/smt/control offlines all SMT siblings. Writing 'on' merily enables the abilify to online them, but does not online them automatically. Make 'on' more useful by onlining all offline siblings. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2018-07-09 09:19:18 +02:00
Linus Torvalds	6fb2489d7f	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: - The hopefully final fix for the reported race problems in kthread_parkme(). The previous attempt still left a hole and was partially wrong. - Plug a race in the remote tick mechanism which triggers a warning about updates not being done correctly. That's a false positive if the race condition is hit as the remote CPU is idle. Plug it by checking the condition again when holding run queue lock. - Fix a bug in the utilization estimation of a run queue which causes the estimation to be 0 when a run queue is throttled. - Advance the global expiration of the period timer when the timer is restarted after a idle period. Otherwise the expiry time is stale and the timer fires prematurely. - Cure the drift between the bandwidth timer and the runqueue accounting, which leads to bogus throttling of runqueues - Place the call to cpufreq_update_util() correctly so the function will observe the correct number of running RT tasks and not a stale one. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kthread, sched/core: Fix kthread_parkme() (again...) sched/util_est: Fix util_est_dequeue() for throttled cfs_rq sched/fair: Advance global expiration when period timer is restarted sched/fair: Fix bandwidth timer clock drift condition sched/rt: Fix call to cpufreq_update_util() sched/nohz: Skip remote tick on idle task entirely	2018-07-08 12:41:23 -07:00
Toshiaki Makita	d8d7218ad8	xdp: XDP_REDIRECT should check IFF_UP and MTU Otherwise we end up with attempting to send packets from down devices or to send oversized packets, which may cause unexpected driver/device behaviour. Generic XDP has already done this check, so reuse the logic in native XDP. Fixes: `814abfabef` ("xdp: add bpf_redirect helper function") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-07-07 15:25:35 -07:00
John Fastabend	0ea488ff8d	bpf: sockmap, convert bpf_compute_data_pointers to bpf__sk_skb In commit 'bpf: bpf_compute_data uses incorrect cb structure' (`8108a77515`) we added the routine bpf_compute_data_end_sk_skb() to compute the correct data_end values, but this has since been lost. In kernel v4.14 this was correct and the above patch was applied in it entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree we lost the piece that renamed bpf_compute_data_pointers to the new function bpf_compute_data_end_sk_skb. This was done here, `e1ea2f9856` ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") When it conflicted with the following rename patch, `6aaae2b6c4` ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers") Finally, after a refactor I thought even the function bpf_compute_data_end_sk_skb() was no longer needed and it was erroneously removed. However, we never reverted the sk_skb_convert_ctx_access() usage of tcp_skb_cb which had been committed and survived the merge conflict. Here we fix this by adding back the helper and _data_end_sk_skb() usage. Using the bpf_skc_data_end mapping is not correct because it expects a qdisc_skb_cb object but at the sock layer this is not the case. Even though it happens to work here because we don't overwrite any data in-use at the socket layer and the cb structure is cleared later this has potential to create some subtle issues. But, even more concretely the filter.c access check uses tcp_skb_cb. And by some act of chance though, struct bpf_skb_data_end { struct qdisc_skb_cb qdisc_cb; /* 0 28 / / XXX 4 bytes hole, try to pack / void data_meta; /* 32 8 / void data_end; /* 40 8 / / size: 48, cachelines: 1, members: 3 / / sum members: 44, holes: 1, sum holes: 4 / / last cacheline: 48 bytes / }; and then tcp_skb_cb, struct tcp_skb_cb { [...] struct { __u32 flags; / 24 4 / struct sock sk_redir; /* 32 8 / void data_end; /* 40 8 / } bpf; / 24 */ }; So when we use offset_of() to track down the byte offset we get 40 in either case and everything continues to work. Fix this mess and use correct structures its unclear how long this might actually work for until someone moves the structs around. Reported-by: Martin KaFai Lau <kafai@fb.com> Fixes: `e1ea2f9856` ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") Fixes: `6aaae2b6c4` ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-07-07 15:19:30 -07:00
John Fastabend	7ebc14d507	bpf: sockmap, consume_skb in close path Currently, when a sock is closed and the bpf_tcp_close() callback is used we remove memory but do not free the skb. Call consume_skb() if the skb is attached to the buffer. Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com Fixes: `1aa12bdf1b` ("bpf: sockmap, add sock close() hook to remove socks") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-07-07 15:19:30 -07:00
John Fastabend	99ba2b5aba	bpf: sockhash, disallow bpf_tcp_close and update in parallel After latest lock updates there is no longer anything preventing a close and recvmsg call running in parallel. Additionally, we can race update with close if we close a socket and simultaneously update if via the BPF userspace API (note the cgroup ops are already run with sock_lock held). To resolve this take sock_lock in close and update paths. Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com Fixes: `e9db4ef6bf` ("bpf: sockhash fix omitted bucket lock in sock_close") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-07-07 15:19:30 -07:00
John Fastabend	1d1ef005db	bpf: sockmap, hash table is RCU so readers do not need locks This removes locking from readers of RCU hash table. Its not necessary. Fixes: `8111038444` ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-07-07 15:16:58 -07:00
John Fastabend	547b3aa451	bpf: sockmap, error path can not release psock in multi-map case The current code, in the error path of sock_hash_ctx_update_elem, checks if the sock has a psock in the user data and if so decrements the reference count of the psock. However, if the error happens early in the error path we may have never incremented the psock reference count and if the psock exists because the sock is in another map then we may inadvertently decrement the reference count. Fix this by making the error path only call smap_release_sock if the error happens after the increment. Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com Fixes: `8111038444` ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-07-07 15:16:58 -07:00
Linus Torvalds	97f4e14229	While cleaning out my INBOX, I found a few patches that were lost in the noise. These are minor bug fixes and clean ups. Those include: - Avoiding a string overflow - Code that didn't match the comment (but should) - A small code optimization (use of a conditional) - Quieting printf warnings - Nuking unused code - Fixing function graph interrupt annotation -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWz7ARhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qmMqAQDTS7uvFLRR603WXyOazX6Y7FeiYFWp MUUZjnbG9u0bawEAulW53AM0OL3EAAaZKtPi8VtsT+uktR1GIynXrp+yoww= =yQDv -----END PGP SIGNATURE----- Merge tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes and cleanups from Steven Rostedt: "While cleaning out my INBOX, I found a few patches that were lost in the noise. These are minor bug fixes and clean ups. Those include: - avoid a string overflow - code that didn't match the comment (but should) - a small code optimization (use of a conditional) - quiet printf warnings - nuke unused code - fix function graph interrupt annotation" * tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix missing return symbol in function_graph output ftrace: Nuke clear_ftrace_function tracing: Use __printf markup to silence compiler tracing: Optimize trace_buffer_iter() logic tracing: Make create_filter() code match the comments tracing: Avoid string overflow	2018-07-05 19:29:07 -07:00
Dave Airlie	a1c3b49523	drm-misc-next for 4.19: UAPI Changes: v3d: add fourcc modicfier for fourcc for the Broadcom UIF format (Eric Anholt) Cross-subsystem Changes: console/fbcon: Add support for deferred console takeover (Hans de Goede) Core Changes: dma-fence clean up, improvements and docs (Daniel Vetter) add mask function for crtc, plane, encoder and connector DRM objects(Ville Syrjälä) Driver Changes: pl111: add Nomadik LCDC variant (Linus Walleij) -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJbPVtiAAoJEEN0HIUfOBk0V6gP+gM0uASgBw7rW3L+zKZEISZ7 O2xFyQ/+qaA41Kmv9fWh1abz8HK509PYQEPhbmbZpfjceMhx6b9aYgH1iom1eGVl 7FgTAzazympaagNde5Eik1jrQG0UHvS9At9oyfnlQTEnvJnRJneMI+1bMt/Q2bYt NLGIp21xcEOhPMAhV1dW5aC/+vLdIHsbc68MeeqqQH6e+f+3+DKjaW7bPom4lgP+ EQVh+76zGVkCmUvYMqkz1yWHGCjZmfQp1/UuTXWNwz/W5Wp2+HEVPHuIcSnSyJ4B rCOnPSw0+K6Y1DjbjO4bmQ59UHzBtWScpLgnX0oGeYaU3d9evhHhOOGUm+l7bV3o wTf7hnpFAigz/9tEDFJ5pJRaVC0ak2fVP8d7i3khJAb1o9WAVAzGIS5B5yXp8eep EuzR6WWwwhq+buYu/BeTvR/kjnooBmuNP9MBbctkmA55CydUfMp4hfhnY7GF66/C zf4HPYVgX13F8gAcBnYgvy45m1haE4VsqNySO0foC5+GWx8j9bofVzuH0QN+GE9K kcV2bSHDDNB7lfp53nNou0sj9A+UCkZMR22p8s0QCWuhawxeASTv3P6xWf+M120Y /7NMLJmZGQj9H+5blUD2bS168actr5z21EdtjPo331Kv43KFyY7mozyJEHBowPyP x4PeXDwWDra6qbrXmQVp =A/cb -----END PGP SIGNATURE----- Merge tag 'drm-misc-next-2018-07-04' of git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 4.19: UAPI Changes: v3d: add fourcc modicfier for fourcc for the Broadcom UIF format (Eric Anholt) Cross-subsystem Changes: console/fbcon: Add support for deferred console takeover (Hans de Goede) Core Changes: dma-fence clean up, improvements and docs (Daniel Vetter) add mask function for crtc, plane, encoder and connector DRM objects(Ville Syrjälä) Driver Changes: pl111: add Nomadik LCDC variant (Linus Walleij) Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180704234641.GA3981@juma	2018-07-06 10:01:56 +10:00
Dave Airlie	c5be9b5403	Merge branch 'vmwgfx-next' of git://people.freedesktop.org/~thomash/linux into drm-next A patchset worked out together with Peter Zijlstra. Ingo is OK with taking it through the DRM tree: This is a small fallout from a work to allow batching WW mutex locks and unlocks. Our Wound-Wait mutexes actually don't use the Wound-Wait algorithm but the Wait-Die algorithm. One could perhaps rename those mutexes tree-wide to "Wait-Die mutexes" or "Deadlock Avoidance mutexes". Another approach suggested here is to implement also the "Wound-Wait" algorithm as a per-WW-class choice, as it has advantages in some cases. See for example http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html Now Wound-Wait is a preemptive algorithm, and the preemption is implemented using a lazy scheme: If a wounded transaction is about to go to sleep on a contended WW mutex, we return -EDEADLK. That is sufficient for deadlock prevention. Since with WW mutexes we also require the aborted transaction to sleep waiting to lock the WW mutex it was aborted on, this choice also provides a suitable WW mutex to sleep on. If we were to return -EDEADLK on the first WW mutex lock after the transaction was wounded whether the WW mutex was contended or not, the transaction might frequently be restarted without a wait, which is far from optimal. Note also that with the lazy preemption scheme, contrary to Wait-Die there will be no rollbacks on lock contention of locks held by a transaction that has completed its locking sequence. The modeset locks are then changed from Wait-Die to Wound-Wait since the typical locking pattern of those locks very well matches the criterion for a substantial reduction in the number of rollbacks. For reservation objects, the benefit is more unclear at this point and they remain using Wait-Die. Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180703105339.4461-1-thellstrom@vmware.com	2018-07-06 08:47:14 +10:00
Konrad Rzeszutek Wilk	26acfb666a	x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present If the L1TF CPU bug is present we allow the KVM module to be loaded as the major of users that use Linux and KVM have trusted guests and do not want a broken setup. Cloud vendors are the ones that are uncomfortable with CVE 2018-3620 and as such they are the ones that should set nosmt to one. Setting 'nosmt' means that the system administrator also needs to disable SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line parameter, or via the /sys/devices/system/cpu/smt/control. See commit `05736e4ac1` ("cpu/hotplug: Provide knobs to control SMT"). Other mitigations are to use task affinity, cpu sets, interrupt binding, etc - anything to make sure that _only_ the same guests vCPUs are running on sibling threads. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2018-07-04 20:49:38 +02:00
Changbin Du	1fe4293f4b	tracing: Fix missing return symbol in function_graph output The function_graph tracer does not show the interrupt return marker for the leaf entry. On leaf entries, we see an unbalanced interrupt marker (the interrupt was entered, but nevern left). Before: 1) \| SyS_write() { 1) \| __fdget_pos() { 1) 0.061 us \| __fget_light(); 1) 0.289 us \| } 1) \| vfs_write() { 1) 0.049 us \| rw_verify_area(); 1) + 15.424 us \| __vfs_write(); 1) ==========> \| 1) 6.003 us \| smp_apic_timer_interrupt(); 1) 0.055 us \| __fsnotify_parent(); 1) 0.073 us \| fsnotify(); 1) + 23.665 us \| } 1) + 24.501 us \| } After: 0) \| SyS_write() { 0) \| __fdget_pos() { 0) 0.052 us \| __fget_light(); 0) 0.328 us \| } 0) \| vfs_write() { 0) 0.057 us \| rw_verify_area(); 0) \| __vfs_write() { 0) ==========> \| 0) 8.548 us \| smp_apic_timer_interrupt(); 0) <========== \| 0) + 36.507 us \| } /* __vfs_write */ 0) 0.049 us \| __fsnotify_parent(); 0) 0.066 us \| fsnotify(); 0) + 50.064 us \| } 0) + 50.952 us \| } Link: http://lkml.kernel.org/r/1517413729-20411-1-git-send-email-changbin.du@intel.com Cc: stable@vger.kernel.org Fixes: `f8b755ac8e` ("tracing/function-graph-tracer: Output arrows signal on hardirq call/return") Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-03 18:47:11 -04:00
Yisheng Xie	5ccba64a56	ftrace: Nuke clear_ftrace_function clear_ftrace_function is not used outside of ftrace.c and is not help to use a function, so nuke it per Steve's suggestion. Link: http://lkml.kernel.org/r/1517537689-34947-1-git-send-email-xieyisheng1@huawei.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-03 18:33:19 -04:00
Mathieu Malaterre	26b68dd2f4	tracing: Use __printf markup to silence compiler Silence warnings (triggered at W=1) by adding relevant __printf attributes. CC kernel/trace/trace.o kernel/trace/trace.c: In function ‘__trace_array_vprintk’: kernel/trace/trace.c:2979:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format] len = vscnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args); ^~~ AR kernel/trace/built-in.o Link: http://lkml.kernel.org/r/20180308205843.27447-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-03 18:32:04 -04:00
yuan linyu	f26808ba72	tracing: Optimize trace_buffer_iter() logic Simplify and optimize the logic in trace_buffer_iter() to use a conditional operation instead of an if conditional. Link: http://lkml.kernel.org/r/20180408113631.3947-1-cugyly@163.com Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-03 18:23:33 -04:00
Steven Rostedt (VMware)	f90658725b	tracing: Make create_filter() code match the comments The comment in create_filter() states that the passed in filter pointer (filterp) will either be NULL or contain an error message stating why the filter failed. But it also expects the filter pointer to point to NULL when passed in. If it is not, the function create_filter_start() will warn and return an error message without updating the filter pointer. This is not what the comment states. As we always expect the pointer to point to NULL, if it is not, trigger a WARN_ON(), set it to NULL, and then continue the path as the rest will work as the comment states. Also update the comment to state it must point to NULL. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-03 18:14:40 -04:00
Arnd Bergmann	cf4d418e65	tracing: Avoid string overflow 'err' is used as a NUL-terminated string, but using strncpy() with the length equal to the buffer size may result in lack of the termination: kernel/trace/trace_events_hist.c: In function 'hist_err_event': kernel/trace/trace_events_hist.c:396:3: error: 'strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation] strncpy(err, var, MAX_FILTER_STR_VAL); This changes it to use the safer strscpy() instead. Link: http://lkml.kernel.org/r/20180328140920.2842153-1-arnd@arndb.de Cc: stable@vger.kernel.org Fixes: `f404da6e1d` ("tracing: Add 'last error' error facility for hist triggers") Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-07-03 18:14:39 -04:00
Mauricio Vasquez B	ed2b82c03d	bpf: hash map: decrement counter on error Decrement the number of elements in the map in case the allocation of a new node fails. Fixes: `6c90598174` ("bpf: pre-allocate hash map elements") Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-03 23:26:28 +02:00
Arnd Bergmann	c72051d577	audit: use ktime_get_coarse_ts64() for time access The API got renamed for consistency with the other time accessors, this changes the audit caller as well. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-07-03 10:12:54 -04:00
Thomas Hellstrom	08295b3b5b	locking: Implement an algorithm choice for Wound-Wait mutexes The current Wound-Wait mutex algorithm is actually not Wound-Wait but Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait is, contrary to Wait-Die a preemptive algorithm and is known to generate fewer backoffs. Testing reveals that this is true if the number of simultaneous contending transactions is small. As the number of simultaneous contending threads increases, Wait-Wound becomes inferior to Wait-Die in terms of elapsed time. Possibly due to the larger number of held locks of sleeping transactions. Update documentation and callers. Timings using git://people.freedesktop.org/~thomash/ww_mutex_test tag patch-18-06-15 Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly chosen out of 100000. Four core Intel x86_64: Algorithm #threads Rollbacks time Wound-Wait 4 ~100 ~17s. Wait-Die 4 ~150000 ~19s. Wound-Wait 16 ~360000 ~109s. Wait-Die 16 ~450000 ~82s. Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Sean Paul <seanpaul@chromium.org> Cc: David Airlie <airlied@linux.ie> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-doc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Co-authored-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:44:36 +02:00
Peter Ziljstra	55f036ca7e	locking: WW mutex cleanup Make the WW mutex code more readable by adding comments, splitting up functions and pointing out that we're actually using the Wait-Die algorithm. Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Sean Paul <seanpaul@chromium.org> Cc: David Airlie <airlied@linux.ie> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-doc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Co-authored-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Acked-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:42:40 +02:00
Peter Zijlstra	f83ee19be4	kthread: Simplify kthread_park() completion Oleg explains the reason we could hit park+park is that smpboot_update_cpumask_percpu_thread()'s for_each_cpu_and(cpu, &tmp, cpu_online_mask) smpboot_park_kthread(); turns into: for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)and) smpboot_park_kthread(); on UP, ignoring the mask. But since we just completely removed that function, this is no longer relevant. So revert commit: `b1f5b378e1` ("kthread: Allow kthread_park() on a parked kthread") Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:20:44 +02:00
Peter Zijlstra	167a88677b	smpboot: Remove cpumask from the API Now that the sole use of the whole smpboot_*cpumask() API is gone, remove it. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:20:44 +02:00
Peter Zijlstra	9cf57731b6	watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work Oleg suggested to replace the "watchdog/%u" threads with cpu_stop_work. That removes one thread per CPU while at the same time fixes softlockup vs SCHED_DEADLINE. But more importantly, it does away with the single smpboot_update_cpumask_percpu_thread() user, which allows cleanups/shrinkage of the smpboot interface. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:20:43 +02:00
Ingo Molnar	4520843dfa	Merge branch 'sched/urgent' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:20:22 +02:00
Peter Zijlstra	1cef1150ef	kthread, sched/core: Fix kthread_parkme() (again...) Gaurav reports that commit: `85f1abe001` ("kthread, sched/wait: Fix kthread_parkme() completion issue") isn't working for him. Because of the following race: > controller Thread CPUHP Thread > takedown_cpu > kthread_park > kthread_parkme > Set KTHREAD_SHOULD_PARK > smpboot_thread_fn > set Task interruptible > > > wake_up_process > if (!(p->state & state)) > goto out; > > Kthread_parkme > SET TASK_PARKED > schedule > raw_spin_lock(&rq->lock) > ttwu_remote > waiting for __task_rq_lock > context_switch > > finish_lock_switch > > > > Case TASK_PARKED > kthread_park_complete > > > SET Running Furthermore, Oleg noticed that the whole scheduler TASK_PARKED handling is buggered because the TASK_DEAD thing is done with preemption disabled, the current code can still complete early on preemption :/ So basically revert that earlier fix and go with a variant of the alternative mentioned in the commit. Promote TASK_PARKED to special state to avoid the store-store issue on task->state leading to the WARN in kthread_unpark() -> __kthread_bind(). But in addition, add wait_task_inactive() to kthread_park() to ensure the task really is PARKED when we return from kthread_park(). This avoids the whole kthread still gets migrated nonsense -- although it would be really good to get this done differently. Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `85f1abe001` ("kthread, sched/wait: Fix kthread_parkme() completion issue") Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:17:30 +02:00
Vincent Guittot	3482d98bbc	sched/util_est: Fix util_est_dequeue() for throttled cfs_rq When a cfs_rq is throttled, parent cfs_rq->nr_running is decreased and everything happens at cfs_rq level. Currently util_est stays unchanged in such case and it keeps accounting the utilization of throttled tasks. This can somewhat make sense as we don't dequeue tasks but only throttled cfs_rq. If a task of another group is enqueued/dequeued and root cfs_rq becomes idle during the dequeue, util_est will be cleared whereas it was accounting util_est of throttled tasks before. So the behavior of util_est is not always the same regarding throttled tasks and depends of side activity. Furthermore, util_est will not be updated when the cfs_rq is unthrottled as everything happens at cfs_rq level. Main results is that util_est will stay null whereas we now have running tasks. We have to wait for the next dequeue/enqueue of the previously throttled tasks to get an up to date util_est. Remove the assumption that cfs_rq's estimated utilization of a CPU is 0 if there is no running task so the util_est of a task remains until the latter is dequeued even if its cfs_rq has been throttled. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `7f65ea42eb` ("sched/fair: Add util_est on top of PELT") Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:17:30 +02:00
Xunlei Pang	f1d1be8aee	sched/fair: Advance global expiration when period timer is restarted When period gets restarted after some idle time, start_cfs_bandwidth() doesn't update the expiration information, expire_cfs_rq_runtime() will see cfs_rq->runtime_expires smaller than rq clock and go to the clock drift logic, wasting needless CPU cycles on the scheduler hot path. Update the global expiration in start_cfs_bandwidth() to avoid frequent expire_cfs_rq_runtime() calls once a new period begins. Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:17:29 +02:00
Xunlei Pang	512ac999d2	sched/fair: Fix bandwidth timer clock drift condition I noticed that cgroup task groups constantly get throttled even if they have low CPU usage, this causes some jitters on the response time to some of our business containers when enabling CPU quotas. It's very simple to reproduce: mkdir /sys/fs/cgroup/cpu/test cd /sys/fs/cgroup/cpu/test echo 100000 > cpu.cfs_quota_us echo $$ > tasks then repeat: cat cpu.stat \| grep nr_throttled # nr_throttled will increase steadily After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b\|q}->runtime_expires" after period timer is re-armed. The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condtion can never hit. The orginal design was correctly done by this commit: `a9cf55b286` ("sched: Expire invalid runtime") ... but was changed to be the current implementation due to its locking bug. This patch introduces another way, it adds a new field in both structures cfs_rq and cfs_bandwidth to record the expiration update sequence, and uses them to figure out if clock drift happens (true if they are equal). Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `51f2176d74` ("sched/fair: Fix unlocked reads of some cfs_b->quota/period") Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:17:29 +02:00
Vincent Guittot	296b2ffe7f	sched/rt: Fix call to cpufreq_update_util() With commit: `8f111bc357` ("cpufreq/schedutil: Rewrite CPUFREQ_RT support") the schedutil governor uses rq->rt.rt_nr_running to detect whether an RT task is currently running on the CPU and to set frequency to max if necessary. cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is called so schedutil still considers that an RT task is running when the last task is dequeued. The update of rq->rt.rt_nr_running happens later in dequeue_rt_stack(). In fact, we can take advantage of the sequence that the dequeue then re-enqueue rt entities when a rt task is enqueued or dequeued; As a result enqueue_top_rt_rq() is always called when a task is enqueued or dequeued and also when groups are throttled or unthrottled. The only place that not use enqueue_top_rt_rq() is when root rt_rq is throttled. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: juri.lelli@redhat.com Cc: patrick.bellasi@arm.com Cc: viresh.kumar@linaro.org Fixes: `8f111bc357` ('cpufreq/schedutil: Rewrite CPUFREQ_RT support') Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:17:28 +02:00
Frederic Weisbecker	d9c0ffcabd	sched/nohz: Skip remote tick on idle task entirely Some people have reported that the warning in sched_tick_remote() occasionally triggers, especially in favour of some RCU-Torture pressure: WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0 Modules linked in: CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Workqueue: events_unbound sched_tick_remote RIP: 0010:sched_tick_remote+0xb6/0xc0 Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb +c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6 Call Trace: process_one_work+0x1df/0x3b0 worker_thread+0x44/0x3d0 kthread+0xf3/0x130 ? set_worker_desc+0xb0/0xb0 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x35/0x40 This happens when the remote tick applies on an idle task. Usually the idle_cpu() check avoids that, but it is performed before we lock the runqueue and it is therefore racy. It was intended to be that way in order to prevent from useless runqueue locks since idle task tick callback is a no-op. Now if the racy check slips out of our hands and we end up remotely ticking an idle task, the empty task_tick_idle() is harmless. Still it won't pass the WARN_ON_ONCE() test that ensures rq_clock_task() is not too far from curr->se.exec_start because update_curr_idle() doesn't update the exec_start value like other scheduler policies. Hence the reported false positive. So let's have another check, while the rq is locked, to make sure we don't remote tick on an idle task. The lockless idle_cpu() still applies to avoid unecessary rq lock contention. Reported-by: Jacek Tomaka <jacekt@dug.com> Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1530203381-31234-1-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-03 09:17:28 +02:00
Linus Torvalds	4e33d7d479	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Verify netlink attributes properly in nf_queue, from Eric Dumazet. 2) Need to bump memory lock rlimit for test_sockmap bpf test, from Yonghong Song. 3) Fix VLAN handling in lan78xx driver, from Dave Stevenson. 4) Fix uninitialized read in nf_log, from Jann Horn. 5) Fix raw command length parsing in mlx5, from Alex Vesker. 6) Cleanup loopback RDS connections upon netns deletion, from Sowmini Varadhan. 7) Fix regressions in FIB rule matching during create, from Jason A. Donenfeld and Roopa Prabhu. 8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren. 9) More bpfilter build fixes/adjustments from Masahiro Yamada. 10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper Dangaard Brouer. 11) fib_tests.sh file permissions were broken, from Shuah Khan. 12) Make sure BH/preemption is disabled in data path of mac80211, from Denis Kenzior. 13) Don't ignore nla_parse_nested() return values in nl80211, from Johannes berg. 14) Properly account sock objects ot kmemcg, from Shakeel Butt. 15) Adjustments to setting bpf program permissions to read-only, from Daniel Borkmann. 16) TCP Fast Open key endianness was broken, it always took on the host endiannness. Whoops. Explicitly make it little endian. From Yuching Cheng. 17) Fix prefix route setting for link local addresses in ipv6, from David Ahern. 18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva. 19) Various bpf sockmap fixes, from John Fastabend. 20) Use after free for GRO with ESP, from Sabrina Dubroca. 21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from Eric Biggers. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits) qede: Adverstise software timestamp caps when PHC is not available. qed: Fix use of incorrect size in memcpy call. qed: Fix setting of incorrect eswitch mode. qed: Limit msix vectors in kdump kernel to the minimum required count. ipvlan: call dev_change_flags when ipvlan mode is reset ipv6: sr: fix passing wrong flags to crypto_alloc_shash() net: fix use-after-free in GRO with ESP tcp: prevent bogus FRTO undos with non-SACK flows bpf: sockhash, add release routine bpf: sockhash fix omitted bucket lock in sock_close bpf: sockmap, fix smap_list_map_remove when psock is in many maps bpf: sockmap, fix crash when ipv6 sock is added net: fib_rules: bring back rule_exists to match rule during add hv_netvsc: split sub-channel setup into async and sync net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN atm: zatm: Fix potential Spectre v1 s390/qeth: consistently re-enable device features s390/qeth: don't clobber buffer on async TX completion s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6] s390/qeth: fix race when setting MAC address ...	2018-07-02 11:18:28 -07:00
Mauro Carvalho Chehab	ea272257cc	docs: histogram.txt: convert it to ReST file format Despite being mentioned at Documentation/trace/ftrace.rst as a rst file, this file was still a text one, with several issues. Convert it to ReST and add it to the trace index: - Mark the document title as such; - Identify and indent the literal blocks; - Use the proper markups for table. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2018-07-02 11:26:02 -06:00
Chengguang Xu	d5641c64c4	PM / hibernate: cast PAGE_SIZE to int when comparing with error code If PAGE_SIZE is unsigned type then negative error code will be larger than PAGE_SIZE. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-07-02 11:48:30 +02:00
Jessica Yu	f314dfea16	modsign: log module name in the event of an error Now that we have the load_info struct all initialized (including info->name, which contains the name of the module) before module_sig_check(), make the load_info struct and hence module name available to mod_verify_sig() so that we can log the module name in the event of an error. Signed-off-by: Jessica Yu <jeyu@kernel.org>	2018-07-02 11:36:17 +02:00
Thomas Gleixner	5f936e19cc	alarmtimer: Prevent overflow for relative nanosleep Air Icy reported: UBSAN: Undefined behaviour in kernel/time/alarmtimer.c:811:7 signed integer overflow: 1529859276030040771 + 9223372036854775807 cannot be represented in type 'long long int' Call Trace: alarm_timer_nsleep+0x44c/0x510 kernel/time/alarmtimer.c:811 __do_sys_clock_nanosleep kernel/time/posix-timers.c:1235 [inline] __se_sys_clock_nanosleep kernel/time/posix-timers.c:1213 [inline] __x64_sys_clock_nanosleep+0x326/0x4e0 kernel/time/posix-timers.c:1213 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 alarm_timer_nsleep() uses ktime_add() to add the current time and the relative expiry value. ktime_add() has no sanity checks so the addition can overflow when the relative timeout is large enough. Use ktime_add_safe() which has the necessary sanity checks in place and limits the result to the valid range. Fixes: `9a7adcf5c6` ("timers: Posix interface for alarm-timers") Reported-by: Team OWL337 <icytxw@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1807020926360.1595@nanos.tec.linutronix.de	2018-07-02 11:33:26 +02:00
Thomas Gleixner	78c9c4dfbf	posix-timers: Sanitize overrun handling The posix timer overrun handling is broken because the forwarding functions can return a huge number of overruns which does not fit in an int. As a consequence timer_getoverrun(2) and siginfo::si_overrun can turn into random number generators. The k_clock::timer_forward() callbacks return a 64 bit value now. Make k_itimer::ti_overrun[_last] 64bit as well, so the kernel internal accounting is correct. 3Remove the temporary (int) casts. Add a helper function which clamps the overrun value returned to user space via timer_getoverrun(2) or siginfo::si_overrun limited to a positive value between 0 and INT_MAX. INT_MAX is an indicator for user space that the overrun value has been clamped. Reported-by: Team OWL337 <icytxw@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Link: https://lkml.kernel.org/r/20180626132705.018623573@linutronix.de	2018-07-02 11:33:25 +02:00
Thomas Gleixner	6fec64e1c9	posix-timers: Make forward callback return s64 The posix timer ti_overrun handling is broken because the forwarding functions can return a huge number of overruns which does not fit in an int. As a consequence timer_getoverrun(2) and siginfo::si_overrun can turn into random number generators. As a first step to address that let the timer_forward() callbacks return the full 64 bit value. Cast it to (int) temporarily until k_itimer::ti_overrun is converted to 64bit and the conversion to user space visible values is sanitized. Reported-by: Team OWL337 <icytxw@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Link: https://lkml.kernel.org/r/20180626132704.922098090@linutronix.de	2018-07-02 11:33:25 +02:00
Thomas Gleixner	0cc3cd2165	cpu/hotplug: Boot HT siblings at least once Due to the way Machine Check Exceptions work on X86 hyperthreads it's required to boot up _all_ logical cores at least once in order to set the CR4.MCE bit. So instead of ignoring the sibling threads right away, let them boot up once so they can configure themselves. After they came out of the initial boot stage check whether its a "secondary" sibling and cancel the operation which puts the CPU back into offline state. Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tony Luck <tony.luck@intel.com>	2018-07-02 11:25:29 +02:00
Linus Torvalds	c350d6d1d7	Add a missing export required by riscv and unicore -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAls4b+ILHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYNrVw//SiXXoTDekxaQmyExQcEhac4ZzTjERoDFaE9zo14D x7iwdOXFQN9pRJwtUX70QDxQxc8iJT1j6CXMRUJZL9ZKdEOVbqfX2k7P8W4TJkhU id2PeZTxmreWOKL8EJB1mEjckZxP/qQLKBp9qwNET6rPf18WmSDb6wQuUcaLgDhl n4UJrzXuiCaEQNor30BvGDM4jQ4IOFh2JVjWqwPubcgPgOoopZ0TWSPeAxRo38MH 2hDTSMFEV/QqM/slDhHxJacwFyMc2mWxyHg66/YGzTmX/ZciABq2eMiTwhSMumSp wDP/4/9TvaM6QGPkWblTFr18sRECBaNo59e5Lz0g/KtJCXwIeAiBqCKaIct6CYAf hHdSaxvYzPgIOs91uSgGbhtlweg3TdjSRJIbB8k1xkub7uAnT0hpeHnqPy4DUfjy DB25YTt0Qabnmfyz37UpbCQN0PPYY4TLvZc9N81WXiKW2BOIogvNeiCNbS06rvXa wxn/haocbMSbh3k41jxbkrFBKiALiz1QaZRgH0omdOvXSN//WgI8qAt6F8begDc8 dBu5aTu50oDv6kLoe4aONTH72Jrt/q6xBI9dN5cD73luBm1p+QyoYhekHUPNMef7 7ek/lBJLo1Xq+bEuqgnaPZRC7mWYGTvnEe+r0uYW3cOw+PGx/VNfPc6B43J/YaCS Hv4= =IUWC -----END PGP SIGNATURE----- Merge tag 'dma-mapping-4.18-2' of git://git.infradead.org/users/hch/dma-mapping Pull dma mapping fixlet from Christoph Hellwig: "Add a missing export required by riscv and unicore" * tag 'dma-mapping-4.18-2' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: export swiotlb_dma_ops	2018-07-01 10:45:13 -07:00
David S. Miller	271b955e52	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2018-07-01 The following pull-request contains BPF updates for your net tree. The main changes are: 1) A bpf_fib_lookup() helper fix to change the API before freeze to return an encoding of the FIB lookup result and return the nexthop device index in the params struct (instead of device index as return code that we had before), from David. 2) Various BPF JIT fixes to address syzkaller fallout, that is, do not reject progs when set_memory_*() fails since it could still be RO. Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was an issue, and a memory leak in s390 JIT found during review, from Daniel. 3) Multiple fixes for sockmap/hash to address most of the syzkaller triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(), a missing sock_map_release() routine to hook up to callbacks, and a fix for an omitted bucket lock in sock_close(), from John. 4) Two bpftool fixes to remove duplicated error message on program load, and another one to close the libbpf object after program load. One additional fix for nfp driver's BPF offload to avoid stopping offload completely if replace of program failed, from Jakub. 5) Couple of BPF selftest fixes that bail out in some of the test scripts if the user does not have the right privileges, from Jeffrin. 6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set where we need to set the flag that some of the test cases are expected to fail, from Kleber. 7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF since it has no relation to it and lirc2 users often have configs without cgroups enabled and thus would not be able to use it, from Sean. 8) Fix a selftest failure in sockmap by removing a useless setrlimit() call that would set a too low limit where at the same time we are already including bpf_rlimit.h that does the job, from Yonghong. 9) Fix BPF selftest config with missing missing NET_SCHED, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-01 09:27:44 +09:00
John Fastabend	caac76a517	bpf: sockhash, add release routine Add map_release_uref pointer to hashmap ops. This was dropped when original sockhash code was ported into bpf-next before initial commit. Fixes: `8111038444` ("bpf: sockmap, add hash map support") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-01 01:21:32 +02:00
John Fastabend	e9db4ef6bf	bpf: sockhash fix omitted bucket lock in sock_close First the sk_callback_lock() was being used to protect both the sock callback hooks and the psock->maps list. This got overly convoluted after the addition of sockhash (in sockmap it made some sense because masp and callbacks were tightly coupled) so lets split out a specific lock for maps and only use the callback lock for its intended purpose. This fixes a couple cases where we missed using maps lock when it was in fact needed. Also this makes it easier to follow the code because now we can put the locking closer to the actual code its serializing. Next, in sock_hash_delete_elem() the pattern was as follows, sock_hash_delete_elem() [...] spin_lock(bucket_lock) l = lookup_elem_raw() if (l) hlist_del_rcu() write_lock(sk_callback_lock) .... destroy psock ... write_unlock(sk_callback_lock) spin_unlock(bucket_lock) The ordering is necessary because we only know the {p}sock after dereferencing the hash table which we can't do unless we have the bucket lock held. Once we have the bucket lock and the psock element it is deleted from the hashmap to ensure any other path doing a lookup will fail. Finally, the refcnt is decremented and if zero the psock is destroyed. In parallel with the above (or free'ing the map) a tcp close event may trigger tcp_close(). Which at the moment omits the bucket lock altogether (oops!) where the flow looks like this, bpf_tcp_close() [...] write_lock(sk_callback_lock) for each psock->maps // list of maps this sock is part of hlist_del_rcu(ref_hash_node); .... destroy psock ... write_unlock(sk_callback_lock) Obviously, and demonstrated by syzbot, this is broken because we can have multiple threads deleting entries via hlist_del_rcu(). To fix this we might be tempted to wrap the hlist operation in a bucket lock but that would create a lock inversion problem. In summary to follow locking rules the psocks maps list needs the sk_callback_lock (after this patch maps_lock) but we need the bucket lock to do the hlist_del_rcu. To resolve the lock inversion problem pop the head of the maps list repeatedly and remove the reference until no more are left. If a delete happens in parallel from the BPF API that is OK as well because it will do a similar action, lookup the lock in the map/hash, delete it from the map/hash, and dec the refcnt. We check for this case before doing a destroy on the psock to ensure we don't have two threads tearing down a psock. The new logic is as follows, bpf_tcp_close() e = psock_map_pop(psock->maps) // done with map lock bucket_lock() // lock hash list bucket l = lookup_elem_raw(head, hash, key, key_size); if (l) { //only get here if elmnt was not already removed hlist_del_rcu() ... destroy psock... } bucket_unlock() And finally for all the above to work add missing locking around map operations per above. Then add RCU annotations and use rcu_dereference/rcu_assign_pointer to manage values relying on RCU so that the object is not free'd from sock_hash_free() while it is being referenced in bpf_tcp_close(). Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com Fixes: `8111038444` ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-01 01:21:32 +02:00
John Fastabend	54fedb42c6	bpf: sockmap, fix smap_list_map_remove when psock is in many maps If a hashmap is free'd with open socks it removes the reference to the hash entry from the psock. If that is the last reference to the psock then it will also be free'd by the reference counting logic. However the current logic that removes the hash reference from the list of references is broken. In smap_list_remove() we first check if the sockmap entry matches and then check if the hashmap entry matches. But, the sockmap entry sill always match because its NULL in this case which causes the first entry to be removed from the list. If this is always the "right" entry (because the user adds/removes entries in order) then everything is OK but otherwise a subsequent bpf_tcp_close() may reference a free'd object. To fix this create two list handlers one for sockmap and one for sockhash. Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com Fixes: `8111038444` ("bpf: sockmap, add hash map support") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-01 01:21:31 +02:00
John Fastabend	9901c5d77e	bpf: sockmap, fix crash when ipv6 sock is added This fixes a crash where we assign tcp_prot to IPv6 sockets instead of tcpv6_prot. Previously we overwrote the sk->prot field with tcp_prot even in the AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot are used. Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously crashing case here. Fixes: `174a79ff95` ("bpf: sockmap with sk redirect support") Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-01 01:21:31 +02:00
Daniel Borkmann	85782e037f	bpf: undo prog rejection on read-only lock failure Partially undo commit `9facc33687` ("bpf: reject any prog that failed read-only lock") since it caused a regression, that is, syzkaller was able to manage to cause a panic via fault injection deep in set_memory_ro() path by letting an allocation fail: In x86's __change_page_attr_set_clr() it was able to change the attributes of the primary mapping but not in the alias mapping via cpa_process_alias(), so the second, inner call to the __change_page_attr() via __change_page_attr_set_clr() had to split a larger page and failed in the alloc_pages() with the artifically triggered allocation error which is then propagated down to the call site. Thus, for set_memory_ro() this means that it returned with an error, but from debugging a probe_kernel_write() revealed EFAULT on that memory since the primary mapping succeeded to get changed. Therefore the subsequent hdr->locked = 0 reset triggered the panic as it was performed on read-only memory, so call-site assumptions were infact wrong to assume that it would either succeed /or/ not succeed at all since there's no such rollback in set_memory_() calls from partial change of mappings, in other words, we're left in a state that is "half done". A later undo via set_memory_rw() is succeeding though due to matching permissions on that part (aka due to the try_preserve_large_page() succeeding). While reproducing locally with explicitly triggering this error, the initial splitting only happens on rare occasions and in real world it would additionally need oom conditions, but that said, it could partially fail. Therefore, it is definitely wrong to bail out on set_memory_ro() error and reject the program with the set_memory_() semantics we have today. Shouldn't have gone the extra mile since no other user in tree today infact checks for any set_memory_() errors, e.g. neither module_enable_ro() / module_disable_ro() for module RO/NX handling which is mostly default these days nor kprobes core with alloc_insn_page() / free_insn_page() as examples that could be invoked long after bootup and original `314beb9bca` ("x86: bpf_jit_comp: secure bpf jit against spraying attacks") did neither when it got first introduced to BPF so "improving" with bailing out was clearly not right when set_memory_() cannot handle it today. Kees suggested that if set_memory_() can fail, we should annotate it with __must_check, and all callers need to deal with it gracefully given those set_memory_() markings aren't "advisory", but they're expected to actually do what they say. This might be an option worth to move forward in future but would at the same time require that set_memory_*() calls from supporting archs are guaranteed to be "atomic" in that they provide rollback if part of the range fails, once that happened, the transition from RW -> RO could be made more robust that way, while subsequent RO -> RW transition /must/ continue guaranteeing to always succeed the undo part. Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com Fixes: `9facc33687` ("bpf: reject any prog that failed read-only lock") Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-06-29 10:47:35 -07:00
Richard Guy Briggs	4fa7f08699	audit: simplify audit_enabled check in audit_watch_log_rule_change() Check the audit_enabled flag and bail immediately. This does not change the functionality, but brings the code format in line with similar checks in audit_tree_log_remove_rule(), audit_mark_log_rule_change(), and elsewhere in the audit code. See: https://github.com/linux-audit/audit-kernel/issues/50 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: tweaked subject line] Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-28 11:44:31 -04:00
Richard Guy Briggs	65a8766f5f	audit: check audit_enabled in audit_tree_log_remove_rule() Respect the audit_enabled flag when printing tree rule config change records. See: https://github.com/linux-audit/audit-kernel/issues/50 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: tweak the subject line] Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-28 11:41:02 -04:00
Hans de Goede	d48de54a9d	printk: Export is_console_locked This is a preparation patch for adding a number of WARN_CONSOLE_UNLOCKED() calls to the fbcon code, which may be built as a module (event though usually it is not). Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>	2018-06-28 15:20:27 +02:00
Christoph Hellwig	210d0797c9	swiotlb: export swiotlb_dma_ops For architectures that do not use per-device dma ops we need to export the dma_map_ops structure returned from get_arch_dma_ops(). Fixes: `10314e09` ("riscv: add swiotlb support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Andreas Schwab <schwab@suse.de>	2018-06-28 14:00:40 +02:00
Namit Gupta	63842c2134	printk: Remove unnecessary kmalloc() from syslog during clear When the request is only for clearing logs, there is no need for allocation/deallocation. Only the indexes need to be reset and returned. Rest of the patch is mostly made up of changes because of indention. Link: http://lkml.kernel.org/r/20180620135951epcas5p3bd2a8f25ec689ca333bce861b527dba2~54wyKcT0_3155531555epcas5p3y@epcas5p3.samsung.com Cc: linux-kernel@vger.kernel.org Cc: pankaj.m@samsung.com Cc: a.sahrawat@samsung.com Signed-off-by: Namit Gupta <gupta.namit@samsung.com> Signed-off-by: Himanshu Maithani <himanshu.m@samsung.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2018-06-27 16:16:28 +02:00
Maninder Singh	375899cddc	printk: make sure to print log on console. This patch make sure printing of log on console if loglevel at time of storing log is less than current console loglevel. @why In SMP printk can work asynchronously, logs can be missed on console because it checks current log level at time of console_unlock, not at time of storing logs. func() { .... .... console_verbose(); // user wants to have all the logs on console. pr_alert(); dump_backtrace(); //prints with default loglevel. ... console_silent(); // stop all logs from printing on console. } Now if console_lock was owned by another process, the messages might be handled after the consoles were silenced. Reused flag LOG_NOCONS as its usage is gone long back by the commit `5c2992ee7f` ("printk: remove console flushing special cases for partial buffered lines"). Note that there are still some corner cases where this patch is not enough. For example, when the messages are flushed later from printk_safe buffers or when there are races between console_verbose() and console_silent() callers. Link: http://lkml.kernel.org/r/20180601090029epcas5p3cc93d4bfbebb3199f0a2684058da7e26~z-a_jkmrI2993329933epcas5p3q@epcas5p3.samsung.com Cc: linux-kernel@vger.kernel.org Cc: a.sahrawat@samsung.com Cc: pankaj.m@samsung.com Cc: v.narang@samsung.com Cc: <maninder1.s@samsung.com> Signed-off-by: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2018-06-27 16:14:28 +02:00
Amir Goldstein	36f10f55ff	fsnotify: let connector point to an abstract object Make the code to attach/detach a connector to object more generic by letting the fsnotify connector point to an abstract fsnotify_connp_t. Code that needs to dereference an inode or mount object now uses the helpers fsnotify_conn_{inode,mount}. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2018-06-27 13:45:05 +02:00
Mathieu Malaterre	9331510135	perf/core: Move inline keyword at the beginning of declaration Fix non-fatal warning triggered during compilation with W=1: kernel/events/core.c:6106:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] static void __always_inline ^~~~~~ Signed-off-by: Mathieu Malaterre <malat@debian.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180626202301.20270-1-malat@debian.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-27 09:55:58 +02:00
Paul E. McKenney	8c42b1f39f	rcu: Exclude near-simultaneous RCU CPU stall warnings There is a two-jiffy delay between the time that a CPU will self-report an RCU CPU stall warning and the time that some other CPU will report a warning on behalf of the first CPU. This has worked well in the past, but on busy systems, it is possible for the two warnings to overlap, which makes interpreting them extremely difficult. This commit therefore uses a cmpxchg-based timing decision that allows only one report in a given one-minute period (assuming default stall-warning Kconfig parameters). This approach will of course fail if you are seeing minute-long vCPU preemption, but in that case the overlapping RCU CPU stall warnings are the least of your worries. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-26 12:25:56 -07:00
Boqun Feng	ce11fae8d4	rcu: Use the proper lockdep annotation in dump_blkd_tasks() Sparse reported this: \| kernel/rcu/tree_plugin.h:814:9: warning: incorrect type in argument 1 (different modifiers) \| kernel/rcu/tree_plugin.h:814:9: expected struct lockdep_map const lock \| kernel/rcu/tree_plugin.h:814:9: got struct lockdep_map [noderef] <noident> This is caused by using vanilla lockdep annotations on rcu_node::lock, and that requires accessing ->lock of rcu_node directly. However we need to keep rcu_node::lock __private to avoid breaking its extra ordering guarantee. And we have a dedicated lockdep annotation for rcu_node::lock, so use it. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-26 12:25:55 -07:00
Paul E. McKenney	4bc8d55574	rcu: Add debugging info to assertion The WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp()) in rcu_gp_cleanup() triggers (inexplicably, of course) every so often. This commit therefore extracts more information. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-26 12:25:55 -07:00
Sean Young	fdb5c4531c	bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not possible to attach, detach or query IR BPF programs to /dev/lircN devices, making them impossible to use. For embedded devices, it should be possible to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled. This change requires some refactoring, since bpf_prog_{attach,detach,query} functions are now always compiled, but their code paths for cgroups need moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c, moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not require #ifdefs since that is already conditionally compiled. Fixes: `f4364dcfc8` ("media: rc: introduce BPF_PROG_LIRC_MODE2") Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-06-26 11:28:38 +02:00
Frederic Weisbecker	26c6ccdf5c	perf/hw_breakpoint: Clean up and consolidate modify_user_hw_breakpoint_check() Remove the dance around old and new attributes. Just don't modify the previous breakpoint at all until we have verified everything. Original-patch-by: Andy Lutomirski <luto@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joel Fernandes <joel.opensrc@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1529981939-8231-13-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-26 09:07:59 +02:00
Frederic Weisbecker	cb8b78815b	perf/hw_breakpoint: Pass new breakpoint type to modify_breakpoint_slot() We soon won't be able to rely on bp->attr anymore to get the new type of the modifying breakpoint because the new attributes are going to be copied only once we successfully modified the breakpoint slot. This will fix the current misdesigned layout where the new attr are copied to the modifying breakpoint before we actually know if the modification will be validated. In order to prepare for that, allow modify_breakpoint_slot() to take the new breakpoint type. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joel Fernandes <joel.opensrc@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1529981939-8231-12-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-26 09:07:59 +02:00
Frederic Weisbecker	cffbb3bd44	perf/hw_breakpoint: Remove default hw_breakpoint_arch_parse() All architectures have implemented it, we can now remove the poor weak version. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joel Fernandes <joel.opensrc@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1529981939-8231-11-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-26 09:07:58 +02:00
Frederic Weisbecker	8e983ff9ac	perf/hw_breakpoint: Pass arch breakpoint struct to arch_check_bp_in_kernelspace() We can't pass the breakpoint directly on arch_check_bp_in_kernelspace() anymore because its architecture internal datas (struct arch_hw_breakpoint) are not yet filled by the time we call the function, and most implementation need this backend to be up to date. So arrange the function to take the probing struct instead. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joel Fernandes <joel.opensrc@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1529981939-8231-3-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-26 09:07:54 +02:00
Frederic Weisbecker	9a4903dde2	perf/hw_breakpoint: Split attribute parse and commit arch_validate_hwbkpt_settings() mixes up attribute check and commit into a single code entity. Therefore the validation may return an error due to incorrect atributes while still leaving halfway modified architecture breakpoint data. This is harmless when we deal with a new breakpoint but it becomes a problem when we modify an existing breakpoint. Split attribute parse and commit to fix that. The architecture is passed a "struct arch_hw_breakpoint" to fill on top of the new attr and the core takes care about copying the backend data once it's fully validated. The architectures then need to implement the new API. Original-patch-by: Andy Lutomirski <luto@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joel Fernandes <joel.opensrc@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1529981939-8231-2-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-26 09:07:53 +02:00
Ingo Molnar	f446474889	Merge branch 'linus' into perf/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-26 09:02:41 +02:00
Paul E. McKenney	6050003763	torture: Keep old-school dmesg format This commit adds "#define pr_fmt(fmt) fmt" to the torture-test files in order to keep the current dmesg format. Once Joe's commits have hit mainline, these definitions will be changed in order to automatically generate the dmesg line prefix that the scripts expect. This will have the beneficial side-effect of allowing printk() formats to be used more widely and of shortening some pr_*() lines. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Joe Perches <joe@perches.com>	2018-06-25 11:30:10 -07:00
Paul E. McKenney	90127d605f	torture: Make online/offline messages appear only for verbose=2 Some bugs reproduce quickly only at high CPU-hotplug rates, so the rcutorture TREE03 scenario now has only 200 milliseconds spacing between CPU-hotplug operations. At this rate, the torture-test pair of console messages per operation becomes a bit voluminous. This commit therefore converts the torture-test set of "verbose" kernel-boot arguments from bool to int, and prints the extra console messages only when verbose=2. The default is still verbose=1. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-25 11:30:10 -07:00
Paul E. McKenney	5ab07a8df4	srcu: Add address of first callback to rcutorture output This commit adds the address of the first callback to the per-CPU rcutorture output in order to allow lost wakeups to be more efficiently tracked down. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-25 11:26:24 -07:00
Paul E. McKenney	17294ce6a4	srcu: Document that srcu_funnel_gp_start() implies srcu_funnel_exp_start() This commit updates the header comment of srcu_funnel_gp_start() to document the fact that srcu_funnel_gp_start() does the work of srcu_funnel_exp_start(), in some cases by invoking it directly. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-25 11:26:24 -07:00
Paul E. McKenney	5ef98a6328	srcu: Fix typos in __call_srcu() header comment This commit simply changes some copy-pasta call_rcu() instances to the correct call_srcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-25 11:26:24 -07:00
Paul E. McKenney	5257514d88	rcu: Make expedited grace period use direct call on last leaf During expedited grace-period initialization, a work item is scheduled for each leaf rcu_node structure. However, that initialization code is itself (normally) executing from a workqueue, so one of the leaf rcu_node structures could just as well be handled by that pre-existing workqueue, and with less overhead. This commit therefore uses a shiny new rcu_is_leaf_node() macro to execute the last leaf rcu_node structure's initialization directly from the pre-existing workqueue. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2018-06-25 11:25:41 -07:00
Masahiro Yamada	996302c5e8	module: replace VMLINUX_SYMBOL_STR() with __stringify() or string literal With the special case handling for Blackfin and Metag was removed by commit `94e58e0ac3` ("export.h: remove code for prefixing symbols with underscore"), VMLINUX_SYMBOL_STR() is now equivalent to __stringify(). Replace the remaining usages to prepare for the entire removal of VMLINUX_SYMBOL_STR(). Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>	2018-06-25 11:18:29 +02:00
Jason A. Donenfeld	62267e0ecc	module: print sensible error code Printing "err 0" to the user in the warning message is not particularly useful, especially when this gets transformed into a -ENOENT for the remainder of the call chain. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>	2018-06-25 10:37:08 +02:00
Christoph Hellwig	faef87723a	dma-noncoherent: add a arch_sync_dma_for_cpu_all hook The MIPS bmips platform needs a global flush when transferring ownership back to the CPU. Add a hook for that to the dma-noncoherent implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Patchwork: https://patchwork.linux-mips.org/patch/19549/ Signed-off-by: Paul Burton <paul.burton@mips.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: David Daney <david.daney@cavium.com> Cc: Kevin Cernekee <cernekee@gmail.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Tom Bogendoerfer <tsbogend@alpha.franken.de> Cc: Huacai Chen <chenhc@lemote.com> Cc: iommu@lists.linux-foundation.org Cc: linux-mips@linux-mips.org	2018-06-24 09:27:27 -07:00
Deepa Dinamani	6ff8473507	time: Change types to new y2038 safe __kernel_itimerspec timer_set/gettime and timerfd_set/get apis use struct itimerspec at the user interface layer. struct itimerspec is not y2038-safe. Change these interfaces to use y2038-safe struct __kernel_itimerspec instead. This will help define new syscalls when 32bit architectures select CONFIG_64BIT_TIME. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: viro@zeniv.linux.org.uk Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Cc: y2038@lists.linaro.org Link: https://lkml.kernel.org/r/20180617051144.29756-4-deepa.kernel@gmail.com	2018-06-24 14:39:47 +02:00
Deepa Dinamani	afef05cf23	time: Enable get/put_compat_itimerspec64 always This will aid in enabling the compat syscalls on 32-bit architectures later on. Also move compat_itimerspec and related defines to compat_time.h. The compat_time.h file will eventually be deleted. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: viro@zeniv.linux.org.uk Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Cc: y2038@lists.linaro.org Link: https://lkml.kernel.org/r/20180617051144.29756-3-deepa.kernel@gmail.com	2018-06-24 14:39:47 +02:00
Deepa Dinamani	d0dd63a8ae	time: Introduce struct __kernel_itimerspec struct itimerspec is not y2038-safe. Introduce a new struct __kernel_itimerspec based on the kernel internal y2038-safe struct itimerspec64. The definition of struct __kernel_itimerspec includes two struct __kernel_timespec. Since struct __kernel_timespec has the same representation in native and compat modes, so does struct __kernel_itimerspec. This helps have a common entry point for syscalls using struct __kernel_itimerspec. New y2038-safe syscalls will use this new type. Since most of the new syscalls are just an update to the native syscalls with the type update, place the new definition under CONFIG_64BIT_TIME. This helps architectures that do not support the above config to keep using the old definition of struct itimerspec. Also change the get/put_itimerspec64 to use struct__kernel_itimerspec. This will help 32 bit architectures to use the new syscalls when architectures select CONFIG_64BIT_TIME. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: viro@zeniv.linux.org.uk Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Cc: y2038@lists.linaro.org Link: https://lkml.kernel.org/r/20180617051144.29756-2-deepa.kernel@gmail.com	2018-06-24 14:39:46 +02:00
Linus Torvalds	c81b995f00	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A pile of perf updates: Kernel side: - Remove an incorrect warning in uprobe_init_insn() when insn_get_length() fails. The error return code is handled at the call site. - Move the inline keyword to the right place in the perf ringbuffer code to address a W=1 build warning. Tooling: perf stat: - Fix metric column header display alignment - Improve error messages for default attributes, providing better output for error in command line. - Add --interval-clear option, to provide a 'watch' like printing perf script: - Show hw-cache events too perf c2c: - Fix data dependency problem in layout of 'struct c2c_hist_entry' Core: - Do not blindly assume that 'struct perf_evsel' can be obtained via a straight forward container_of() as there are call sites which hand in a plain 'struct hist' which is not part of a container. - Fix error index in the PMU event parser, so that error messages can point to the problematic token" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Move the inline keyword at the beginning of the function declaration uprobes/x86: Remove incorrect WARN_ON() in uprobe_init_insn() perf script: Show hw-cache events perf c2c: Keep struct hist_entry at the end of struct c2c_hist_entry perf stat: Add event parsing error handling to add_default_attributes perf stat: Allow to specify specific metric column len perf stat: Fix metric column header display alignment perf stat: Use only color_fprintf call in print_metric_only perf stat: Add --interval-clear option perf tools: Fix error index for pmu event parser perf hists: Reimplement hists__has_callchains() perf hists browser gtk: Use hist_entry__has_callchains() perf hists: Make hist_entry__has_callchains() work with 'perf c2c' perf hists: Save the callchain_size in struct hist_entry	2018-06-24 20:29:15 +08:00
Linus Torvalds	2ce413ec16	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq fixes from Thomas Gleixer: "A pile of rseq related fixups: - Prevent infinite recursion when delivering SIGSEGV - Remove the abort of rseq critical section on fork() as syscalls inside rseq critical sections are explicitely forbidden. So no point in doing the abort on the child. - Align the rseq structure on 32 bytes in the ARM selftest code. - Fix file permissions of the test script" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: Avoid infinite recursion when delivering SIGSEGV rseq/cleanup: Do not abort rseq c.s. in child on fork() rseq/selftests/arm: Align 'struct rseq_cs' on 32 bytes rseq/selftests: Make run_param_test.sh executable	2018-06-24 20:18:19 +08:00
Lukas Wunner	519cc8652b	genirq: Synchronize only with single thread on free_irq() When pciehp is converted to threaded IRQ handling, removal of unplugged devices below a PCIe hotplug port happens synchronously in the IRQ thread. Removal of devices typically entails a call to free_irq() by their drivers. If those devices share their IRQ with the hotplug port, __free_irq() deadlocks because it calls synchronize_irq() to wait for all hard IRQ handlers as well as all threads sharing the IRQ to finish. Actually it's sufficient to wait only for the IRQ thread of the removed device, so call synchronize_hardirq() to wait for all hard IRQ handlers to finish, but no longer for any threads. Compensate by rearranging the control flow in irq_wait_for_interrupt() such that the device's thread is allowed to run one last time after kthread_stop() has been called. kthread_stop() blocks until the IRQ thread has completed. On completion the IRQ thread clears its oneshot thread_mask bit. This is safe because __free_irq() holds the request_mutex, thereby preventing __setup_irq() from handing out the same oneshot thread_mask bit to a newly requested action. Stack trace for posterity: INFO: task irq/17-pciehp:94 blocked for more than 120 seconds. schedule+0x28/0x80 synchronize_irq+0x6e/0xa0 __free_irq+0x15a/0x2b0 free_irq+0x33/0x70 pciehp_release_ctrl+0x98/0xb0 pcie_port_remove_service+0x2f/0x40 device_release_driver_internal+0x157/0x220 bus_remove_device+0xe2/0x150 device_del+0x124/0x340 device_unregister+0x16/0x60 remove_iter+0x1a/0x20 device_for_each_child+0x4b/0x90 pcie_port_device_remove+0x1e/0x30 pci_device_remove+0x36/0xb0 device_release_driver_internal+0x157/0x220 pci_stop_bus_device+0x7d/0xa0 pci_stop_bus_device+0x3d/0xa0 pci_stop_and_remove_bus_device+0xe/0x20 pciehp_unconfigure_device+0xb8/0x160 pciehp_disable_slot+0x84/0x130 pciehp_ist+0x158/0x190 irq_thread_fn+0x1b/0x50 irq_thread+0x143/0x1a0 kthread+0x111/0x130 Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: linux-pci@vger.kernel.org Link: https://lkml.kernel.org/r/d72b41309f077c8d3bee6cc08ad3662d50b5d22a.1529828292.git.lukas@wunner.de	2018-06-24 14:17:27 +02:00
Lukas Wunner	836557bd58	genirq: Update code comments wrt recycled thread_mask Previously a race existed between __free_irq() and __setup_irq() wherein the thread_mask of a just removed action could be handed out to a newly added action and the freed irq thread would then tread on the oneshot mask bit of the newly added irq thread in irq_finalize_oneshot(): time \| __free_irq() \| raw_spin_lock_irqsave(&desc->lock, flags); \| <remove action from linked list> \| raw_spin_unlock_irqrestore(&desc->lock, flags); \| \| __setup_irq() \| raw_spin_lock_irqsave(&desc->lock, flags); \| <traverse linked list to determine oneshot mask bit> \| raw_spin_unlock_irqrestore(&desc->lock, flags); \| \| irq_thread() of freed irq (__free_irq() waits in synchronize_irq()) \| irq_thread_fn() \| irq_finalize_oneshot() \| raw_spin_lock_irq(&desc->lock); \| desc->threads_oneshot &= ~action->thread_mask; \| raw_spin_unlock_irq(&desc->lock); v The race was known at least since 2012 when it was documented in a code comment by commit `e04268b0ef` ("genirq: Remove paranoid warnons and bogus fixups"). The race itself is harmless as nothing touches any of the potentially freed data after synchronize_irq(). In 2017 the race was close by commit `9114014cf4` ("genirq: Add mutex to irq desc to serialize request/free_irq()"), apparently inadvertantly so because the race is neither mentioned in the commit message nor was the code comment updated. Make up for that. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: linux-pci@vger.kernel.org Link: https://lkml.kernel.org/r/32fc25aa35ecef4b2692f57687bb7fc2a57230e2.1529828292.git.lukas@wunner.de	2018-06-24 14:17:26 +02:00
Linus Torvalds	2da2ca24a3	Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "A set of fixes and updates for the locking code: - Prevent lockdep from updating irq state within its own code and thereby confusing itself. - Buid fix for older GCCs which mistreat anonymous unions - Add a missing lockdep annotation in down_read_non_onwer() which causes up_read_non_owner() to emit a lockdep splat - Remove the custom alpha dec_and_lock() implementation which is incorrect in terms of ordering and use the generic one. The remaining two commits are not strictly fixes. They provide irqsave variants of atomic_dec_and_lock() and refcount_dec_and_lock(). These are required to merge the relevant updates and cleanups into different maintainer trees for 4.19, so routing them into mainline without actual users is the sanest approach. They should have been in -rc1, but last weekend I took the liberty to just avoid computers in order to regain some mental sanity" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/qspinlock: Fix build for anonymous union in older GCC compilers locking/lockdep: Do not record IRQ state within lockdep code locking/rwsem: Fix up_read_non_owner() warning with DEBUG_RWSEMS locking/refcounts: Implement refcount_dec_and_lock_irqsave() atomic: Add irqsave variant of atomic_dec_and_lock() alpha: Remove custom dec_and_lock() implementation	2018-06-24 19:36:16 +08:00
Linus Torvalds	6242258b6b	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A small set of fixes for time(r) related issues: - Fix a long standing conversion issue in jiffies_to_msecs() for odd HZ values like 1024 or 1200 which resulted in returning 0 for small jiffies values due to rounding down. - Use the proper CONFIG symbol in the new Y2038 safe compat code for posix-timers. Not yet a visible breakage, but this will immediately trigger when the architecture support for the new interfaces is merged. - Return an error code in the STM32 clocksource driver on failure instead of success. - Remove the redundant and stale irq disabled check in the posix cpu timer code. The check is at the wrong place anyway and lockdep already covers it via the sighand lock locking coverage" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Make sure jiffies_to_msecs() preserves non-zero time periods posix-timers: Fix nanosleep_copyout() for CONFIG_COMPAT_32BIT_TIME clocksource/drivers/stm32: Fix error return code posix-cpu-timers: Remove lockdep_assert_irqs_disabled()	2018-06-24 19:16:42 +08:00
Linus Torvalds	78fea6334f	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of fixes mostly for the ARM/GIC world: - Fix the MSI affinity handling in the ls-scfg irq chip driver so it updates and uses the effective affinity mask correctly - Prevent binding LPIs to offline CPUs and respect the Cavium erratum which requires that LPIs which belong to an offline NUMA node are not bound to a CPU on a different NUMA node. - Free only the amount of allocated interrupts in the GIC-V2M driver instead of trying to free log2(nrirqs). - Prevent emitting SYNC and VSYNC targetting non existing interrupt collections in the GIC-V3 ITS driver - Ensure that the GIV-V3 interrupt redistributor is correctly reprogrammed on CPU hotplug - Remove a stale unused helper function" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqdesc: Delete irq_desc_get_msi_desc() irqchip/gic-v3-its: Fix reprogramming of redistributors on CPU hotplug irqchip/gic-v3-its: Only emit VSYNC if targetting a valid collection irqchip/gic-v3-its: Only emit SYNC if targetting a valid collection irqchip/gic-v3-its: Don't bind LPI to unavailable NUMA node irqchip/gic-v2m: Fix SPI release on error path irqchip/ls-scfg-msi: Fix MSI affinity handling genirq/debugfs: Add missing IRQCHIP_SUPPORTS_LEVEL_MSI debug	2018-06-24 19:01:18 +08:00
Linus Torvalds	81f9c4e417	This contains a few fixes and a clean up. - A bad merge caused an "endif" to go in the wrong place in scripts/Makefile.build - Softirq tracing fix for tracing that corrupts lockdep and causes a false splat - Histogram documentation typo fixes - Fix a bad memory reference when passing in no filter to the filter code - Simplify code by using the swap macro instead of open coding the swap -----BEGIN PGP SIGNATURE----- iIkEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWy2gMBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qsqWAPdmoudDVf9Wi+THEJzlZL7ZhfYSDQyi A5Y3mc3iKoZeAQCq7PD7uH4Q1IMaVbAKG8OxvGVb69ijkMsSL4XxD33sAQ== =U1V7 -----END PGP SIGNATURE----- Merge tag 'trace-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This contains a few fixes and a clean up. - a bad merge caused an "endif" to go in the wrong place in scripts/Makefile.build - softirq tracing fix for tracing that corrupts lockdep and causes a false splat - histogram documentation typo fixes - fix a bad memory reference when passing in no filter to the filter code - simplify code by using the swap macro instead of open coding the swap" * tag 'trace-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix SKIP_STACK_VALIDATION=1 build due to bad merge with -mrecord-mcount tracing: Fix some errors in histogram documentation tracing: Use swap macro in update_max_tr softirq: Reorder trace_softirqs_on to prevent lockdep splat tracing: Check for no filter when processing event filters	2018-06-24 06:23:28 +08:00
Will Deacon	784e0300fe	rseq: Avoid infinite recursion when delivering SIGSEGV When delivering a signal to a task that is using rseq, we call into __rseq_handle_notify_resume() so that the registers pushed in the sigframe are updated to reflect the state of the restartable sequence (for example, ensuring that the signal returns to the abort handler if necessary). However, if the rseq management fails due to an unrecoverable fault when accessing userspace or certain combinations of RSEQ_CS_* flags, then we will attempt to deliver a SIGSEGV. This has the potential for infinite recursion if the rseq code continuously fails on signal delivery. Avoid this problem by using force_sigsegv() instead of force_sig(), which is explicitly designed to reset the SEGV handler to SIG_DFL in the case of a recursive fault. In doing so, remove rseq_signal_deliver() from the internal rseq API and have an optional struct ksignal * parameter to rseq_handle_notify_resume() instead. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: peterz@infradead.org Cc: paulmck@linux.vnet.ibm.com Cc: boqun.feng@gmail.com Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com	2018-06-22 19:04:22 +02:00
Geert Uytterhoeven	abcbcb80cd	time: Make sure jiffies_to_msecs() preserves non-zero time periods For the common cases where 1000 is a multiple of HZ, or HZ is a multiple of 1000, jiffies_to_msecs() never returns zero when passed a non-zero time period. However, if HZ > 1000 and not an integer multiple of 1000 (e.g. 1024 or 1200, as used on alpha and DECstation), jiffies_to_msecs() may return zero for small non-zero time periods. This may break code that relies on receiving back a non-zero value. jiffies_to_usecs() does not need such a fix: one jiffy can only be less than one µs if HZ > 1000000, and such large values of HZ are already rejected at build time, twice: - include/linux/jiffies.h does #error if HZ >= 12288, - kernel/time/time.c has BUILD_BUG_ON(HZ > USEC_PER_SEC). Broken since forever. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Stephen Boyd <sboyd@kernel.org> Cc: linux-alpha@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180622143357.7495-1-geert@linux-m68k.org	2018-06-22 17:48:36 +02:00
Eric Dumazet	74bdf7815d	genirq: Speedup show_interrupts() Since commit `425a5072dc` ("genirq: Free irq_desc with rcu"), show_interrupts() can be switched to rcu locking, which removes possible contention on sparse_irq_lock. The per_cpu count scan and print can be done without holding desc spinlock. And there is no need to call kstat_irqs_cpu() and abuse irq_to_desc() while holding rcu read lock, since desc and desc->kstat_irqs wont disappear or change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Eric Dumazet <eric.dumazet@gmail.com> Link: https://lkml.kernel.org/r/20180620150332.163320-1-edumazet@google.com	2018-06-22 14:22:58 +02:00
Marc Zyngier	72a8edc2d9	genirq/debugfs: Add missing IRQCHIP_SUPPORTS_LEVEL_MSI debug Debug is missing the IRQCHIP_SUPPORTS_LEVEL_MSI debug entry, making debugfs slightly less useful. Take this opportunity to also add a missing comment in the definition of IRQCHIP_SUPPORTS_LEVEL_MSI. Fixes: `6988e0e0d2` ("genirq/msi: Limit level-triggered MSI to platform devices") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Yang Yingliang <yangyingliang@huawei.com> Cc: Sumit Garg <sumit.garg@linaro.org> Link: https://lkml.kernel.org/r/20180622095254.5906-2-marc.zyngier@arm.com	2018-06-22 14:22:00 +02:00
Jessica Yu	5fdc7db644	module: setup load info before module_sig_check() We want to be able to log the module name in early error messages, such as when module signature verification fails. Previously, the module name is set in layout_and_allocate(), meaning that any error messages that happen before (such as those in module_sig_check()) won't be logged with a module name, which isn't terribly helpful. In order to do this, reshuffle the order in load_module() and set up load info earlier so that we can log the module name along with these error messages. This requires splitting rewrite_section_headers() out of setup_load_info(). While we're at it, clean up and split up the operations done in layout_and_allocate(), setup_load_info(), and rewrite_section_headers() more cleanly so these functions only perform what their names suggest. Signed-off-by: Jessica Yu <jeyu@kernel.org>	2018-06-22 14:00:01 +02:00
Jessica Yu	81a0abd9f2	module: make it clear when we're handling the module copy in info->hdr In load_module(), it's not always clear whether we're handling the temporary module copy in info->hdr (which is freed at the end of load_module()) or if we're handling the module already allocated and copied to it's final place. Adding an info->mod field and using it whenever we're handling the temporary copy makes that explicitly clear. Signed-off-by: Jessica Yu <jeyu@kernel.org>	2018-06-22 13:59:29 +02:00
Mathieu Malaterre	57d6a7938a	perf/core: Move the inline keyword at the beginning of the function declaration When building perf with W=1 the following warning triggers: CC kernel/events/ring_buffer.o kernel/events/ring_buffer.c:105:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] static bool __always_inline ^~~~~~ ... Move the inline keyword to the beginning of the function declaration. Signed-off-by: Mathieu Malaterre <malat@debian.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: trival@kernel.org Link: http://lkml.kernel.org/r/20180308202856.9378-1-malat@debian.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-22 11:07:47 +02:00
Gustavo A. R. Silva	08ae88f810	tracing: Use swap macro in update_max_tr Make use of the swap macro and remove unnecessary variable _buf_. This makes the code easier to read and maintain. Also, reduces the stack usage. This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20180209175316.GA18720@embeddedgus Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-06-21 15:12:43 -04:00
Joel Fernandes (Google)	1a63dcd876	softirq: Reorder trace_softirqs_on to prevent lockdep splat I'm able to reproduce a lockdep splat with config options: CONFIG_PROVE_LOCKING=y, CONFIG_DEBUG_LOCK_ALLOC=y and CONFIG_PREEMPTIRQ_EVENTS=y $ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable [ 26.112609] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled) [ 26.112636] WARNING: CPU: 0 PID: 118 at kernel/locking/lockdep.c:3854 [...] [ 26.144229] Call Trace: [ 26.144926] <IRQ> [ 26.145506] lock_acquire+0x55/0x1b0 [ 26.146499] ? __do_softirq+0x46f/0x4d9 [ 26.147571] ? __do_softirq+0x46f/0x4d9 [ 26.148646] trace_preempt_on+0x8f/0x240 [ 26.149744] ? trace_preempt_on+0x4d/0x240 [ 26.150862] ? __do_softirq+0x46f/0x4d9 [ 26.151930] preempt_count_sub+0x18a/0x1a0 [ 26.152985] __do_softirq+0x46f/0x4d9 [ 26.153937] irq_exit+0x68/0xe0 [ 26.154755] smp_apic_timer_interrupt+0x271/0x280 [ 26.156056] apic_timer_interrupt+0xf/0x20 [ 26.157105] </IRQ> The issue was this: preempt_count = 1 << SOFTIRQ_SHIFT __local_bh_enable(cnt = 1 << SOFTIRQ_SHIFT) { if (softirq_count() == (cnt && SOFTIRQ_MASK)) { trace_softirqs_on() { current->softirqs_enabled = 1; } } preempt_count_sub(cnt) { trace_preempt_on() { tracepoint() { rcu_read_lock_sched() { // jumps into lockdep Where preempt_count still has softirqs disabled, but current->softirqs_enabled is true, and we get a splat. Link: http://lkml.kernel.org/r/20180607201143.247775-1-joel@joelfernandes.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Glexiner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Todd Kjos <tkjos@google.com> Cc: Erick Reyes <erickreyes@google.com> Cc: Julia Cartwright <julia@ni.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: stable@vger.kernel.org Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Fixes: `d59158162e` ("tracing: Add support for preempt and irq enable/disable events") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-06-21 15:12:43 -04:00
Steven Rostedt (VMware)	70303420b5	tracing: Check for no filter when processing event filters The syzkaller detected a out-of-bounds issue with the events filter code, specifically here: prog[N].pred = NULL; /* #13 / prog[N].target = 1; / TRUE / prog[N+1].pred = NULL; prog[N+1].target = 0; / FALSE */ -> prog[N-1].target = N; prog[N-1].when_to_branch = false; As that's the first reference to a "N-1" index, it appears that the code got here with N = 0, which means the filter parser found no filter to parse (which shouldn't ever happen, but apparently it did). Add a new error to the parsing code that will check to make sure that N is not zero before going into this part of the code. If N = 0, then -EINVAL is returned, and a error message is added to the filter. Cc: stable@vger.kernel.org Fixes: `80765597bc` ("tracing: Rewrite filter logic to be simpler and faster") Reported-by: air icy <icytxw@gmail.com> bugzilla url: https://bugzilla.kernel.org/show_bug.cgi?id=200019 Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2018-06-21 15:12:42 -04:00
Steven Rostedt (VMware)	fcc784be83	locking/lockdep: Do not record IRQ state within lockdep code While debugging where things were going wrong with mapping enabling/disabling interrupts with the lockdep state and actual real enabling and disabling interrupts, I had to silent the IRQ disabling/enabling in debug_check_no_locks_freed() because it was always showing up as it was called before the splat was. Use raw_local_irq_save/restore() for not only debug_check_no_locks_freed() but for all internal lockdep functions, as they hide useful information about where interrupts were used incorrectly last. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/lkml/20180404140630.3f4f4c7a@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 18:19:01 +02:00
Li RongQing	03585a95cd	sched/fair: Remove stale tg_unthrottle_up() comments After commit: `82958366cf` ("sched: Replace update_shares weight distribution with per-entity computation") tg_unthrottle_up() did not update the weight. Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/lkml/1523423816-18322-1-git-send-email-lirongqing@baidu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 17:58:22 +02:00
Masami Hiramatsu	4458515b2c	kprobes: Replace %p with other pointer types Replace %p with %pS or just remove it if unneeded. And use WARN_ONCE() if it is a single bug. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: David S . Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Tobin C . Harding <me@tobin.cc> Cc: Will Deacon <will.deacon@arm.com> Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: brueckner@linux.vnet.ibm.com Cc: linux-arch@vger.kernel.org Cc: rostedt@goodmis.org Cc: schwidefsky@de.ibm.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/152491899284.9916.5350534544808158621.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 17:33:42 +02:00
Masami Hiramatsu	81365a947d	kprobes: Show address of kprobes if kallsyms does Show probed address in debugfs kprobe list file as same as kallsyms does. This information is used for checking kprobes are placed in the expected address. So it should be able to compared with address in kallsyms. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: David S . Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Tobin C . Harding <me@tobin.cc> Cc: Will Deacon <will.deacon@arm.com> Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: brueckner@linux.vnet.ibm.com Cc: linux-arch@vger.kernel.org Cc: rostedt@goodmis.org Cc: schwidefsky@de.ibm.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/152491896256.9916.1583733714492565296.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 17:33:42 +02:00
Masami Hiramatsu	ffb9bd68eb	kprobes: Show blacklist addresses as same as kallsyms does Show kprobes blacklist addresses under same condition of showing kallsyms addresses. Since there are several name conflict for local symbols, kprobe blacklist needs to show each addresses so that user can identify where is on blacklist by comparing with kallsyms. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: David S . Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Tobin C . Harding <me@tobin.cc> Cc: Will Deacon <will.deacon@arm.com> Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: brueckner@linux.vnet.ibm.com Cc: linux-arch@vger.kernel.org Cc: rostedt@goodmis.org Cc: schwidefsky@de.ibm.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/152491893217.9916.14760965896164273464.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 17:33:41 +02:00
Masami Hiramatsu	f2a3ab3607	kprobes: Make list and blacklist root user read only Since the blacklist and list files on debugfs indicates a sensitive address information to reader, it should be restricted to the root user. Suggested-by: Thomas Richter <tmricht@linux.ibm.com> Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: David S . Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tobin C . Harding <me@tobin.cc> Cc: Will Deacon <will.deacon@arm.com> Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: brueckner@linux.vnet.ibm.com Cc: linux-arch@vger.kernel.org Cc: rostedt@goodmis.org Cc: schwidefsky@de.ibm.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/152491890171.9916.5183693615601334087.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 17:33:41 +02:00
Souptick Joarder	9e3ed2d759	perf/core: Change perf_mmap_fault() return type to 'vm_fault_t' Use new return type 'vm_fault_t' for fault handlers. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See the following commit: `1c8f422059` ("mm: change return type to vm_fault_t") Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: alexander.shishkin@linux.intel.com Cc: jolsa@redhat.com Cc: namhyung@kernel.org Link: https://lkml.kernel.org/lkml/20180521182520.GA19677@jordon-HP-15-Notebook-PC Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 16:26:25 +02:00
Yisheng Xie	8f894bf47d	sched/debug: Use match_string() helper instead of open-coded logic match_string() returns the index of an array for a matching string, which can be used instead of the open coded variant. Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/1527765086-19873-15-git-send-email-xieyisheng1@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 15:45:31 +02:00
Mark Rutland	bfc18e389c	atomics/treewide: Rename __atomic_add_unless() => atomic_fetch_add_unless() While __atomic_add_unless() was originally intended as a building-block for atomic_add_unless(), it's now used in a number of places around the kernel. It's the only common atomic operation named __atomic(), rather than atomic_(), and for consistency it would be better named atomic_fetch_add_unless(). This lack of consistency is slightly confusing, and gets in the way of scripting atomics. Given that, let's clean things up and promote it to an official part of the atomics API, in the form of atomic_fetch_add_unless(). This patch converts definitions and invocations over to the new name, including the instrumented version, using the following script: ---- git grep -w __atomic_add_unless \| while read line; do sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:}"; done git grep -w __arch_atomic_add_unless \| while read line; do sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:}"; done ---- Note that we do not have atomic{64,_long}_fetch_add_unless(), which will be introduced by later patches. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Palmer Dabbelt <palmer@sifive.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 14:22:32 +02:00
Thomas Gleixner	05736e4ac1	cpu/hotplug: Provide knobs to control SMT Provide a command line and a sysfs knob to control SMT. The command line options are: 'nosmt': Enumerate secondary threads, but do not online them 'nosmt=force': Ignore secondary threads completely during enumeration via MP table and ACPI/MADT. The sysfs control file has the following states (read/write): 'on': SMT is enabled. Secondary threads can be freely onlined 'off': SMT is disabled. Secondary threads, even if enumerated cannot be onlined 'forceoff': SMT is permanentely disabled. Writes to the control file are rejected. 'notsupported': SMT is not supported by the CPU The command line option 'nosmt' sets the sysfs control to 'off'. This can be changed to 'on' to reenable SMT during runtime. The command line option 'nosmt=force' sets the sysfs control to 'forceoff'. This cannot be changed during runtime. When SMT is 'on' and the control file is changed to 'off' then all online secondary threads are offlined and attempts to online a secondary thread later on are rejected. When SMT is 'off' and the control file is changed to 'on' then secondary threads can be onlined again. The 'off' -> 'on' transition does not automatically online the secondary threads. When the control file is set to 'forceoff', the behaviour is the same as setting it to 'off', but the operation is irreversible and later writes to the control file are rejected. When the control status is 'notsupported' then writes to the control file are rejected. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 14:20:58 +02:00
Thomas Gleixner	cc1fe215e1	cpu/hotplug: Split do_cpu_down() Split out the inner workings of do_cpu_down() to allow reuse of that function for the upcoming SMT disabling mechanism. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 14:20:57 +02:00
Thomas Gleixner	c4de65696d	cpu/hotplug: Make bringup/teardown of smp threads symmetric The asymmetry caused a warning to trigger if the bootup was stopped in state CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can now be invoked on already or still parked threads. But there is still no reason to have this be asymmetric. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 14:20:57 +02:00
Peter Zijlstra	ba2591a599	sched/smt: Update sched_smt_present at runtime The static key sched_smt_present is only updated at boot time when SMT siblings have been detected. Booting with maxcpus=1 and bringing the siblings online after boot rebuilds the scheduling domains correctly but does not update the static key, so the SMT code is not enabled. Let the key be updated in the scheduler CPU hotplug code to fix this. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 14:20:56 +02:00
Masami Hiramatsu	cce188bd58	bpf/error-inject/kprobes: Clear current_kprobe and enable preempt in kprobe Clear current_kprobe and enable preemption in kprobe even if pre_handler returns !0. This simplifies function override using kprobes. Jprobe used to require to keep the preemption disabled and keep current_kprobe until it returned to original function entry. For this reason kprobe_int3_handler() and similar arch dependent kprobe handers checks pre_handler result and exit without enabling preemption if the result is !0. After removing the jprobe, Kprobes does not need to keep preempt disabled even if user handler returns !0 anymore. But since the function override handler in error-inject and bpf is also returns !0 if it overrides a function, to balancing the preempt count, it enables preemption and reset current kprobe by itself. That is a bad design that is very buggy. This fixes such unbalanced preempt-count and current_kprobes setting in kprobes, bpf and error-inject. Note: for powerpc and x86, this removes all preempt_disable from kprobe_ftrace_handler because ftrace callbacks are called under preempt disabled. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <jhogan@kernel.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Link: https://lore.kernel.org/lkml/152942494574.15209.12323837825873032258.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 12:33:19 +02:00
Masami Hiramatsu	059053a275	kprobes: Don't check the ->break_handler() in generic kprobes code Don't check the ->break_handler() from the core kprobes code, because it was only used by jprobes which got removed. ( In followup patches we'll remove the remaining calls in low level arch handlers as well and remove the callback altogether. ) Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-arch@vger.kernel.org Link: https://lore.kernel.org/lkml/152942462686.15209.6324404940493598980.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 12:33:12 +02:00
Masami Hiramatsu	5a6cf77f5e	kprobes: Remove jprobe API implementation Remove functionally empty jprobe API implementations and test cases. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-arch@vger.kernel.org Link: https://lore.kernel.org/lkml/152942430705.15209.2307050500995264322.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-06-21 12:33:05 +02:00
Linus Torvalds	d8894a08d9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix crash on bpf_prog_load() errors, from Daniel Borkmann. 2) Fix ATM VCC memory accounting, from David Woodhouse. 3) fib6_info objects need RCU freeing, from Eric Dumazet. 4) Fix SO_BINDTODEVICE handling for TCP sockets, from David Ahern. 5) Fix clobbered error code in enic_open() failure path, from Govindarajulu Varadarajan. 6) Propagate dev_get_valid_name() error returns properly, from Li RongQing. 7) Fix suspend/resume in davinci_emac driver, from Bartosz Golaszewski. 8) Various act_ife fixes (recursive locking, IDR leaks, etc.) from Davide Caratti. 9) Fix buggy checksum handling in sungem driver, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits) ip: limit use of gso_size to udp stmmac: fix DMA channel hang in half-duplex mode net: stmmac: socfpga: add additional ocp reset line for Stratix10 net: sungem: fix rx checksum support bpfilter: ignore binary files bpfilter: fix build error net/usb/drivers: Remove useless hrtimer_active check net/sched: act_ife: preserve the action control in case of error net/sched: act_ife: fix recursive lock and idr leak net: ethernet: fix suspend/resume in davinci_emac net: propagate dev_get_valid_name return code enic: do not overwrite error code net/tcp: Fix socket lookups with SO_BINDTODEVICE ptp: replace getnstimeofday64() with ktime_get_real_ts64() net/ipv6: respect rcu grace period before freeing fib6_info net: net_failover: fix typo in net_failover_slave_register() ipvlan: use ETH_MAX_MTU as max mtu net: hamradio: use eth_broadcast_addr enic: initialize enic->rfs_h.lock in enic_probe MAINTAINERS: Add Sam as the maintainer for NCSI ...	2018-06-21 07:13:42 +09:00
Peter Zijlstra	b3dae109fa	sched/swait: Rename to exclusive Since swait basically implemented exclusive waits only, make sure the API reflects that. $ git grep -l -e "\<swake_up\>" -e "\<swait_event[^ (]" -e "\<prepare_to_swait\>" \| while read file; do sed -i -e 's/\<swake_up\>/&_one/g' -e 's/\<swait_event[^ (]/&_exclusive/g' -e 's/\<prepare_to_swait\>/&_exclusive/g' $file; done With a few manual touch-ups. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: bigeasy@linutronix.de Cc: oleg@redhat.com Cc: paulmck@linux.vnet.ibm.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180612083909.261946548@infradead.org	2018-06-20 11:35:56 +02:00
Peter Zijlstra	0abf17bc77	sched/swait: Switch to full exclusive mode Linus noted that swait basically implements exclusive mode -- because swake_up() only wakes a single waiter. And because of that it should take care to properly deal with the interruptible case. In short, the problem is that swake_up() can race with a signal. In this this case it is possible the swake_up() 'wakes' the waiter that is already on the way out because it just got a signal and the wakeup gets lost. The normal wait code is very careful and avoids this situation, make sure we do too. Copy the exact exclusive semantics from wait. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: bigeasy@linutronix.de Cc: oleg@redhat.com Cc: paulmck@linux.vnet.ibm.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180612083909.209762413@infradead.org	2018-06-20 11:35:56 +02:00
Peter Zijlstra	6519750210	sched/swait: Remove __prepare_to_swait There is no public user of this API, remove it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: bigeasy@linutronix.de Cc: oleg@redhat.com Cc: paulmck@linux.vnet.ibm.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180612083909.157076812@infradead.org	2018-06-20 11:35:56 +02:00
Waiman Long	03eeafdd9a	locking/rwsem: Fix up_read_non_owner() warning with DEBUG_RWSEMS It was found that the use of up_read_non_owner() in NFS was causing the following warning when DEBUG_RWSEMS was configured. DEBUG_LOCKS_WARN_ON(sem->owner != ((struct task_struct *)(1UL << 0))) Looking into the rwsem.c file, it was discovered that the corresponding down_read_non_owner() function was not setting the owner field properly. This is fixed now, and the warning should be gone. Fixes: `5149cbac42` ("locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Gavin Schenk <g.schenk@eckelmann.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-nfs@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1527168398-4291-1-git-send-email-longman@redhat.com	2018-06-20 11:29:23 +02:00
Linus Torvalds	6d90eb7ba3	Move all the dma-mapping code to kernel/dma -----BEGIN PGP SIGNATURE----- iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlsp150LHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYNVkg//XcJtjBdhCLl7Q9H+JfCvknYH1nnS8nXs0e3WE+3n dfphChd1eKbk/IWFaMf3rOHgLw49w7YbatRxP60fFcNyAgCzFJaW/dq5D9BdQS+X j2h9O28rs7eskQJ0VKdjOg3By2KjqiPdVvgduRSou8RLeVBgJgBSXZTm5xMaY3oX iNRQhf+opL75Vpl7tZT0QH58X07fhai8dyGOJ31Yg45IKMgrY/bipMEAQvlqJteG FcQa5PBUj0MW9dImlQVp1DofUFCzUqZgNTl7Dis6d0wJBJBpkOX7n1ysN2HmPcUE T0rMU7GY9ZexF1+VTN/jIYos3smud93OG4LQHkos429E2E7ljmSPSAJx5xczxNr7 FRWbT3CKWJjuo2ZWKezoz80+faaicrRnav3RzrL3TgpMPes65d46daqITXZGQU/P zAwFTxj/s0gYziFm6Bdmmf24p1dhXVnu8SwIqI002+/7D7/9LsnUzeq0E/dPV+uX l75zYIlRxKrRuJsst53Sbcc7V3y2lA9psMexP1NqcCSSSyYG74xBAhqHe0i73w86 STKyI/doLw5It1oCgkODWIhivZvdGSqE+/xYocyFFZhNqgrAkt8GwNThXnUAsAXR SSLvIMZ9PvSNXy9oMglk7tS2eFLMTEz1pPZkMREqBdGBgEhyCABo6lpELZk3nq/a IkE= =PrOL -----END PGP SIGNATURE----- Merge tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping rename from Christoph Hellwig: "Move all the dma-mapping code to kernel/dma and lose their dma-* prefixes" * tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: move all DMA mapping code to kernel/dma dma-mapping: use obj-y instead of lib-y for generic dma ops	2018-06-20 16:30:01 +09:00
Richard Guy Briggs	f7859590d9	audit: eliminate audit_enabled magic number comparison Remove comparison of audit_enabled to magic numbers outside of audit. Related: https://github.com/linux-audit/audit-kernel/issues/86 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-19 10:43:55 -04:00
Richard Guy Briggs	d904ac0320	audit: rename FILTER_TYPE to FILTER_EXCLUDE The AUDIT_FILTER_TYPE name is vague and misleading due to not describing where or when the filter is applied and obsolete due to its available filter fields having been expanded. Userspace has already renamed it from AUDIT_FILTER_TYPE to AUDIT_FILTER_EXCLUDE without checking if it already exists. The userspace maintainer assures that as long as it is set to the same value it will not be a problem since the userspace code does not treat compiler warnings as errors. If this policy changes then checks if it already exists can be added at the same time. See: https://github.com/linux-audit/audit-kernel/issues/89 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-19 10:39:54 -04:00
Ondrej Mosnáček	af85d1772e	audit: Fix extended comparison of GID/EGID The audit_filter_rules() function in auditsc.c used the in_[e]group_p() functions to check GID/EGID match, but these functions use the current task's credentials, while the comparison should use the credentials of the task given to audit_filter_rules() as a parameter (tsk). Note that we can use group_search(cred->group_info, ...) as a replacement for both in_group_p and in_egroup_p as these functions only compare the parameter to cred->fsgid/egid and then call group_search. In fact, the usage of in_group_p was even more incorrect: it compares to cred->fsgid (which is usually equal to cred->egid) and not cred->gid. GitHub issue: https://github.com/linux-audit/audit-kernel/issues/82 Fixes: `37eebe39c9` ("audit: improve GID/EGID comparation logic") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-19 10:33:04 -04:00
Richard Guy Briggs	d87de4a878	audit: tie ANOM_ABEND records to syscall Since core dump events are triggered by user activity, tie the ANOM_ABEND record to the syscall record to collect all records from the same event. See: https://github.com/linux-audit/audit-kernel/issues/88 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-19 10:30:05 -04:00
Richard Guy Briggs	9b8753fffe	audit: tie SECCOMP records to syscall Since seccomp events are triggered by user activity, tie the SECCOMP record to the syscall record to collect all records from the same event. See: https://github.com/linux-audit/audit-kernel/issues/87 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-19 10:26:59 -04:00
Ondrej Mosnáček	29c1372d6a	audit: allow other filter list types for AUDIT_EXE This patch removes the restriction of the AUDIT_EXE field to only SYSCALL filter and teaches audit_filter to recognize this field. This makes it possible to write rule lists such as: auditctl -a exit,always [some general rule] # Filter out events with executable name /bin/exe1 or /bin/exe2: auditctl -a exclude,always -F exe=/bin/exe1 auditctl -a exclude,always -F exe=/bin/exe2 See: https://github.com/linux-audit/audit-kernel/issues/54 Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-06-19 09:33:42 -04:00
Arnd Bergmann	dc1b7b6ca9	sysinfo: Remove get_monotonic_boottime() get_monotonic_boottime() is deprecated because it uses the old 'timespec' structure. This replaces one of the last callers with a call to ktime_get_boottime. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: y2038@lists.linaro.org Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Link: https://lkml.kernel.org/r/20180618150114.849216-1-arnd@arndb.de	2018-06-19 09:56:27 +02:00
Arnd Bergmann	58a10456d7	posix-timers: Use new ktime_get_*_ts64() helpers Some of the oddly named time accessor functions now have a more consistent naming, which should be used from now on so the aliases can be removed. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: y2038@lists.linaro.org Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/20180618143246.3865099-1-arnd@arndb.de	2018-06-19 09:56:27 +02:00
Arnd Bergmann	d30faff900	timekeeping: Use ktime_get_real_ts64() instead of getnstimeofday64() The two do the same, this moves all users to the newer name for consistency. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: y2038@lists.linaro.org Cc: Stephen Boyd <sboyd@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Link: https://lkml.kernel.org/r/20180618140811.2998503-3-arnd@arndb.de	2018-06-19 09:56:26 +02:00
Arnd Bergmann	f5a89295e2	time: Use ktime_get_real_seconds() in time syscall Both get_seconds() and do_gettimeofday() are deprecated. Change the time() implementation to use the replacement function instead. Obviously the system call will still overflow in 2038, but this gets us closer to removing the old helper functions. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: y2038@lists.linaro.org Cc: Stephen Boyd <sboyd@kernel.org> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20180618140811.2998503-2-arnd@arndb.de	2018-06-19 09:56:26 +02:00
Arnd Bergmann	0fe2795516	posix-timers: Fix nanosleep_copyout() for CONFIG_COMPAT_32BIT_TIME Commit `b5793b0d92` added support for building the nanosleep compat system call on 32-bit architectures, but missed one change in nanosleep_copyout(), which would trigger a BUG() as soon as any architecture is switched over to use it. Use the proper config symbol to enable the code path. Fixes: Commit `b5793b0d92` ("posix-timers: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: y2038@lists.linaro.org Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20180618140811.2998503-1-arnd@arndb.de	2018-06-19 09:23:19 +02:00
Jonathan Neuschäfer	0a13ec0bbc	genirq: Fix editing error in a comment When the comment was reflowed to a wider format, the "*" snuck in. Fixes: `ae88a23b32` ("irq: refactor and clean up the free_irq() code flow") Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20180617124018.25539-1-j.neuschaefer@gmx.net	2018-06-19 09:19:41 +02:00
Eric Dumazet	4a5f4d2f89	genirq: Use rcu in kstat_irqs_usr() Jeremy Dorfman identified mutex contention when multiple threads parse /proc/stat concurrently. Since commit `425a5072dc` ("genirq: Free irq_desc with rcu"), kstat_irqs_usr() can be switched to rcu locking, which removes this mutex contention. show_interrupts() case will be handled in a separate patch. Reported-by: Jeremy Dorfman <jdorfman@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lkml.kernel.org/r/20180618125612.155057-1-edumazet@google.com	2018-06-19 09:19:40 +02:00
Jessica Yu	9f2d1e68cf	module: exclude SHN_UNDEF symbols from kallsyms api Livepatch modules are special in that we preserve their entire symbol tables in order to be able to apply relocations after module load. The unwanted side effect of this is that undefined (SHN_UNDEF) symbols of livepatch modules are accessible via the kallsyms api and this can confuse symbol resolution in livepatch (klp_find_object_symbol()) and cause subtle bugs in livepatch. Have the module kallsyms api skip over SHN_UNDEF symbols. These symbols are usually not available for normal modules anyway as we cut down their symbol tables to just the core (non-undefined) symbols, so this should really just affect livepatch modules. Note that this patch doesn't affect the display of undefined symbols in /proc/kallsyms. Reported-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>	2018-06-18 10:04:03 +02:00
David S. Miller	0841d98641	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2018-06-16 The following pull-request contains BPF updates for your net tree. The main changes are: 1) Fix a panic in devmap handling in generic XDP where return type of __devmap_lookup_elem() got changed recently but generic XDP code missed the related update, from Toshiaki. 2) Fix a freeze when BPF progs are loaded that include BPF to BPF calls when JIT is enabled where we would later bail out via error path w/o dropping kallsyms, and another one to silence syzkaller splats from locking prog read-only, from Daniel. 3) Fix a bug in test_offloads.py BPF selftest which must not assume that the underlying system have no BPF progs loaded prior to test, and one in bpftool to fix accuracy of program load time, from Jakub. 4) Fix a bug in bpftool's probe for availability of the bpf(2) BPF_TASK_FD_QUERY subcommand, from Yonghong. 5) Fix a regression in AF_XDP's XDP_SKB receive path where queue id check got erroneously removed, from Björn. 6) Fix missing state cleanup in BPF's xfrm tunnel test, from William. 7) Check tunnel type more accurately in BPF's tunnel collect metadata kselftest, from Jian. 8) Fix missing Kconfig fragments for BPF kselftests, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-06-17 07:54:24 +09:00
Linus Torvalds	5e7b9212a4	Solve a series of broken links for files under Documentation: - can.rst: fix a footnote reference; - crypto_engine.rst: Fix two parsing warnings; - Fix a lot of broken references to Documentation/; - Improves the scripts/documentation-file-ref-check script, in order to help detecting/fixing broken references, preventing false-positives. After this patch series, only 33 broken references to doc files are detected by scripts/documentation-file-ref-check. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJbJC2aAAoJEAhfPr2O5OEVPmMP/2rN5m9LZ048oRWlg4hCwo73 4FpWqDg18hbWCMHXYHIN1UACIMUkIUfgLhF7WE3D/XqRMuxHoiE5u7DUdak7+VNt wunpksKFJbgyfFMHRvykHcZV+jQFVbM7eFvXVPIvoSaAeGH6zx4imHTyeDn3x/nL gdtBqM4bvEhmBjotBTRR4PB8+oPrT/HIT5npHepx3UnFFFAzDQGEZ/I67/el2G5C pVmYdBXvr7iqrvUs6FilHLTEfe1quCI4UaKNfLHKrxXrTkiJQFOwugYuobZfNmxT GwjWzfpNy9HMlKJFYipcByALxel1Mnpqz5mIxFQaCTygBuEsORCWzW5MoKIsIUJ0 KOoG76v0rUyMvLBRvaoao3CHYHdzxhQbtVV9DjyDuDksa2G5IoCAF1t6DyIOitRw 9plMnGckk+FJ/MXJKYWXHszFS8NhI0SF2zHe3s1DmRTD8P6oxkxvxBFz6iqqADmL W6XHd8CcqJItaS9ctPen91TFuysN1HFpdzLLY+xwWmmKOcWC/jFjhTm8pj7xLQHM 5yuuEcefsajf+Xk4w2fSQmRfXnuq+oOlPuWpwSvEy+59cHGI0ms18P1nHy/yt3II CJywwdx6fjwDon57RFKH7kkGd7px317zMqWdIv9gUj/qZAy9gcdLdvEQLhx9u0aV 4F+hLKFDFEpf58xqRT1R =/ozx -----END PGP SIGNATURE----- Merge tag 'docs-broken-links' of git://linuxtv.org/mchehab/experimental Pull documentation fixes from Mauro Carvalho Chehab: "This solves a series of broken links for files under Documentation, and improves a script meant to detect such broken links (see scripts/documentation-file-ref-check). The changes on this series are: - can.rst: fix a footnote reference; - crypto_engine.rst: Fix two parsing warnings; - Fix a lot of broken references to Documentation/; - improve the scripts/documentation-file-ref-check script, in order to help detecting/fixing broken references, preventing false-positives. After this patch series, only 33 broken references to doc files are detected by scripts/documentation-file-ref-check" * tag 'docs-broken-links' of git://linuxtv.org/mchehab/experimental: (26 commits) fix a series of Documentation/ broken file name references Documentation: rstFlatTable.py: fix a broken reference ABI: sysfs-devices-system-cpu: remove a broken reference devicetree: fix a series of wrong file references devicetree: fix name of pinctrl-bindings.txt devicetree: fix some bindings file names MAINTAINERS: fix location of DT npcm files MAINTAINERS: fix location of some display DT bindings kernel-parameters.txt: fix pointers to sound parameters bindings: nvmem/zii: Fix location of nvmem.txt docs: Fix more broken references scripts/documentation-file-ref-check: check tools/*/Documentation scripts/documentation-file-ref-check: get rid of false-positives scripts/documentation-file-ref-check: hint: dash or underline scripts/documentation-file-ref-check: add a fix logic for DT scripts/documentation-file-ref-check: accept more wildcards at filenames scripts/documentation-file-ref-check: fix help message media: max2175: fix location of driver's companion documentation media: v4l: fix broken video4linux docs locations media: dvb: point to the location of the old README.dvb-usb file ...	2018-06-17 05:25:18 +09:00
Linus Torvalds	dbb2816fc7	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlsielwACgkQnJ2qBz9k QNlN0Af/Q82iP3EqrT3w+CT7w0gER2su+Df2riDpo0/XYRQLxuyW+kYtLsQwovvB Q7Tt+WTSO5OIqoJxwGMmd6VO5ICblhP+uHVC6+JlWy17DgccjwFBE/sUopxPqJaK 9utwXZhqqOEoikNpDABcptNnWVILRl0yppkQrVV/pKkyZFp2F8vO4roUHFFYkJJt /uXJfLDQx6pBLTwqfQBFyiz0dCSsvCHUVnlw7Hu5JfE6xPtkMlk6F/M0Y0rvyEOg 8KmH5jUX/BXKIijg+ycOzS3CCdvm0UhrtiH5YWy4qGaI8eczT31Epfl08Sk8pvkv n2rnxNnJP5sjPPNQhXvHJqy9qRCB6g== =bLjN -----END PGP SIGNATURE----- Merge tag 'fsnotify_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "fsnotify cleanups unifying handling of different watch types. This is the shortened fsnotify series from Amir with the last five patches pulled out. Amir has modified those patches to not change struct inode but obviously it's too late for those to go into this merge window" * tag 'fsnotify_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: add fsnotify_add_inode_mark() wrappers fanotify: generalize fanotify_should_send_event() fsnotify: generalize send_to_group() fsnotify: generalize iteration of marks by object type fsnotify: introduce marks iteration helpers fsnotify: remove redundant arguments to handle_event() fsnotify: use type id to identify connector object type	2018-06-17 05:06:18 +09:00
Linus Torvalds	9215310cf1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Various netfilter fixlets from Pablo and the netfilter team. 2) Fix regression in IPVS caused by lack of PMTU exceptions on local routes in ipv6, from Julian Anastasov. 3) Check pskb_trim_rcsum for failure in DSA, from Zhouyang Jia. 4) Don't crash on poll in TLS, from Daniel Borkmann. 5) Revert SO_REUSE{ADDR,PORT} change, it regresses various things including Avahi mDNS. From Bart Van Assche. 6) Missing of_node_put in qcom/emac driver, from Yue Haibing. 7) We lack checking of the TCP checking in one special case during SYN receive, from Frank van der Linden. 8) Fix module init error paths of mac80211 hwsim, from Johannes Berg. 9) Handle 802.1ad properly in stmmac driver, from Elad Nachman. 10) Must grab HW caps before doing quirk checks in stmmac driver, from Jose Abreu. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits) net: stmmac: Run HWIF Quirks after getting HW caps neighbour: skip NTF_EXT_LEARNED entries during forced gc net: cxgb3: add error handling for sysfs_create_group tls: fix waitall behavior in tls_sw_recvmsg tls: fix use-after-free in tls_push_record l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl() l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels mlxsw: spectrum_switchdev: Fix port_vlan refcounting mlxsw: spectrum_router: Align with new route replace logic mlxsw: spectrum_router: Allow appending to dev-only routes ipv6: Only emit append events for appended routes stmmac: added support for 802.1ad vlan stripping cfg80211: fix rcu in cfg80211_unregister_wdev mac80211: Move up init of TXQs mac80211_hwsim: fix module init error paths cfg80211: initialize sinfo in cfg80211_get_station nl80211: fix some kernel doc tag mistakes hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload rds: avoid unenecessary cong_update in loop transport l2tp: clean up stale tunnel or session in pppol2tp_connect's error path ...	2018-06-16 07:39:34 +09:00
Linus Torvalds	de7f01c22a	Modules updates for v4.18 Summary of modules changes for the 4.18 merge window: - Minor code cleanup and also allow sig_enforce param to be shown in sysfs with CONFIG_MODULE_SIG_FORCE Signed-off-by: Jessica Yu <jeyu@kernel.org> -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJbI086AAoJEMBFfjjOO8Fy9XIP/iQrmvH0kWGa0WvCBopoJy1e 9V5nGNDaWePHynORMMqfgYwmfrG9Gv8TPivDn5raOxkJA8j6j9JwIaOTIfLX96/x pSSx8ofljvUQk7cSO8owx8+9BKfXTuyxczsGoKZrAI0M/ZBigDUmiG0JeUqkdKFo Ucf9R+2og1AQzfiqBEg5jxR7cIj0htezvdWEHiR3e68m4S0jFLivGMdGVAb//eXI 6vAfUki8GWDw4GIfZ7c1J70FYsJWqCZQDmeY31tjE487l8Woc48ZY4YKaFMjBR+z q0No1xYVs24jEjw+2dxeXeTQf1WyqHjUmikdEHYPv2BThc7vpd+zDJPoSU2GB8K9 XMELxnuqoT5PzHz7za0mTOBVnX0NJT4Y601GxiUpfT9dbmx+sj7YSJqO2RZ4bzVR xTRsNyejOAdEeoUbu+D1O3mnQzagFa4BOKFjKQe18vvJpEk+OIfSc7afyNnW7Ly6 TPbwxm2Xrn9td+6QMZiuu/p3M8WEp0I7XfkYqqUBsSFe0SbzBjYZ00G9/4nAB/DX 2KxYBrREPqrjzJdoNqLqgmDp/5Jix68Dy5m/tJ5GoR2S6bWilY7KqanlU15EG8jc MD9gOSfA5u0cYWirVQwgrQXXDlYDiDAqc0pKy1W8JQ07bQr5NiVVR82TdVOW599/ Jo8OHeIE4H9NQai8u2Mf =5gpe -----END PGP SIGNATURE----- Merge tag 'modules-for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull module updates from Jessica Yu: "Minor code cleanup and also allow sig_enforce param to be shown in sysfs with CONFIG_MODULE_SIG_FORCE" * tag 'modules-for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: Allow to always show the status of modsign module: Do not access sig_enforce directly	2018-06-16 07:36:39 +09:00
Toshiaki Makita	6d5fc19579	xdp: Fix handling of devmap in generic XDP Commit `67f29e07e1` ("bpf: devmap introduce dev_map_enqueue") changed the return value type of __devmap_lookup_elem() from struct net_device * to struct bpf_dtab_netdev * but forgot to modify generic XDP code accordingly. Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct net_device is expected, then skb->dev was set to invalid value. v2: - Fix compiler warning without CONFIG_BPF_SYSCALL. Fixes: `67f29e07e1` ("bpf: devmap introduce dev_map_enqueue") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-06-15 23:47:15 +02:00
Mauro Carvalho Chehab	44348e8ac1	fix a series of Documentation/ broken file name references As files move around, their previous links break. Fix the references for them. Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Jonathan Corbet <corbet@lwn.net>	2018-06-15 18:10:01 -03:00
Mauro Carvalho Chehab	5fb94e9ca3	docs: Fix some broken references As we move stuff around, some doc references are broken. Fix some of them via this script: ./scripts/documentation-file-ref-check --fix Manually checked if the produced result is valid, removing a few false-positives. Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Jonathan Corbet <corbet@lwn.net>	2018-06-15 18:10:01 -03:00
Daniel Borkmann	9facc33687	bpf: reject any prog that failed read-only lock We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro() as well as the BPF image as read-only through bpf_prog_lock_ro(). In the case any of these would fail we throw a WARN_ON_ONCE() in order to yell loudly to the log. Perhaps, to some extend, this may be comparable to an allocation where __GFP_NOWARN is explicitly not set. Added via `65869a47f3` ("bpf: improve read-only handling"), this behavior is slightly different compared to any of the other in-kernel set_memory_ro() users who do not check the return code of set_memory_ro() and friends /at all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given in BPF this is mandatory hardening step, we want to know whether there are any issues that would leave both BPF data writable. So it happens that syzkaller enabled fault injection and it triggered memory allocation failure deep inside x86's change_page_attr_set_clr() which was triggered from set_memory_ro(). Now, there are two options: i) leaving everything as is, and ii) reworking the image locking code in order to have a final checkpoint out of the central bpf_prog_select_runtime() which probes whether any of the calls during prog setup weren't successful, and then bailing out with an error. Option ii) is a better approach since this additional paranoia avoids altogether leaving any potential W+X pages from BPF side in the system. Therefore, lets be strict about it, and reject programs in such unlikely occasion. While testing I noticed also that one bpf_prog_lock_ro() call was missing on the outer dummy prog in case of calls, e.g. in the destructor we call bpf_prog_free_deferred() on the main prog where we try to bpf_prog_unlock_free() the program, and since we go via bpf_prog_select_runtime() do that as well. Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-06-15 11:14:25 -07:00
Daniel Borkmann	7d1982b4e3	bpf: fix panic in prog load calls cleanup While testing I found that when hitting error path in bpf_prog_load() where we jump to free_used_maps and prog contained BPF to BPF calls that were JITed earlier, then we never clean up the bpf_prog_kallsyms_add() done under jit_subprogs(). Add proper API to make BPF kallsyms deletion more clear and fix that. Fixes: `1c2a088a66` ("bpf: x64: add JIT support for multi-function programs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-06-15 11:14:25 -07:00
Linus Torvalds	b5d903c2d6	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: - MM remainders - various misc things - kcov updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits) lib/test_printf.c: call wait_for_random_bytes() before plain %p tests hexagon: drop the unused variable zero_page_mask hexagon: fix printk format warning in setup.c mm: fix oom_kill event handling treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAX mm: use octal not symbolic permissions ipc: use new return type vm_fault_t sysvipc/sem: mitigate semnum index against spectre v1 fault-injection: reorder config entries arm: port KCOV to arm sched/core / kcov: avoid kcov_area during task switch kcov: prefault the kcov_area kcov: ensure irq code sees a valid area kernel/relay.c: change return type to vm_fault_t exofs: avoid VLA in structures coredump: fix spam with zero VMA process fat: use fat_fs_error() instead of BUG_ON() in __fat_get_block() proc: skip branch in /proc// lookup mremap: remove LATENCY_LIMIT from mremap to reduce the number of TLB shootdowns mm/memblock: add missing include <linux/bootmem.h> ...	2018-06-15 08:51:42 +09:00
Mark Rutland	0ed557aa81	sched/core / kcov: avoid kcov_area during task switch During a context switch, we first switch_mm() to the next task's mm, then switch_to() that new task. This means that vmalloc'd regions which had previously been faulted in can transiently disappear in the context of the prev task. Functions instrumented by KCOV may try to access a vmalloc'd kcov_area during this window, and as the fault handling code is instrumented, this results in a recursive fault. We must avoid accessing any kcov_area during this window. We can do so with a new flag in kcov_mode, set prior to switching the mm, and cleared once the new task is live. Since task_struct::kcov_mode isn't always a specific enum kcov_mode value, this is made an unsigned int. The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers, which are empty for !CONFIG_KCOV kernels. The code uses macros because I can't use static inline functions without a circular include dependency between <linux/sched.h> and <linux/kcov.h>, since the definition of task_struct uses things defined in <linux/kcov.h> Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:55:24 +09:00
Mark Rutland	dc55daff90	kcov: prefault the kcov_area On many architectures the vmalloc area is lazily faulted in upon first access. This is problematic for KCOV, as __sanitizer_cov_trace_pc accesses the (vmalloc'd) kcov_area, and fault handling code may be instrumented. If an access to kcov_area faults, this will result in mutual recursion through the fault handling code and __sanitizer_cov_trace_pc(), eventually leading to stack corruption and/or overflow. We can avoid this by faulting in the kcov_area before __sanitizer_cov_trace_pc() is permitted to access it. Once it has been faulted in, it will remain present in the process page tables, and will not fault again. [akpm@linux-foundation.org: code cleanup] [akpm@linux-foundation.org: add comment explaining kcov_fault_in_area()] [akpm@linux-foundation.org: fancier code comment from Mark] Link: http://lkml.kernel.org/r/20180504135535.53744-3-mark.rutland@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:55:24 +09:00
Mark Rutland	c9484b986e	kcov: ensure irq code sees a valid area Patch series "kcov: fix unexpected faults". These patches fix a few issues where KCOV code could trigger recursive faults, discovered while debugging a patch enabling KCOV for arch/arm: * On CONFIG_PREEMPT kernels, there's a small race window where __sanitizer_cov_trace_pc() can see a bogus kcov_area. * Lazy faulting of the vmalloc area can cause mutual recursion between fault handling code and __sanitizer_cov_trace_pc(). * During the context switch, switching the mm can cause the kcov_area to be transiently unmapped. These are prerequisites for enabling KCOV on arm, but the issues themsevles are generic -- we just happen to avoid them by chance rather than design on x86-64 and arm64. This patch (of 3): For kernels built with CONFIG_PREEMPT, some C code may execute before or after the interrupt handler, while the hardirq count is zero. In these cases, in_task() can return true. A task can be interrupted in the middle of a KCOV_DISABLE ioctl while it resets the task's kcov data via kcov_task_init(). Instrumented code executed during this period will call __sanitizer_cov_trace_pc(), and as in_task() returns true, will inspect t->kcov_mode before trying to write to t->kcov_area. In kcov_init_task() we update t->kcov_{mode,area,size} with plain stores, which may be re-ordered, torn, etc. Thus __sanitizer_cov_trace_pc() may see bogus values for any of these fields, and may attempt to write to memory which is not mapped. Let's avoid this by using WRITE_ONCE() to set t->kcov_mode, with a barrier() to ensure this is ordered before we clear t->kov_{area,size}. This ensures that any code execute while kcov_init_task() is preempted will either see valid values for t->kcov_{area,size}, or will see that t->kcov_mode is KCOV_MODE_DISABLED, and bail out without touching t->kcov_area. Link: http://lkml.kernel.org/r/20180504135535.53744-2-mark.rutland@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:55:24 +09:00
Souptick Joarder	3fb3894b84	kernel/relay.c: change return type to vm_fault_t Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit `1c8f422059` ("mm: change return type to vm_fault_t") Link: http://lkml.kernel.org/r/20180510140335.GA25363@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:55:24 +09:00
Tetsuo Handa	655c79bb40	mm: check for SIGKILL inside dup_mmap() loop As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas can loop while potentially allocating memory, with mm->mmap_sem held for write by current thread. This is bad if current thread was selected as an OOM victim, for current thread will continue allocations using memory reserves while OOM reaper is unable to reclaim memory. As an actually observable problem, it is not difficult to make OOM reaper unable to reclaim memory if the OOM victim is blocked at i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain whether it is safe to use killable wait there, let's check for SIGKILL before trying to allocate memory. Even without an OOM event, there is no point with continuing the loop from the beginning if current thread is killed. I tested with debug printk(). This patch should be safe because we already fail if security_vm_enough_memory_mm() or kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it. *** Aborting dup_mmap() due to SIGKILL * * Aborting dup_mmap() due to SIGKILL * * Aborting dup_mmap() due to SIGKILL * * Aborting dup_mmap() due to SIGKILL * * Aborting exit_mmap() due to NULL mmap *** [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:55:24 +09:00
Jarrett Farnitano	a8311f647e	kexec: yield to scheduler when loading kimage segments Without yielding while loading kimage segments, a large initrd will block all other work on the CPU performing the load until it is completed. For example loading an initrd of 200MB on a low power single core system will lock up the system for a few seconds. To increase system responsiveness to other tasks at that time, call cond_resched() in both the crash kernel and normal kernel segment loading loops. I did run into a practical problem. Hardware watchdogs on embedded systems can have short timers on the order of seconds. If the system is locked up for a few seconds with only a single core available, the watchdog may not be pet in a timely fashion. If this happens, the hardware watchdog will fire and reset the system. This really only becomes a problem when you are working with a single core, a decently sized initrd, and have a constrained hardware watchdog. Link: http://lkml.kernel.org/r/1528738546-3328-1-git-send-email-jmf@amazon.com Signed-off-by: Jarrett Farnitano <jmf@amazon.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:55:24 +09:00
Masahiro Yamada	a0f8c29706	kconfig: tinyconfig: remove stale stack protector fixups Prior to commit `2a61f4747e` ("stack-protector: test compiler capability in Kconfig and drop AUTO mode"), the stack protector was configured by the choice of NONE, REGULAR, STRONG, AUTO. tiny.config needed to explicitly set NONE because the default value of choice, AUTO, did not produce the tiniest kernel. Now that there are only two boolean symbols, STACKPROTECTOR and STACKPROTECTOR_STRONG, they are naturally disabled by "make allnoconfig", which "make tinyconfig" is based on. Remove unnecessary lines from the tiny.config fragment file. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:15:28 +09:00
Christoph Hellwig	cf65a0f6f6	dma-mapping: move all DMA mapping code to kernel/dma Currently the code is split over various files with dma- prefixes in the lib/ and drives/base directories, and the number of files keeps growing. Move them into a single directory to keep the code together and remove the file name prefixes. To match the irq infrastructure this directory is placed under the kernel/ directory. Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-06-14 08:50:37 +02:00
Linus Torvalds	050e9baa9d	Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables The changes to automatically test for working stack protector compiler support in the Kconfig files removed the special STACKPROTECTOR_AUTO option that picked the strongest stack protector that the compiler supported. That was all a nice cleanup - it makes no sense to have the AUTO case now that the Kconfig phase can just determine the compiler support directly. HOWEVER. It also meant that doing "make oldconfig" would now _disable_ the strong stackprotector if you had AUTO enabled, because in a legacy config file, the sane stack protector configuration would look like CONFIG_HAVE_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_STACKPROTECTOR_AUTO=y and when you ran this through "make oldconfig" with the Kbuild changes, it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version used to be disabled (because it was really enabled by AUTO), and would disable it in the new config, resulting in: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_HAS_SANE_STACKPROTECTOR=y That's dangerously subtle - people could suddenly find themselves with the weaker stack protector setup without even realizing. The solution here is to just rename not just the old RECULAR stack protector option, but also the strong one. This does that by just removing the CC_ prefix entirely for the user choices, because it really is not about the compiler support (the compiler support now instead automatially impacts _visibility_ of the options to users). This results in "make oldconfig" actually asking the user for their choice, so that we don't have any silent subtle security model changes. The end result would generally look like this: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_STACKPROTECTOR=y CONFIG_STACKPROTECTOR_STRONG=y CONFIG_CC_HAS_SANE_STACKPROTECTOR=y where the "CC_" versions really are about internal compiler infrastructure, not the user selections. Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-14 12:21:18 +09:00
Linus Torvalds	be779f03d5	Kbuild updates for v4.18 (2nd) - fix some bugs introduced by the recent Kconfig syntax extension - add some symbols about compiler information in Kconfig, such as CC_IS_GCC, CC_IS_CLANG, GCC_VERSION, etc. - test compiler capability for the stack protector in Kconfig, and clean-up Makefile - test compiler capability for GCC-plugins in Kconfig, and clean-up Makefile - allow to enable GCC-plugins for COMPILE_TEST - test compiler capability for KCOV in Kconfig and correct dependency - remove auto-detect mode of the GCOV format, which is now more nicely handled in Kconfig - test compiler capability for mprofile-kernel on PowerPC, and clean-up Makefile - misc cleanups -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJbISvEAAoJED2LAQed4NsGEsoQAKBHMqUM9yQo0LdVMnDMCLQI Xsjyqzr0ySp6YiuF+cobwDs49sggt7/8EX+OnrP/sLlAhY0QrNGI1ulhwpFx1Ewa xFxz5kF/1jDwC+AjngXcK5Dr9nGSSMfT3wQhLGKjMkKSypbz2QyTrfMOfHGYSzU1 gD8RMWYXxKoJFmIaqmpLz7PDfWKPzhSOZo7BflPjAGXdlpfSV9cQvu+TkJ12qvSp KZ2uHUgLz95NnltSuGtN71X8so7w4eTYAvkJ5bOeOpYsZSVYRq4Exvwe0Y0dbwie WDpcRC5KrQOlIFxRUUSGn5cDsaW9yYJJAwMG6Dr8qJ66QlgY5GqOKXxXX+ARa7WU 7GkeAZ11n5dArjjdSjfClh8CwDiZNpJmAUbahm+feQfUfq9nbs+0JX6bOG5ZE+nt 3iE0ZoSGDjxD5Pjy4u+NtQM0JCpieuz3JNxqVbAVm0Ua5q8niwSEneixyrNmjkBF 1tV+qsMYus7AFwdGuDRXzBhVY7hd931H34czA3FUZZqwcClFVoJiygI++s62mVXx w9kYi8Ades/W6dt7c7XGjmqYTDgnTolLaYY5vggpEeLOzc1QPW6iKt9tpREi6Zzm n+y586YsIo0vjTMfRcfmGZUPG3CJeqL2UDslYmG8PgMQ6/eaAHBDXECLrAkGGPlG aIPZcMam5BQxhmSJc19c =VABv -----END PGP SIGNATURE----- Merge tag 'kbuild-v4.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - fix some bugs introduced by the recent Kconfig syntax extension - add some symbols about compiler information in Kconfig, such as CC_IS_GCC, CC_IS_CLANG, GCC_VERSION, etc. - test compiler capability for the stack protector in Kconfig, and clean-up Makefile - test compiler capability for GCC-plugins in Kconfig, and clean-up Makefile - allow to enable GCC-plugins for COMPILE_TEST - test compiler capability for KCOV in Kconfig and correct dependency - remove auto-detect mode of the GCOV format, which is now more nicely handled in Kconfig - test compiler capability for mprofile-kernel on PowerPC, and clean-up Makefile - misc cleanups * tag 'kbuild-v4.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: linux/linkage.h: replace VMLINUX_SYMBOL_STR() with __stringify() kconfig: fix localmodconfig sh: remove no-op macro VMLINUX_SYMBOL() powerpc/kbuild: move -mprofile-kernel check to Kconfig Documentation: kconfig: add recommended way to describe compiler support gcc-plugins: disable GCC_PLUGIN_STRUCTLEAK_BYREF_ALL for COMPILE_TEST gcc-plugins: allow to enable GCC_PLUGINS for COMPILE_TEST gcc-plugins: test plugin support in Kconfig and clean up Makefile gcc-plugins: move GCC version check for PowerPC to Kconfig kcov: test compiler capability in Kconfig and correct dependency gcov: remove CONFIG_GCOV_FORMAT_AUTODETECT arm64: move GCC version check for ARCH_SUPPORTS_INT128 to Kconfig kconfig: add CC_IS_CLANG and CLANG_VERSION kconfig: add CC_IS_GCC and GCC_VERSION stack-protector: test compiler capability in Kconfig and drop AUTO mode kbuild: fix endless syncconfig in case arch Makefile sets CROSS_COMPILE	2018-06-13 08:40:34 -07:00
Kees Cook	fad953ce0b	treewide: Use array_size() in vzalloc() The vzalloc() function has no 2-factor argument form, so multiplication factors need to be wrapped in array_size(). This patch replaces cases of: vzalloc(a * b) with: vzalloc(array_size(a, b)) as well as handling cases of: vzalloc(a * b * c) with: vzalloc(array3_size(a, b, c)) This does, however, attempt to ignore constant size factors like: vzalloc(4 * 1024) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( vzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| vzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( vzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) \| vzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) \| vzalloc( - sizeof(char) * (COUNT) + COUNT , ...) \| vzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| vzalloc( - sizeof(u8) * COUNT + COUNT , ...) \| vzalloc( - sizeof(__u8) * COUNT + COUNT , ...) \| vzalloc( - sizeof(char) * COUNT + COUNT , ...) \| vzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( vzalloc( - sizeof(TYPE) * (COUNT_ID) + array_size(COUNT_ID, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(TYPE) * COUNT_ID + array_size(COUNT_ID, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(TYPE) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(TYPE) * COUNT_CONST + array_size(COUNT_CONST, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(THING) * (COUNT_ID) + array_size(COUNT_ID, sizeof(THING)) , ...) \| vzalloc( - sizeof(THING) * COUNT_ID + array_size(COUNT_ID, sizeof(THING)) , ...) \| vzalloc( - sizeof(THING) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(THING)) , ...) \| vzalloc( - sizeof(THING) * COUNT_CONST + array_size(COUNT_CONST, sizeof(THING)) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ vzalloc( - SIZE * COUNT + array_size(COUNT, SIZE) , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( vzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| vzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| vzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| vzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( vzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| vzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| vzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| vzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( vzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| vzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| vzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| vzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( vzalloc(C1 * C2 * C3, ...) \| vzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants. @@ expression E1, E2; constant C1, C2; @@ ( vzalloc(C1 * C2, ...) \| vzalloc( - E1 * E2 + array_size(E1, E2) , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Kees Cook	42bc47b353	treewide: Use array_size() in vmalloc() The vmalloc() function has no 2-factor argument form, so multiplication factors need to be wrapped in array_size(). This patch replaces cases of: vmalloc(a * b) with: vmalloc(array_size(a, b)) as well as handling cases of: vmalloc(a * b * c) with: vmalloc(array3_size(a, b, c)) This does, however, attempt to ignore constant size factors like: vmalloc(4 * 1024) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( vmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| vmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( vmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) \| vmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) \| vmalloc( - sizeof(char) * (COUNT) + COUNT , ...) \| vmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| vmalloc( - sizeof(u8) * COUNT + COUNT , ...) \| vmalloc( - sizeof(__u8) * COUNT + COUNT , ...) \| vmalloc( - sizeof(char) * COUNT + COUNT , ...) \| vmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( vmalloc( - sizeof(TYPE) * (COUNT_ID) + array_size(COUNT_ID, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(TYPE) * COUNT_ID + array_size(COUNT_ID, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(TYPE) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(TYPE) * COUNT_CONST + array_size(COUNT_CONST, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(THING) * (COUNT_ID) + array_size(COUNT_ID, sizeof(THING)) , ...) \| vmalloc( - sizeof(THING) * COUNT_ID + array_size(COUNT_ID, sizeof(THING)) , ...) \| vmalloc( - sizeof(THING) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(THING)) , ...) \| vmalloc( - sizeof(THING) * COUNT_CONST + array_size(COUNT_CONST, sizeof(THING)) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ vmalloc( - SIZE * COUNT + array_size(COUNT, SIZE) , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( vmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| vmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| vmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| vmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| vmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( vmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| vmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| vmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| vmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| vmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| vmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( vmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| vmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| vmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| vmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| vmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( vmalloc(C1 * C2 * C3, ...) \| vmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants. @@ expression E1, E2; constant C1, C2; @@ ( vmalloc(C1 * C2, ...) \| vmalloc( - E1 * E2 + array_size(E1, E2) , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Kees Cook	778e1cdd81	treewide: kvzalloc() -> kvcalloc() The kvzalloc() function has a 2-factor argument form, kvcalloc(). This patch replaces cases of: kvzalloc(a * b, gfp) with: kvcalloc(a * b, gfp) as well as handling cases of: kvzalloc(a * b * c, gfp) with: kvzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kvcalloc(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kvzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kvzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| kvzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kvzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) \| kvzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) \| kvzalloc( - sizeof(char) * (COUNT) + COUNT , ...) \| kvzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| kvzalloc( - sizeof(u8) * COUNT + COUNT , ...) \| kvzalloc( - sizeof(__u8) * COUNT + COUNT , ...) \| kvzalloc( - sizeof(char) * COUNT + COUNT , ...) \| kvzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kvzalloc + kvcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) \| - kvzalloc + kvcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) \| - kvzalloc + kvcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) \| - kvzalloc + kvcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) \| - kvzalloc + kvcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) \| - kvzalloc + kvcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) \| - kvzalloc + kvcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) \| - kvzalloc + kvcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kvzalloc + kvcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kvzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kvzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kvzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kvzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kvzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kvzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kvzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kvzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kvzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kvzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kvzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kvzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kvzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kvzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kvzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kvzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kvzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kvzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kvzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kvzalloc(C1 * C2 * C3, ...) \| kvzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) \| kvzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) \| kvzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) \| kvzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kvzalloc(sizeof(THING) * C2, ...) \| kvzalloc(sizeof(TYPE) * C2, ...) \| kvzalloc(C1 * C2 * C3, ...) \| kvzalloc(C1 * C2, ...) \| - kvzalloc + kvcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) \| - kvzalloc + kvcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) \| - kvzalloc + kvcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) \| - kvzalloc + kvcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) \| - kvzalloc + kvcalloc ( - (E1) * E2 + E1, E2 , ...) \| - kvzalloc + kvcalloc ( - (E1) * (E2) + E1, E2 , ...) \| - kvzalloc + kvcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Kees Cook	590b5b7d86	treewide: kzalloc_node() -> kcalloc_node() The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This patch replaces cases of: kzalloc_node(a * b, gfp, node) with: kcalloc_node(a * b, gfp, node) as well as handling cases of: kzalloc_node(a * b * c, gfp, node) with: kzalloc_node(array3_size(a, b, c), gfp, node) as it's slightly less ugly than: kcalloc_node(array_size(a, b), c, gfp, node) This does, however, attempt to ignore constant size factors like: kzalloc_node(4 * 1024, gfp, node) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc_node( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| kzalloc_node( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc_node( - sizeof(u8) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(__u8) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(char) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(u8) * COUNT + COUNT , ...) \| kzalloc_node( - sizeof(__u8) * COUNT + COUNT , ...) \| kzalloc_node( - sizeof(char) * COUNT + COUNT , ...) \| kzalloc_node( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc_node + kcalloc_node ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc_node( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc_node( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc_node( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc_node( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc_node( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc_node( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kzalloc_node( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc_node( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc_node(C1 * C2 * C3, ...) \| kzalloc_node( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc_node( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc_node( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) \| kzalloc_node( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc_node(sizeof(THING) * C2, ...) \| kzalloc_node(sizeof(TYPE) * C2, ...) \| kzalloc_node(C1 * C2 * C3, ...) \| kzalloc_node(C1 * C2, ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - (E1) * E2 + E1, E2 , ...) \| - kzalloc_node + kcalloc_node ( - (E1) * (E2) + E1, E2 , ...) \| - kzalloc_node + kcalloc_node ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Kees Cook	6396bb2215	treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(u8) * COUNT + COUNT , ...) \| kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) \| kzalloc( - sizeof(char) * COUNT + COUNT , ...) \| kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) \| kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) \| kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) \| kzalloc(sizeof(TYPE) * C2, ...) \| kzalloc(C1 * C2 * C3, ...) \| kzalloc(C1 * C2, ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) \| - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) \| - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Kees Cook	6da2ec5605	treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(u8) * COUNT + COUNT , ...) \| kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) \| kmalloc( - sizeof(char) * COUNT + COUNT , ...) \| kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) \| kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) \| kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) \| kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) \| kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) \| kmalloc(sizeof(TYPE) * C2, ...) \| kmalloc(C1 * C2 * C3, ...) \| kmalloc(C1 * C2, ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) \| - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) \| - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Sebastian Andrzej Siewior	c60c32a577	posix-cpu-timers: Remove lockdep_assert_irqs_disabled() The lockdep_assert_irqs_disabled() was a BUG_ON() statement in the beginning and it was added just before the "spin_lock(siglock)" statement to ensure this lock was taken with disabled interrupts. This is no longer the case: the siglock is acquired via lock_task_sighand() and this function already disables the interrupts. The lock is also acquired before this "lockdep_assert_irqs_disabled" so it is best to remove it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r20180504152548.7166-1-bigeasy@linutronix.de	2018-06-12 17:18:37 +02:00
David S. Miller	0ca69d1399	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2018-06-12 The following pull-request contains BPF updates for your net tree. The main changes are: 1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the umem setup, from Björn. 2) Silence a warning in bpf fs when an application tries to open(2) a pinned bpf obj due to missing fops. Add a dummy open fop that continues to just bail out in such case, from Daniel. 3) Fix a BPF selftest urandom_read build issue where gcc complains that it gets built twice, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-06-11 17:37:03 -07:00
Linus Torvalds	f0dc7f9c6d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix several bpfilter/UMH bugs, in particular make the UMH build not depend upon X86 specific Kconfig symbols. From Alexei Starovoitov. 2) Fix handling of modified context pointer in bpf verifier, from Daniel Borkmann. 3) Kill regression in ifdown/ifup sequences for hv_netvsc driver, from Dexuan Cui. 4) When the bonding primary member name changes, we have to re-evaluate the bond->force_primary setting, from Xiangning Yu. 5) Eliminate possible padding beyone end of SKB in cdc_ncm driver, from Bjørn Mork. 6) RX queue length reported for UDP sockets in procfs and socket diag are inaccurate, from Paolo Abeni. 7) Fix br_fdb_find_port() locking, from Petr Machata. 8) Limit sk_rcvlowat values properly in TCP, from Soheil Hassas Yeganeh. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits) tcp: limit sk_rcvlowat by the maximum receive buffer net: phy: dp83822: use BMCR_ANENABLE instead of BMSR_ANEGCAPABLE for DP83620 socket: close race condition between sock_close() and sockfs_setattr() net: bridge: Fix locking in br_fdb_find_port() udp: fix rx queue len reported by diag and proc interface cdc_ncm: avoid padding beyond end of skb net/sched: act_simple: fix parsing of TCA_DEF_DATA net: fddi: fix a possible null-ptr-deref net: aquantia: fix unsigned numvecs comparison with less than zero net: stmmac: fix build failure due to missing COMMON_CLK dependency bpfilter: fix race in pipe access bpf, xdp: fix crash in xdp_umem_unaccount_pages xsk: Fix umem fill/completion queue mmap on 32-bit tools/bpf: fix selftest get_cgroup_id_user bpfilter: fix OUTPUT_FORMAT umh: fix race condition net: mscc: ocelot: Fix uninitialized error in ocelot_netdevice_event() bonding: re-evaluate force_primary when the primary slave name changes ip_tunnel: Fix name string concatenate in __ip_tunnel_create() hv_netvsc: Fix a network regression after ifdown/ifup ...	2018-06-10 19:25:23 -07:00
Linus Torvalds	5f85942c2e	SCSI misc on 20180610 This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc, xfcp, hisi_sas, cxlflash, qla2xxx. In the absence of Nic, we're also taking target updates which are mostly minor except for the tcmu refactor. The only real core change to worry about is the removal of high page bouncing (in sas, storvsc and iscsi). This has been well tested and no problems have shown up so far. Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWx1pbCYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishUucAP42pccS ziKyiOizuxv9fZ4Q+nXd1A9zhI5tqqpkHjcQegEA40qiZSi3EKGKR8W0UpX7Ntmo tqrZJGojx9lnrAM2RbQ= =NMXg -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc, xfcp, hisi_sas, cxlflash, qla2xxx. In the absence of Nic, we're also taking target updates which are mostly minor except for the tcmu refactor. The only real core change to worry about is the removal of high page bouncing (in sas, storvsc and iscsi). This has been well tested and no problems have shown up so far" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (268 commits) scsi: lpfc: update driver version to 12.0.0.4 scsi: lpfc: Fix port initialization failure. scsi: lpfc: Fix 16gb hbas failing cq create. scsi: lpfc: Fix crash in blk_mq layer when executing modprobe -r lpfc scsi: lpfc: correct oversubscription of nvme io requests for an adapter scsi: lpfc: Fix MDS diagnostics failure (Rx < Tx) scsi: hisi_sas: Mark PHY as in reset for nexus reset scsi: hisi_sas: Fix return value when get_free_slot() failed scsi: hisi_sas: Terminate STP reject quickly for v2 hw scsi: hisi_sas: Add v2 hw force PHY function for internal ATA command scsi: hisi_sas: Include TMF elements in struct hisi_sas_slot scsi: hisi_sas: Try wait commands before before controller reset scsi: hisi_sas: Init disks after controller reset scsi: hisi_sas: Create a scsi_host_template per HW module scsi: hisi_sas: Reset disks when discovered scsi: hisi_sas: Add LED feature for v3 hw scsi: hisi_sas: Change common allocation mode of device id scsi: hisi_sas: change slot index allocation mode scsi: hisi_sas: Introduce hisi_sas_phy_set_linkrate() scsi: hisi_sas: fix a typo in hisi_sas_task_prep() ...	2018-06-10 13:01:12 -07:00
Linus Torvalds	d82991a868	Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull restartable sequence support from Thomas Gleixner: "The restartable sequences syscall (finally): After a lot of back and forth discussion and massive delays caused by the speculative distraction of maintainers, the core set of restartable sequences has finally reached a consensus. It comes with the basic non disputed core implementation along with support for arm, powerpc and x86 and a full set of selftests It was exposed to linux-next earlier this week, so it does not fully comply with the merge window requirements, but there is really no point to drag it out for yet another cycle" * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: Provide Makefile, scripts, gitignore rseq/selftests: Provide parametrized tests rseq/selftests: Provide basic percpu ops test rseq/selftests: Provide basic test rseq/selftests: Provide rseq library selftests/lib.mk: Introduce OVERRIDE_TARGETS powerpc: Wire up restartable sequences system call powerpc: Add syscall detection for restartable sequences powerpc: Add support for restartable sequences x86: Wire up restartable sequence system call x86: Add support for restartable sequences arm: Wire up restartable sequences system call arm: Add syscall detection for restartable sequences arm: Add restartable sequences support rseq: Introduce restartable sequences system call uapi/headers: Provide types_32_64.h	2018-06-10 10:17:09 -07:00
Linus Torvalds	f4e5b30d80	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 updates and fixes from Thomas Gleixner: - Fix the (late) fallout from the vector management rework causing hlist corruption and irq descriptor reference leaks caused by a missing sanity check. The straight forward fix triggered another long standing issue to surface. The pre rework code hid the issue due to being way slower, but now the chance that user space sees an EBUSY error return when updating irq affinities is way higher, though quite a bunch of userspace tools do not handle it properly despite the fact that EBUSY could be returned for at least 10 years. It turned out that the EBUSY return can be avoided completely by utilizing the existing delayed affinity update mechanism for irq remapped scenarios as well. That's a bit more error handling in the kernel, but avoids fruitless fingerpointing discussions with tool developers. - Decouple PHYSICAL_MASK from AMD SME as its going to be required for the upcoming Intel memory encryption support as well. - Handle legacy device ACPI detection properly for newer platforms - Fix the wrong argument ordering in the vector allocation tracepoint - Simplify the IDT setup code for the APIC=n case - Use the proper string helpers in the MTRR code - Remove a stale unused VDSO source file - Convert the microcode update lock to a raw spinlock as its used in atomic context. * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/intel_rdt: Enable CMT and MBM on new Skylake stepping x86/apic/vector: Print APIC control bits in debugfs genirq/affinity: Defer affinity setting if irq chip is busy x86/platform/uv: Use apic_ack_irq() x86/ioapic: Use apic_ack_irq() irq_remapping: Use apic_ack_irq() x86/apic: Provide apic_ack_irq() genirq/migration: Avoid out of line call if pending is not set genirq/generic_pending: Do not lose pending affinity update x86/apic/vector: Prevent hlist corruption and leaks x86/vector: Fix the args of vector_alloc tracepoint x86/idt: Simplify the idt_setup_apic_and_irq_gates() x86/platform/uv: Remove extra parentheses x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME x86: Mark native_set_p4d() as __always_inline x86/microcode: Make the late update update_lock a raw lock for RT x86/mtrr: Convert to use strncpy_from_user() helper x86/mtrr: Convert to use match_string() helper x86/vdso: Remove unused file x86/i8237: Register device based on FADT legacy boot flag	2018-06-10 09:44:53 -07:00
Linus Torvalds	a8a4021b77	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Thomas Gleixner: "A small set of core updates: - Make objtool cope with GCC8 oddities some more - Remove a stale local_irq_save/restore sequence in the signal code along with the stale comment in the RCU code. The underlying issue which led to this has been solved long time ago, but nobody cared to cleanup the hackarounds" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: signal: Remove no longer required irqsave/restore rcu: Update documentation of rcu_read_unlock() objtool: Fix GCC 8 cold subfunction detection for aliased functions	2018-06-10 08:30:35 -07:00
Anna-Maria Gleixner	59dc6f3c6d	signal: Remove no longer required irqsave/restore Commit `a841796f11` ("signal: align __lock_task_sighand() irq disabling and RCU") introduced a rcu read side critical section with interrupts disabled. The changelog suggested that a better long-term fix would be "to make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's ->wait_lock". This long-term fix has been made in commit `b4abf91047` ("rtmutex: Make wait_lock irq safe") for a different reason. Therefore revert commit `a841796f11` ("signal: align > __lock_task_sighand() irq disabling and RCU") as the interrupt disable dance is not longer required. The change was tested on the base of `b4abf91047` ("rtmutex: Make wait_lock irq safe") with a four hour run of rcutorture scenario TREE03 with lockdep enabled as suggested by Paul McKenney. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: bigeasy@linutronix.de Link: https://lkml.kernel.org/r/20180525090507.22248-3-anna-maria@linutronix.de	2018-06-10 06:14:01 +02:00
Linus Torvalds	d6c7528447	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide Pull IDE updates from David Miller: "Primarily IRQ disabling avoidance changes from Sebastian Andrzej Siewior" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide: ide: don't enable/disable interrupts in force threaded-IRQ mode ide: don't disable interrupts during kmap_atomic() ide: Handle irq disabling consistently alim15x3: move irq-restore before pci_dev_put()	2018-06-09 11:10:16 -07:00
Linus Torvalds	7d3bf613e9	libnvdimm for 4.18 * DAX broke a fundamental assumption of truncate of file mapped pages. The truncate path assumed that it is safe to disconnect a pinned page from a file and let the filesystem reclaim the physical block. With DAX the page is equivalent to the filesystem block. Introduce dax_layout_busy_page() to enable filesystems to wait for pinned DAX pages to be released. Without this wait a filesystem could allocate blocks under active device-DMA to a new file. * DAX arranges for the block layer to be bypassed and uses dax_direct_access() + copy_to_iter() to satisfy read(2) calls. However, the memcpy_mcsafe() facility is available through the pmem block driver. In order to safely handle media errors, via the DAX block-layer bypass, introduce copy_to_iter_mcsafe(). * Fix cache management policy relative to the ACPI NFIT Platform Capabilities Structure to properly elide cache flushes when they are not necessary. The table indicates whether CPU caches are power-fail protected. Clarify that a deep flush is always performed on REQ_{FUA,PREFLUSH} requests. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE 6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3 LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/ fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq 2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8 4GBvt9ri9i/Ww/A+ppWs =4bfw -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "This adds a user for the new 'bytes-remaining' updates to memcpy_mcsafe() that you already received through Ingo via the x86-dax- for-linus pull. Not included here, but still targeting this cycle, is support for handling memory media errors (poison) consumed via userspace dax mappings. Summary: - DAX broke a fundamental assumption of truncate of file mapped pages. The truncate path assumed that it is safe to disconnect a pinned page from a file and let the filesystem reclaim the physical block. With DAX the page is equivalent to the filesystem block. Introduce dax_layout_busy_page() to enable filesystems to wait for pinned DAX pages to be released. Without this wait a filesystem could allocate blocks under active device-DMA to a new file. - DAX arranges for the block layer to be bypassed and uses dax_direct_access() + copy_to_iter() to satisfy read(2) calls. However, the memcpy_mcsafe() facility is available through the pmem block driver. In order to safely handle media errors, via the DAX block-layer bypass, introduce copy_to_iter_mcsafe(). - Fix cache management policy relative to the ACPI NFIT Platform Capabilities Structure to properly elide cache flushes when they are not necessary. The table indicates whether CPU caches are power-fail protected. Clarify that a deep flush is always performed on REQ_{FUA,PREFLUSH} requests" * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits) dax: Use dax_write_cache* helpers libnvdimm, pmem: Do not flush power-fail protected CPU caches libnvdimm, pmem: Unconditionally deep flush on *sync libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH acpi, nfit: Remove ecc_unit_size dax: dax_insert_mapping_entry always succeeds libnvdimm, e820: Register all pmem resources libnvdimm: Debug probe times linvdimm, pmem: Preserve read-only setting for pmem devices x86, nfit_test: Add unit test for memcpy_mcsafe() pmem: Switch to copy_to_iter_mcsafe() dax: Report bytes remaining in dax_iomap_actor() dax: Introduce a ->copy_to_iter dax operation uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation xfs, dax: introduce xfs_break_dax_layouts() xfs: prepare xfs_break_layouts() for another layout type xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL mm, fs, dax: handle layout changes to pinned dax mappings mm: fix __gup_device_huge vs unmap mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS ...	2018-06-08 17:21:52 -07:00
Dan Williams	b56845794e	Merge branch 'for-4.18/dax' into libnvdimm-for-next	2018-06-08 15:16:40 -07:00
Daniel Borkmann	b165585795	bpf: implement dummy fops for bpf objects syzkaller was able to trigger the following warning in do_dentry_open(): WARNING: CPU: 1 PID: 4508 at fs/open.c:778 do_dentry_open+0x4ad/0xe40 fs/open.c:778 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 4508 Comm: syz-executor867 Not tainted 4.17.0+ #90 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: [...] vfs_open+0x139/0x230 fs/open.c:908 do_last fs/namei.c:3370 [inline] path_openat+0x1717/0x4dc0 fs/namei.c:3511 do_filp_open+0x249/0x350 fs/namei.c:3545 do_sys_open+0x56f/0x740 fs/open.c:1101 __do_sys_openat fs/open.c:1128 [inline] __se_sys_openat fs/open.c:1122 [inline] __x64_sys_openat+0x9d/0x100 fs/open.c:1122 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe Problem was that prog and map inodes in bpf fs did not implement a dummy file open operation that would return an error. The patch in do_dentry_open() checks whether f_ops are present and if not bails out with an error. While this may be fine, we really shouldn't be throwing a warning though. Thus follow the model similar to bad_file_ops and reject the request unconditionally with -EIO. Fixes: `b2197755b2` ("bpf: add support for persistent maps/progs") Reported-by: syzbot+2e7fcab0f56fdbb330b8@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-06-08 10:58:48 -07:00
Masahiro Yamada	6a61b70b43	gcov: remove CONFIG_GCOV_FORMAT_AUTODETECT CONFIG_GCOV_FORMAT_AUTODETECT compiles either gcc_3_4.c or gcc_4_7.c according to your GCC version. We can achieve the equivalent behavior by setting reasonable dependency with the knowledge of the compiler version. If GCC older than 4.7 is used, GCOV_FORMAT_3_4 is the default, but users are still allowed to select GCOV_FORMAT_4_7 in case the newer format is back-ported. On the other hand, If GCC 4.7 or newer is used, there is no reason to use GCOV_FORMAT_3_4, so it should be hidden. If you downgrade the compiler to GCC 4.7 or older, oldconfig/syncconfig will display a prompt for the choice because GCOV_FORMAT_3_4 becomes visible as a new symbol. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Reviewed-by: Kees Cook <keescook@chromium.org>	2018-06-08 18:56:02 +09:00
Tetsuo Handa	401c636a0e	kernel/hung_task.c: show all hung tasks before panic When we get a hung task it can often be valuable to see _all_ the hung tasks on the system before calling panic(). Quoting from https://syzkaller.appspot.com/text?tag=CrashReport&id=5316056503549952 ---------------------------------------- INFO: task syz-executor0:6540 blocked for more than 120 seconds. Not tainted 4.16.0+ #13 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor0 D23560 6540 4521 0x80000004 Call Trace: context_switch kernel/sched/core.c:2848 [inline] __schedule+0x8fb/0x1ef0 kernel/sched/core.c:3490 schedule+0xf5/0x430 kernel/sched/core.c:3549 schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607 __mutex_lock_common kernel/locking/mutex.c:833 [inline] __mutex_lock+0xb7f/0x1810 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 __blkdev_driver_ioctl block/ioctl.c:303 [inline] blkdev_ioctl+0x1759/0x1e00 block/ioctl.c:601 ioctl_by_bdev+0xa5/0x110 fs/block_dev.c:2060 isofs_get_last_session fs/isofs/inode.c:567 [inline] isofs_fill_super+0x2ba9/0x3bc0 fs/isofs/inode.c:660 mount_bdev+0x2b7/0x370 fs/super.c:1119 isofs_mount+0x34/0x40 fs/isofs/inode.c:1560 mount_fs+0x66/0x2d0 fs/super.c:1222 vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:2514 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0xea4/0x2b90 fs/namespace.c:2847 ksys_mount+0xab/0x120 fs/namespace.c:3063 SYSC_mount fs/namespace.c:3077 [inline] SyS_mount+0x39/0x50 fs/namespace.c:3074 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 (...snipped...) Showing all locks held in the system: (...snipped...) 2 locks held by syz-executor0/6540: #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: alloc_super fs/super.c:211 [inline] #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: sget_userns+0x3b2/0xe60 fs/super.c:502 /* down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); / #1: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 / mutex_lock_nested(&lo->lo_ctl_mutex, 1); / (...snipped...) 3 locks held by syz-executor7/6541: #0: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 / mutex_lock_nested(&lo->lo_ctl_mutex, 1); / #1: 000000007bf3d3f9 (&bdev->bd_mutex){+.+.}, at: blkdev_reread_part+0x1e/0x40 block/ioctl.c:192 #2: 00000000566d4c39 (&type->s_umount_key#50){.+.+}, at: __get_super.part.10+0x1d3/0x280 fs/super.c:663 / down_read(&sb->s_umount); */ ---------------------------------------- When reporting an AB-BA deadlock like shown above, it would be nice if trace of PID=6541 is printed as well as trace of PID=6540 before calling panic(). Showing hung tasks up to /proc/sys/kernel/hung_task_warnings could delay calling panic() but normally there should not be so many hung tasks. Link: http://lkml.kernel.org/r/201804050705.BHE57833.HVFOFtSOMQJFOL@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-07 17:34:39 -07:00
Matthew Wilcox	6e292b9be7	mm: split page_type out from _mapcount We're already using a union of many fields here, so stop abusing the _mapcount and make page_type its own field. That implies renaming some of the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring back the PG_buddy, PG_balloon and PG_kmemcg names. As suggested by Kirill, make page_type a bitmask. Because it starts out life as -1 (thanks to sharing the storage with _mapcount), setting a page flag means clearing the appropriate bit. This gives us space for probably twenty or so extra bits (depending how paranoid we want to be about _mapcount underflow). Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-07 17:34:37 -07:00
Yang Shi	88aa7cc688	mm: introduce arg_lock to protect arg_start\|end and env_start\|end in mm_struct mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start\|end and evn_start\|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start\|end, env_start\|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-07 17:34:34 -07:00

... 5 6 7 8 9 ...

28363 Commits