mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2024-12-28 11:18:45 +07:00
d7822b1e24
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

              getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:            344.0                31.4          11.0
    x86-64:            15.3                 2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
  per-cpu buffer

              getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:           2502.0              2250.0           1.1
    x86-64:           117.4                98.0           1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

              getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:            751.0               128.5           5.8
    x86-64:            53.4                28.6           1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                           8.4 ns
- Read CPU from rseq cpu_id:                      16.7 ns
- Read CPU from rseq cpu_id (lazy register):      19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                  301.8 ns
- getcpu system call:                            234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                           0.8 ns
- Read CPU from rseq cpu_id:                       0.8 ns
- Read CPU from rseq cpu_id (lazy register):       0.8 ns
- Read using gs segment selector:                  0.8 ns
- "lsl" inline assembly:                          13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                     16.6 ns
- getcpu system call:                             53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
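
The commit message describes user-space reading the current CPU number
directly from the cpu_id field of the registered area. A minimal sketch
of that read follows, under stated assumptions: struct demo_rseq_area,
the DEMO_* constants and demo_read_cpu() are hypothetical names made up
for illustration, and the area is filled in by hand rather than by the
kernel. Real code would register a struct rseq through the rseq system
call and would fall back to another mechanism (for example getcpu) when
the field holds one of the two special values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two CPU fields of the registered area.
 * Real code would use struct rseq from <linux/rseq.h>.
 */
struct demo_rseq_area {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
};

/* Same numeric values as enum rseq_cpu_id_state in the header. */
#define DEMO_CPU_ID_UNINITIALIZED	(-1)
#define DEMO_CPU_ID_REGISTRATION_FAILED	(-2)

/*
 * Return the CPU number cached in the area, or -1 when the field holds
 * one of the two special values and the caller has to fall back to a
 * slower mechanism.
 */
static int demo_read_cpu(const volatile struct demo_rseq_area *area)
{
	uint32_t raw = area->cpu_id;	/* one aligned 32-bit load */

	if (raw == (uint32_t)DEMO_CPU_ID_UNINITIALIZED ||
	    raw == (uint32_t)DEMO_CPU_ID_REGISTRATION_FAILED)
		return -1;
	return (int)raw;
}
```

The value returned this way is only a hint: the thread may migrate right
after the load, which is why the header asks for cpu_id to be compared
again inside a restartable sequence before committing.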
134 lines
4.4 KiB
C
/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
#ifndef _UAPI_LINUX_RSEQ_H
#define _UAPI_LINUX_RSEQ_H

/*
 * linux/rseq.h
 *
 * Restartable sequences system call API
 *
 * Copyright (c) 2015-2018 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
 */

#ifdef __KERNEL__
# include <linux/types.h>
#else
# include <stdint.h>
#endif

#include <linux/types_32_64.h>

enum rseq_cpu_id_state {
	RSEQ_CPU_ID_UNINITIALIZED = -1,
	RSEQ_CPU_ID_REGISTRATION_FAILED = -2,
};

enum rseq_flags {
	RSEQ_FLAG_UNREGISTER = (1 << 0),
};

enum rseq_cs_flags_bit {
	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
};

enum rseq_cs_flags {
	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT =
		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT),
	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL =
		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
};

/*
 * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
 * contained within a single cache-line. It is usually declared as
 * link-time constant data.
 */
struct rseq_cs {
	/* Version of this structure. */
	__u32 version;
	/* enum rseq_cs_flags */
	__u32 flags;
	LINUX_FIELD_u32_u64(start_ip);
	/* Offset from start_ip. */
	LINUX_FIELD_u32_u64(post_commit_offset);
	LINUX_FIELD_u32_u64(abort_ip);
} __attribute__((aligned(4 * sizeof(__u64))));

/*
 * struct rseq is aligned on 4 * 8 bytes to ensure it is always
 * contained within a single cache-line.
 *
 * A single struct rseq per thread is allowed.
 */
struct rseq {
	/*
	 * Restartable sequences cpu_id_start field. Updated by the
	 * kernel, and read by user-space with single-copy atomicity
	 * semantics. Aligned on 32-bit. Always contains a value in the
	 * range of possible CPUs, although the value may not be the
	 * actual current CPU (e.g. if rseq is not initialized). This
	 * CPU number value should always be compared against the value
	 * of the cpu_id field before performing a rseq commit or
	 * returning a value read from a data structure indexed using
	 * the cpu_id_start value.
	 */
	__u32 cpu_id_start;
	/*
	 * Restartable sequences cpu_id field. Updated by the kernel,
	 * and read by user-space with single-copy atomicity semantics.
	 * Aligned on 32-bit. Values RSEQ_CPU_ID_UNINITIALIZED and
	 * RSEQ_CPU_ID_REGISTRATION_FAILED have a special semantic: the
	 * former means "rseq uninitialized", and latter means "rseq
	 * initialization failed". This value is meant to be read within
	 * rseq critical sections and compared with the cpu_id_start
	 * value previously read, before performing the commit instruction,
	 * or read and compared with the cpu_id_start value before returning
	 * a value loaded from a data structure indexed using the
	 * cpu_id_start value.
	 */
	__u32 cpu_id;
	/*
	 * Restartable sequences rseq_cs field.
	 *
	 * Contains NULL when no critical section is active for the current
	 * thread, or holds a pointer to the currently active struct rseq_cs.
	 *
	 * Updated by user-space, which sets the address of the currently
	 * active rseq_cs at the beginning of assembly instruction sequence
	 * block, and set to NULL by the kernel when it restarts an assembly
	 * instruction sequence block, as well as when the kernel detects that
	 * it is preempting or delivering a signal outside of the range
	 * targeted by the rseq_cs. Also needs to be set to NULL by user-space
	 * before reclaiming memory that contains the targeted struct rseq_cs.
	 *
	 * Read and set by the kernel with single-copy atomicity semantics.
	 * Set by user-space with single-copy atomicity semantics. Aligned
	 * on 64-bit.
	 */
	LINUX_FIELD_u32_u64(rseq_cs);
	/*
	 * - RSEQ_DISABLE flag:
	 *
	 * Fallback fast-track flag for single-stepping.
	 * Set by user-space if lack of progress is detected.
	 * Cleared by user-space after rseq finish.
	 * Read by the kernel.
	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
	 *     Inhibit instruction sequence block restart and event
	 *     counter increment on preemption for this thread.
	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
	 *     Inhibit instruction sequence block restart and event
	 *     counter increment on signal delivery for this thread.
	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
	 *     Inhibit instruction sequence block restart and event
	 *     counter increment on migration for this thread.
	 */
	__u32 flags;
} __attribute__((aligned(4 * sizeof(__u64))));

#endif /* _UAPI_LINUX_RSEQ_H */
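
The header states that struct rseq_cs is aligned on 4 * 8 bytes and that
post_commit_offset is an offset from start_ip. The sketch below checks
both points on a hand-written mirror of the descriptor. It rests on
assumptions that are not in the header: struct demo_rseq_cs and
demo_ip_in_cs() are hypothetical names, each LINUX_FIELD_u32_u64()
member is taken to be a single 64-bit word (as on a 64-bit build), and
the critical section is taken to cover the half-open range
[start_ip, start_ip + post_commit_offset).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of struct rseq_cs for a 64-bit build, where each
 * LINUX_FIELD_u32_u64() member is assumed to be one 64-bit word.
 */
struct demo_rseq_cs {
	uint32_t version;
	uint32_t flags;			/* enum rseq_cs_flags */
	uint64_t start_ip;
	uint64_t post_commit_offset;	/* offset from start_ip */
	uint64_t abort_ip;
} __attribute__((aligned(4 * sizeof(uint64_t))));

/* The descriptor fills exactly one 32-byte aligned slot. */
_Static_assert(sizeof(struct demo_rseq_cs) == 32, "unexpected size");
_Static_assert(offsetof(struct demo_rseq_cs, abort_ip) == 24,
	       "unexpected offset");

/*
 * Nonzero when ip lies in [start_ip, start_ip + post_commit_offset).
 * Unsigned wrap-around turns any ip below start_ip into a huge value,
 * so a single comparison covers both bounds.
 */
static int demo_ip_in_cs(uint64_t ip, const struct demo_rseq_cs *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}
```

With start_ip = 0x1000 and post_commit_offset = 0x20, addresses 0x1000
through 0x101f fall inside the sequence, while 0x0fff and 0x1020 do not.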