linux_dsm_epyc7002/Documentation/atomic_t.txt
Linus Torvalds e192832869 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle are:

   - rwsem scalability improvements, phase , by Waiman Long, which are
     rather impressive:

       "On a 2-socket 40-core 80-thread Skylake system with 40 reader
        and writer locking threads, the min/mean/max locking operations
        done in a 5-second testing window before the patchset were:

         40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
         40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

        After the patchset, they became:

         40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
         40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

     There's a lot of changes to the locking implementation that makes
     it similar to qrwlock, including owner handoff for more fair
     locking.

     Another microbenchmark shows how across the spectrum the
     improvements are:

       "With a locking microbenchmark running on 5.1 based kernel, the
        total locking rates (in kops/s) on a 2-socket Skylake system
        with equal numbers of readers and writers (mixed) before and
        after this patchset were:

        # of Threads   Before Patch      After Patch
        ------------   ------------      -----------
             2            2,618             4,193
             4            1,202             3,726
             8              802             3,622
            16              729             3,359
            32              319             2,826
            64              102             2,744"

     The changes are extensive and the patch-set has been through
     several iterations addressing various locking workloads. There
     might be more regressions, but unless they are pathological I
     believe we want to use this new implementation as the baseline
     going forward.

   - jump-label optimizations by Daniel Bristot de Oliveira: the primary
     motivation was to remove IPI disturbance of isolated RT-workload
     CPUs, which resulted in the implementation of batched jump-label
     updates. Beyond the improvement of the real-time characteristics
     kernel, in one test this patchset improved static key update
     overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
     as well.

   - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
     ~10 years of atomic64_t existence the various types used by the
     APIs only had to be self-consistent within each architecture -
     which means they became wildly inconsistent across architectures.
     Mark puts and end to this by reworking all the atomic64
     implementations to use 's64' as the base type for atomic64_t, and
     to ensure that this type is consistently used for parameters and
     return values in the API, avoiding further problems in this area.

   - A large set of small improvements to lockdep by Yuyang Du: type
     cleanups, output cleanups, function return type and othr cleanups
     all around the place.

   - A set of percpu ops cleanups and fixes by Peter Zijlstra.

   - Misc other changes - please see the Git log for more details"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  locking/lockdep: increase size of counters for lockdep statistics
  locking/atomics: Use sed(1) instead of non-standard head(1) option
  locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
  x86/jump_label: Make tp_vec_nr static
  x86/percpu: Optimize raw_cpu_xchg()
  x86/percpu, sched/fair: Avoid local_clock()
  x86/percpu, x86/irq: Relax {set,get}_irq_regs()
  x86/percpu: Relax smp_processor_id()
  x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  locking/rwsem: Guard against making count negative
  locking/rwsem: Adaptive disabling of reader optimistic spinning
  locking/rwsem: Enable time-based spinning on reader-owned rwsem
  locking/rwsem: Make rwsem->owner an atomic_long_t
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Clarify usage of owner's nonspinaable bit
  locking/rwsem: Wake up almost all readers in wait queue
  locking/rwsem: More optimal RT task handling of null owner
  locking/rwsem: Always release wait_lock before waking up tasks
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Make rwsem_spin_on_owner() return owner state
  ...
2019-07-08 16:12:03 -07:00

274 lines
6.9 KiB
Plaintext

On atomic types (atomic_t atomic64_t and atomic_long_t).
The atomic type provides an interface to the architecture's means of atomic
RMW operations between CPUs (atomic operations on MMIO are not supported and
can lead to fatal traps on some platforms).
API
---
The 'full' API consists of (atomic64_ and atomic_long_ prefixes omitted for
brevity):
Non-RMW ops:
atomic_read(), atomic_set()
atomic_read_acquire(), atomic_set_release()
RMW atomic operations:
Arithmetic:
atomic_{add,sub,inc,dec}()
atomic_{add,sub,inc,dec}_return{,_relaxed,_acquire,_release}()
atomic_fetch_{add,sub,inc,dec}{,_relaxed,_acquire,_release}()
Bitwise:
atomic_{and,or,xor,andnot}()
atomic_fetch_{and,or,xor,andnot}{,_relaxed,_acquire,_release}()
Swap:
atomic_xchg{,_relaxed,_acquire,_release}()
atomic_cmpxchg{,_relaxed,_acquire,_release}()
atomic_try_cmpxchg{,_relaxed,_acquire,_release}()
Reference count (but please see refcount_t):
atomic_add_unless(), atomic_inc_not_zero()
atomic_sub_and_test(), atomic_dec_and_test()
Misc:
atomic_inc_and_test(), atomic_add_negative()
atomic_dec_unless_positive(), atomic_inc_unless_negative()
Barriers:
smp_mb__{before,after}_atomic()
TYPES (signed vs unsigned)
-----
While atomic_t, atomic_long_t and atomic64_t use int, long and s64
respectively (for hysterical raisins), the kernel uses -fno-strict-overflow
(which implies -fwrapv) and defines signed overflow to behave like
2s-complement.
Therefore, an explicitly unsigned variant of the atomic ops is strictly
unnecessary and we can simply cast, there is no UB.
There was a bug in UBSAN prior to GCC-8 that would generate UB warnings for
signed types.
With this we also conform to the C/C++ _Atomic behaviour and things like
P1236R1.
SEMANTICS
---------
Non-RMW ops:
The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
and are doing it wrong.
A subtle detail of atomic_set{}() is that it should be observable to the RMW
ops. That is:
C atomic-set
{
atomic_set(v, 1);
}
P1(atomic_t *v)
{
atomic_add_unless(v, 1, 0);
}
P2(atomic_t *v)
{
atomic_set(v, 0);
}
exists
(v=2)
In this case we would expect the atomic_set() from CPU1 to either happen
before the atomic_add_unless(), in which case that latter one would no-op, or
_after_ in which case we'd overwrite its result. In no case is "2" a valid
outcome.
This is typically true on 'normal' platforms, where a regular competing STORE
will invalidate a LL/SC or fail a CMPXCHG.
The obvious case where this is not so is when we need to implement atomic ops
with a lock:
CPU0 CPU1
atomic_add_unless(v, 1, 0);
lock();
ret = READ_ONCE(v->counter); // == 1
atomic_set(v, 0);
if (ret != u) WRITE_ONCE(v->counter, 0);
WRITE_ONCE(v->counter, ret + 1);
unlock();
the typical solution is to then implement atomic_set{}() with atomic_xchg().
RMW ops:
These come in various forms:
- plain operations without return value: atomic_{}()
- operations which return the modified value: atomic_{}_return()
these are limited to the arithmetic operations because those are
reversible. Bitops are irreversible and therefore the modified value
is of dubious utility.
- operations which return the original value: atomic_fetch_{}()
- swap operations: xchg(), cmpxchg() and try_cmpxchg()
- misc; the special purpose operations that are commonly used and would,
given the interface, normally be implemented using (try_)cmpxchg loops but
are time critical and can, (typically) on LL/SC architectures, be more
efficiently implemented.
All these operations are SMP atomic; that is, the operations (for a single
atomic variable) can be fully ordered and no intermediate state is lost or
visible.
ORDERING (go read memory-barriers.txt first)
--------
The rule of thumb:
- non-RMW operations are unordered;
- RMW operations that have no return value are unordered;
- RMW operations that have a return value are fully ordered;
- RMW operations that are conditional are unordered on FAILURE,
otherwise the above rules apply.
Except of course when an operation has an explicit ordering like:
{}_relaxed: unordered
{}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
{}_release: the W of the RMW (or atomic_set) is a RELEASE
Where 'unordered' is against other memory locations. Address dependencies are
not defeated.
Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.
The barriers:
smp_mb__{before,after}_atomic()
only apply to the RMW atomic ops and can be used to augment/upgrade the
ordering inherent to the op. These barriers act almost like a full smp_mb():
smp_mb__before_atomic() orders all earlier accesses against the RMW op
itself and all accesses following it, and smp_mb__after_atomic() orders all
later accesses against the RMW op and all accesses preceding it. However,
accesses between the smp_mb__{before,after}_atomic() and the RMW op are not
ordered, so it is advisable to place the barrier right next to the RMW atomic
op whenever possible.
These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example our TSO architectures
provide full ordered atomics and these barriers are no-ops.
NOTE: when the atomic RmW ops are fully ordered, they should also imply a
compiler barrier.
Thus:
atomic_fetch_add();
is equivalent to:
smp_mb__before_atomic();
atomic_fetch_add_relaxed();
smp_mb__after_atomic();
However the atomic_fetch_add() might be implemented more efficiently.
Further, while something like:
smp_mb__before_atomic();
atomic_dec(&X);
is a 'typical' RELEASE pattern, the barrier is strictly stronger than
a RELEASE because it orders preceding instructions against both the read
and write parts of the atomic_dec(), and against all following instructions
as well. Similarly, something like:
atomic_inc(&X);
smp_mb__after_atomic();
is an ACQUIRE pattern (though very much not typical), but again the barrier is
strictly stronger than ACQUIRE. As illustrated:
C strong-acquire
{
}
P1(int *x, atomic_t *y)
{
r0 = READ_ONCE(*x);
smp_rmb();
r1 = atomic_read(y);
}
P2(int *x, atomic_t *y)
{
atomic_inc(y);
smp_mb__after_atomic();
WRITE_ONCE(*x, 1);
}
exists
(r0=1 /\ r1=0)
This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
because it would not order the W part of the RMW against the following
WRITE_ONCE. Thus:
P1 P2
t = LL.acq *y (0)
t++;
*x = 1;
r0 = *x (1)
RMB
r1 = *y (0)
SC *y, t;
is allowed.