mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2025-01-25 18:30:20 +07:00
f2ae679411
Alpha provides a custom implementation of dec_and_lock(). The functions is split into two parts: - atomic_add_unless() + return 0 (fast path in assembly) - remaining part including locking (slow path in C) Comparing the result of the alpha implementation with the generic implementation compiled by gcc it looks like the fast path is optimized by avoiding a stack frame (and reloading the GP), register store and all this. This is only done in the slowpath. After marking the slowpath (atomic_dec_and_lock_1()) as "noinline" and doing the slowpath in C (the atomic_add_unless(atomic, -1, 1) part) I noticed differences in the resulting assembly: - the GP is still reloaded - atomic_add_unless() adds more memory barriers compared to the custom assembly - the custom assembly here does "load, sub, beq" while atomic_add_unless() does "load, cmpeq, add, bne". This is okay because it compares against zero after subtraction while the generic code compares against 1 before. I'm not sure if avoiding the stack frame (and GP reloading) brings a lot in terms of performance. Regarding the different barriers, Peter Zijlstra says: |refcount decrement needs to be a RELEASE operation, such that all the |load/stores to the object happen before we decrement the refcount. | |Otherwise things like: | | obj->foo = 5; | refcnt_dec(&obj->ref); | |can be re-ordered, which then allows fun scenarios like: | | CPU0 CPU1 | | refcnt_dec(&obj->ref); | if (dec_and_test(&obj->ref)) | free(obj); | obj->foo = 5; // oops UaF | | |This means (for alpha) that there should be a memory barrier _before_ |the decrement, however the dec_and_lock asm thing only has one _after_, |which, per the above, is too late. | |The generic version using add_unless will result in memory barrier |before and after (because that is the rule for atomic ops with a return |value) which is strictly too many barriers for the refcount story, but |who knows what other ordering requirements code has. Remove the custom alpha implementation of dec_and_lock() and if it is an issue (performance wise) then the fast path could still be inlined. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: linux-alpha@vger.kernel.org Link: https://lkml.kernel.org/r/20180606115918.GG12198@hirez.programming.kicks-ass.net Link: https://lkml.kernel.org/r20180612161621.22645-2-bigeasy@linutronix.de
61 lines
1.6 KiB
Makefile
61 lines
1.6 KiB
Makefile
# SPDX-License-Identifier: GPL-2.0
|
|
#
|
|
# Makefile for alpha-specific library files..
|
|
#
|
|
|
|
asflags-y := $(KBUILD_CFLAGS)
|
|
ccflags-y := -Werror
|
|
|
|
# Many of these routines have implementations tuned for ev6.
|
|
# Choose them iff we're targeting ev6 specifically.
|
|
ev6-$(CONFIG_ALPHA_EV6) := ev6-
|
|
|
|
# Several make use of the cttz instruction introduced in ev67.
|
|
ev67-$(CONFIG_ALPHA_EV67) := ev67-
|
|
|
|
lib-y = __divqu.o __remqu.o __divlu.o __remlu.o \
|
|
udelay.o \
|
|
$(ev6-y)memset.o \
|
|
$(ev6-y)memcpy.o \
|
|
memmove.o \
|
|
checksum.o \
|
|
csum_partial_copy.o \
|
|
$(ev67-y)strlen.o \
|
|
stycpy.o \
|
|
styncpy.o \
|
|
$(ev67-y)strchr.o \
|
|
$(ev67-y)strrchr.o \
|
|
$(ev6-y)memchr.o \
|
|
$(ev6-y)copy_user.o \
|
|
$(ev6-y)clear_user.o \
|
|
$(ev6-y)csum_ipv6_magic.o \
|
|
$(ev6-y)clear_page.o \
|
|
$(ev6-y)copy_page.o \
|
|
fpreg.o \
|
|
callback_srm.o srm_puts.o srm_printk.o \
|
|
fls.o
|
|
|
|
# The division routines are built from single source, with different defines.
|
|
AFLAGS___divqu.o = -DDIV
|
|
AFLAGS___remqu.o = -DREM
|
|
AFLAGS___divlu.o = -DDIV -DINTSIZE
|
|
AFLAGS___remlu.o = -DREM -DINTSIZE
|
|
|
|
$(addprefix $(obj)/,__divqu.o __remqu.o __divlu.o __remlu.o): \
|
|
$(src)/$(ev6-y)divide.S FORCE
|
|
$(call if_changed_rule,as_o_S)
|
|
|
|
# There are direct branches between {str*cpy,str*cat} and stx*cpy.
|
|
# Ensure the branches are within range by merging these objects.
|
|
|
|
LDFLAGS_stycpy.o := -r
|
|
LDFLAGS_styncpy.o := -r
|
|
|
|
$(obj)/stycpy.o: $(obj)/strcpy.o $(obj)/$(ev67-y)strcat.o \
|
|
$(obj)/$(ev6-y)stxcpy.o FORCE
|
|
$(call if_changed,ld)
|
|
|
|
$(obj)/styncpy.o: $(obj)/strncpy.o $(obj)/$(ev67-y)strncat.o \
|
|
$(obj)/$(ev6-y)stxncpy.o FORCE
|
|
$(call if_changed,ld)
|