Commit Graph

949505 Commits

Author SHA1 Message Date
Oliver O'Halloran
395ee2a2a1 powerpc/eeh: Move EEH initialisation to an arch initcall
The initialisation of EEH mostly happens in a core_initcall_sync initcall,
followed by registering a bus notifier later on in an arch_initcall.
Anything involving initcall dependecies is mostly incomprehensible unless
you've spent a while staring at code so here's the full sequence:

ppc_md.setup_arch       <-- pci_controllers are created here

...time passes...

core_initcall           <-- pci_dns are created from DT nodes
core_initcall_sync      <-- platforms call eeh_init()
postcore_initcall       <-- PCI bus type is registered
postcore_initcall_sync
arch_initcall           <-- EEH pci_bus notifier registered
subsys_initcall         <-- PHBs are scanned here

There's no real requirement to do the EEH setup at the core_initcall_sync
level. It just needs to be done after pci_dn's are created and before we
start scanning PHBs. Simplify the flow a bit by moving the platform EEH
inititalisation to an arch_initcall so we can fold the bus notifier
registration into eeh_init().

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-5-oohall@gmail.com
2020-10-06 23:22:25 +11:00
Oliver O'Halloran
5d69e46a21 powerpc/eeh: Delete eeh_ops->init
No longer used since the platforms perform their EEH initialisation before
calling eeh_init().

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-4-oohall@gmail.com
2020-10-06 23:22:25 +11:00
Oliver O'Halloran
1f8fa0cd6a powerpc/pseries: Stop using eeh_ops->init()
Fold pseries_eeh_init() into eeh_pseries_init() rather than having
eeh_init() call it via eeh_ops->init(). It's simpler and it'll let us
delete eeh_ops.init.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-3-oohall@gmail.com
2020-10-06 23:22:24 +11:00
Oliver O'Halloran
82a1ea21f1 powerpc/powernv: Stop using eeh_ops->init()
Fold pnv_eeh_init() into eeh_powernv_init() rather than having eeh_init()
call it via eeh_ops->init(). It's simpler and it'll let us delete
eeh_ops.init.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-2-oohall@gmail.com
2020-10-06 23:22:24 +11:00
Oliver O'Halloran
d125aedb40 powerpc/eeh: Rework EEH initialisation
Drop the EEH register / unregister ops thing and have the platform pass the
ops structure into eeh_init() directly. This takes one initcall out of the
EEH setup path and it means we're only doing EEH setup on the platforms
which actually support it. It's also less code and generally easier to
follow.

No functional changes.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-1-oohall@gmail.com
2020-10-06 23:22:24 +11:00
Christoph Hellwig
874dc62f54 powerpc: switch 85xx defconfigs from legacy ide to libata
Switch the 85xx defconfigs from the soon to be removed legacy ide
driver to libata.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200924041310.520970-1-hch@lst.de
2020-10-06 23:22:24 +11:00
Daniel Axtens
5c5e46dad9 powerpc: PPC_SECURE_BOOT should not require PowerNV
In commit 61f879d97c ("powerpc/pseries: Detect secure and trusted
boot state of the system.") we taught the kernel how to understand the
secure-boot parameters used by a pseries guest.

However, CONFIG_PPC_SECURE_BOOT still requires PowerNV. I didn't
catch this because pseries_le_defconfig includes support for
PowerNV and so everything still worked. Indeed, most configs will.
Nonetheless, technically PPC_SECURE_BOOT doesn't require PowerNV
any more.

The secure variables support (PPC_SECVAR_SYSFS) doesn't do anything
on pSeries yet, but I don't think it's worth adding a new condition -
at some stage we'll want to add a backend for pSeries anyway.

Fixes: 61f879d97c ("powerpc/pseries: Detect secure and trusted boot state of the system.")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200924014922.172914-1-dja@axtens.net
2020-10-06 23:22:24 +11:00
Wang Wensheng
4366337490 powerpc/papr_scm: Fix warnings about undeclared variable
Build the kernel with 'make C=2':
arch/powerpc/platforms/pseries/papr_scm.c:825:1: warning: symbol
'dev_attr_perf_stats' was not declared. Should it be static?

Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918085951.44983-1-wangwensheng4@huawei.com
2020-10-06 23:22:24 +11:00
Nicholas Piggin
455575533c powerpc/64: make restore_interrupts 64e only
This is not used by 64s.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200915114650.3980244-5-npiggin@gmail.com
2020-10-06 23:22:24 +11:00
Nicholas Piggin
903dd1ff45 powerpc/64e: remove 64s specific interrupt soft-mask code
Since the assembly soft-masking code was moved to 64e specific, there
are some 64s specific interrupt types still there. Remove them.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200915114650.3980244-4-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
012a9a97a8 powerpc/64e: remove PACA_IRQ_EE_EDGE
This is not used anywhere.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200915114650.3980244-3-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
2b48e96be2 powerpc/64: fix irq replay pt_regs->softe value
Replayed interrupts get an "artificial" struct pt_regs constructed to
pass to interrupt handler functions. This did not get the softe field
set correctly, it's as though the interrupt has hit while irqs are
disabled. It should be IRQS_ENABLED.

This is possibly harmless, asynchronous handlers should not be testing
if irqs were disabled, but it might be possible for example some code
is shared with synchronous or NMI handlers, and it makes more sense if
debug output looks at this.

Fixes: 3282a3da25 ("powerpc/64: Implement soft interrupt replay in C")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200915114650.3980244-2-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
903fd31d32 powerpc/64: fix irq replay missing preempt
Prior to commit 3282a3da25 ("powerpc/64: Implement soft interrupt
replay in C"), replayed interrupts returned by the regular interrupt
exit code, which performs preemption in case an interrupt had set
need_resched.

This logic was missed by the conversion. Adding preempt_disable/enable
around the interrupt replay and final irq enable will reschedule if
needed.

Fixes: 3282a3da25 ("powerpc/64: Implement soft interrupt replay in C")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200915114650.3980244-1-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
cdb1ea0276 powerpc/pseries: add new branch prediction security bits for link stack
The hypervisor interface has defined branch prediction security bits for
handling the link stack. Wire them up.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200825075612.224656-1-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
05504b4256 powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address
The copy buffer is implemented as a real address in the nest which is
translated from EA by copy, and used for memory access by paste. This
requires that it be invalidated by TLB invalidation.

TLBIE does invalidate the copy buffer, but TLBIEL does not. Add
cp_abort to the tlbiel sequence.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup whitespace and comment formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
9983efa83b powerpc: untangle cputable mce include
Having cputable.h include mce.h means it pulls in a bunch of low level
headers (e.g., synch.h) which then can't use CPU_FTR_ definitions.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916030234.4110379-1-npiggin@gmail.com
2020-10-06 23:22:22 +11:00
Mahesh Salgaonkar
aea948bb80 powerpc/powernv/elog: Fix race while processing OPAL error log event.
Every error log reported by OPAL is exported to userspace through a
sysfs interface and notified using kobject_uevent(). The userspace
daemon (opal_errd) then reads the error log and acknowledges the error
log is saved safely to disk. Once acknowledged the kernel removes the
respective sysfs file entry causing respective resources to be
released including kobject.

However it's possible the userspace daemon may already be scanning
elog entries when a new sysfs elog entry is created by the kernel.
User daemon may read this new entry and ack it even before kernel can
notify userspace about it through kobject_uevent() call. If that
happens then we have a potential race between
elog_ack_store->kobject_put() and kobject_uevent which can lead to
use-after-free of a kernfs object resulting in a kernel crash. eg:

  BUG: Unable to handle kernel data access on read at 0x6b6b6b6b6b6b6bfb
  Faulting instruction address: 0xc0000000008ff2a0
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
  CPU: 27 PID: 805 Comm: irq/29-opal-elo Not tainted 5.9.0-rc2-gcc-8.2.0-00214-g6f56a67bcbb5-dirty #363
  ...
  NIP kobject_uevent_env+0xa0/0x910
  LR  elog_event+0x1f4/0x2d0
  Call Trace:
    0x5deadbeef0000122 (unreliable)
    elog_event+0x1f4/0x2d0
    irq_thread_fn+0x4c/0xc0
    irq_thread+0x1c0/0x2b0
    kthread+0x1c4/0x1d0
    ret_from_kernel_thread+0x5c/0x6c

This patch fixes this race by protecting the sysfs file
creation/notification by holding a reference count on kobject until we
safely send kobject_uevent().

The function create_elog_obj() returns the elog object which if used
by caller function will end up in use-after-free problem again.
However, the return value of create_elog_obj() function isn't being
used today and there is no need as well. Hence change it to return
void to make this fix complete.

Fixes: 774fea1a38 ("powerpc/powernv: Read OPAL error log and export it through sysfs")
Cc: stable@vger.kernel.org # v3.15+
Reported-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[mpe: Rework the logic to use a single return, reword comments, add oops]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201006122051.190176-1-mpe@ellerman.id.au
2020-10-06 23:22:22 +11:00
Cédric Le Goater
ebbfeef0d8 powerpc/32: Declare stack_overflow_exception() prototype
This fixes a compile error with W=1.

CC      arch/powerpc/kernel/traps.o
../arch/powerpc/kernel/traps.c:1663:6: error: no previous prototype for ‘stack_overflow_exception’ [-Werror=missing-prototypes]
 void stack_overflow_exception(struct pt_regs *regs)
      ^~~~~~~~~~~~~~~~~~~~~~~~

Fixes: 3978eb7851 ("powerpc/32: Add early stack overflow detection with VMAP stack.")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914211007.2285999-8-clg@kaod.org
2020-09-18 20:05:25 +10:00
Cédric Le Goater
2228f19cf9 powerpc/xive: Make debug routines static
This fixes a compile error with W=1.

CC      arch/powerpc/sysdev/xive/common.o
../arch/powerpc/sysdev/xive/common.c:1568:6: error: no previous prototype for ‘xive_debug_show_cpu’ [-Werror=missing-prototypes]
 void xive_debug_show_cpu(struct seq_file *m, int cpu)
      ^~~~~~~~~~~~~~~~~~~
../arch/powerpc/sysdev/xive/common.c:1602:6: error: no previous prototype for ‘xive_debug_show_irq’ [-Werror=missing-prototypes]
 void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
      ^~~~~~~~~~~~~~~~~~~

Fixes: 930914b7d5 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914211007.2285999-5-clg@kaod.org
2020-09-18 20:05:25 +10:00
Cédric Le Goater
5ab187e01a powerpc/sstep: Remove empty if statement checking for invalid form
The check should be performed by the caller. This fixes a compile
error with W=1.

../arch/powerpc/lib/sstep.c: In function ‘mlsd_8lsd_ea’:
../arch/powerpc/lib/sstep.c:225:3: error: suggest braces around empty body in an ‘if’ statement [-Werror=empty-body]
   ; /* Invalid form. Should already be checked for by caller! */
   ^

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914211007.2285999-4-clg@kaod.org
2020-09-18 20:05:24 +10:00
Cédric Le Goater
7b2aab5f22 powerpc/sysfs: Remove unused 'err' variable in sysfs_create_dscr_default()
This fixes a compile error with W=1.

arch/powerpc/kernel/sysfs.c: In function ‘sysfs_create_dscr_default’:
arch/powerpc/kernel/sysfs.c:228:7: error: variable ‘err’ set but not used [-Werror=unused-but-set-variable]
   int err = 0;
       ^~~
cc1: all warnings being treated as errors

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914211007.2285999-2-clg@kaod.org
2020-09-18 20:05:24 +10:00
Qinglang Miao
8ec5cb12cd powerpc/powernv: fix wrong warning message in opalcore_config_init()
The logic of the warn output is incorrect. The two args should be
exchanged.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916062129.190864-1-miaoqinglang@huawei.com
2020-09-18 19:59:45 +10:00
Qinglang Miao
1d42e07e9c serial: pmac_zilog: use for_each_child_of_node() macro
Use for_each_child_of_node() macro instead of open coding it.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916062138.191188-1-miaoqinglang@huawei.com
2020-09-18 19:59:44 +10:00
Qinglang Miao
acff5e6c37 macintosh: smu_sensors: use for_each_child_of_node() macro
Use for_each_child_of_node() macro instead of open coding it.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916062125.190729-1-miaoqinglang@huawei.com
2020-09-18 19:59:44 +10:00
Qinglang Miao
9c826d31a7 drivers/macintosh/smu.c: use for_each_child_of_node() macro
Use for_each_child_of_node() macro instead of open coding it.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916062122.190586-1-miaoqinglang@huawei.com
2020-09-18 19:59:44 +10:00
Michael Ellerman
6c71cfcc01 powerpc/prom_init: Check display props exist before enabling btext
It's possible to enable CONFIG_PPC_EARLY_DEBUG_BOOTX for a pseries
kernel (maybe it shouldn't be), which is then booted with qemu/slof.

But if you do that the kernel crashes in draw_byte(), with a DAR
pointing somewhere near INT_MAX.

Adding some debug to prom_init we see that we're not able to read the
"address" property from OF, so we're just using whatever junk value
was on the stack.

So check the properties can be read properly from OF, if not we bail
out before initialising btext, which avoids the crash.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Link: https://lore.kernel.org/r/20200821103407.3362149-1-mpe@ellerman.id.au
2020-09-18 19:59:44 +10:00
Michael Ellerman
39f8756145 powerpc/smp: Move ppc_md.cpu_die() to smp_ops.cpu_offline_self()
We have smp_ops->cpu_die() and ppc_md.cpu_die(). One of them offlines
the current CPU and one offlines another CPU, can you guess which is
which? Also one is in smp_ops and one is in ppc_md?

So rename ppc_md.cpu_die(), to cpu_offline_self(), because that's what
it does. And move it into smp_ops where it belongs.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-3-mpe@ellerman.id.au
2020-09-18 19:59:43 +10:00
Michael Ellerman
bf3c1464db powerpc/smp: Fold cpu_die() into its only caller
Avoid the eternal confusion between cpu_die() and __cpu_die() by
removing the former, folding it into its only caller.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-2-mpe@ellerman.id.au
2020-09-18 19:59:43 +10:00
Michael Ellerman
1ea21ba231 powerpc: Move arch_cpu_idle_dead() into smp.c
arch_cpu_idle_dead() is in idle.c, which makes sense, but it's inside
a CONFIG_HOTPLUG_CPU block.

It would be more at home in smp.c, inside the existing
CONFIG_HOTPLUG_CPU block. Note that CONFIG_HOTPLUG_CPU depends on
CONFIG_SMP so even though smp.c is not built for SMP=n builds, that's
fine.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-1-mpe@ellerman.id.au
2020-09-18 19:59:43 +10:00
Michael Ellerman
d10ebe79df powerpc/perf: Add declarations to fix sparse warnings
Sparse warns about all the init functions:
  symbol init_ppc970_pmu was not declared. Should it be static?
  symbol init_power5p_pmu was not declared. Should it be static?
  symbol init_power5_pmu was not declared. Should it be static?
  symbol init_power6_pmu was not declared. Should it be static?
  symbol init_power7_pmu was not declared. Should it be static?
  symbol init_power9_pmu was not declared. Should it be static?
  symbol init_power8_pmu was not declared. Should it be static?
  symbol init_generic_compat_pmu was not declared. Should it be static?

They're already declared in internal.h, so just make sure all the C
files include that directly or indirectly.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://lore.kernel.org/r/20200916115637.3100484-2-mpe@ellerman.id.au
2020-09-18 19:59:43 +10:00
Michael Ellerman
ef1edbba52 powerpc/mm/64s: Fix slb_setup_new_exec() sparse warning
Sparse says:
  symbol slb_setup_new_exec was not declared. Should it be static?

No, it should have a declaration in a header, add one.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916115637.3100484-1-mpe@ellerman.id.au
2020-09-18 19:59:43 +10:00
Liu Shixin
96543e7352 powerpc/pseries: convert to use DEFINE_SEQ_ATTRIBUTE macro
Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916025026.3992835-1-liushixin2@huawei.com
2020-09-18 19:59:43 +10:00
Yang Yingliang
bda7673d64 powerpc/book3s64: fix link error with CONFIG_PPC_RADIX_MMU=n
Fix link error when CONFIG_PPC_RADIX_MMU is disabled:
powerpc64-linux-gnu-ld: arch/powerpc/platforms/pseries/lpar.o:(.toc+0x0): undefined reference to `mmu_pid_bits'

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200917020643.90375-1-yangyingliang@huawei.com
2020-09-18 19:59:43 +10:00
Michael Ellerman
0b30191b27 Merge branch 'topic/irqs-off-activate-mm' into next
Merge Nick's series to add ARCH_WANT_IRQS_OFF_ACTIVATE_MM.
2020-09-18 18:14:06 +10:00
Michael Ellerman
d208e13c6a powerpc/process: Fix uninitialised variable error
Clang, and GCC with -Wmaybe-uninitialized, can't see that val is
unused in get_fpexec_mode():

  arch/powerpc/kernel/process.c:1940:7: error: variable 'val' is used
  uninitialized whenever 'if' condition is true
		  if (cpu_has_feature(CPU_FTR_SPE)) {
		      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

We know that CPU_FTR_SPE will only be true iff CONFIG_SPE is also
true, but the compiler doesn't.

Avoid it by initialising val to zero.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 532ed1900d ("powerpc/process: Remove useless #ifdef CONFIG_SPE")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20200917024509.3253837-1-mpe@ellerman.id.au
2020-09-18 18:12:46 +10:00
Michael Ellerman
b5c8a2934e Merge coregroup support into next
From Srikar's cover letter, with some reformatting:

Cleanup of existing powerpc topologies and add coregroup support on
powerpc. Coregroup is a group of (subset of) cores of a DIE that share
a resource.

Summary of some of the testing done with coregroup patchset.

It includes ebizzy, schbench, perf bench sched pipe and topology
verification. On the left side are results from powerpc/next tree and
on the right are the results with the patchset applied. Topological
verification clearly shows that there is no change in topology with
and without the patches on all the 3 class of systems that were
tested.

Power 9 PowerNV (2 Node/ 160 Cpu System)
----------------------------------------

Baseline                                                                Baseline + Coregroup Support

  N      Min       Max    Median       Avg        Stddev                  N      Min       Max    Median       Avg      Stddev
100   993884   1276090   1173476   1165914     54867.201                100   910470   1279820   1171095   1162091    67363.28

^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)

schbench (latency hence lower is better)
Latency percentiles (usec)                                              Latency percentiles (usec)
        50.0th: 455                                                             50.0th: 454
        75.0th: 533                                                             75.0th: 543
        90.0th: 683                                                             90.0th: 701
        95.0th: 743                                                             95.0th: 737
        *99.0th: 815                                                            *99.0th: 805
        99.5th: 839                                                             99.5th: 835
        99.9th: 913                                                             99.9th: 893
        min=0, max=1011                                                         min=0, max=2833

perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes

     Total time: 6.083 [sec]                                                 Total time: 6.303 [sec]

       6.083576 usecs/op                                                       6.303318 usecs/op
         164377 ops/sec                                                          158646 ops/sec

Power 9 LPAR (2 Node/ 128 Cpu System)
-------------------------------------

Baseline                                                                Baseline + Coregroup Support

  N       Min       Max    Median         Avg      Stddev                 N       Min       Max    Median         Avg      Stddev
100   1058029   1295393   1200414   1188306.7   56786.538               100    943264   1287619   1180522   1168473.2   64469.955

^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)

schbench (latency hence lower is better)
Latency percentiles (usec)                                              Latency percentiles (usec)
        50.0000th: 34                                                           50.0000th: 39
        75.0000th: 46                                                           75.0000th: 52
        90.0000th: 53                                                           90.0000th: 68
        95.0000th: 56                                                           95.0000th: 77
        *99.0000th: 61                                                          *99.0000th: 89
        99.5000th: 63                                                           99.5000th: 94
        99.9000th: 81                                                           99.9000th: 169
        min=0, max=8405                                                         min=0, max=23674

perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes

     Total time: 8.768 [sec]                                                 Total time: 5.217 [sec]

       8.768400 usecs/op                                                       5.217625 usecs/op
         114045 ops/sec                                                          191658 ops/sec

Power 8 LPAR (8 Node/ 256 Cpu System)
-------------------------------------

Baseline                                                                Baseline + Coregroup Support

  N       Min       Max    Median         Avg      Stddev                 N      Min      Max   Median        Avg     Stddev
100   1267615   1965234   1707423   1689137.6   144363.29               100  1175357  1924262  1691104  1664792.1   145876.4

^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)

schbench (latency hence lower is better)
Latency percentiles (usec)                                              Latency percentiles (usec)
        50.0th: 37                                                              50.0th: 36
        75.0th: 51                                                              75.0th: 48
        90.0th: 59                                                              90.0th: 55
        95.0th: 63                                                              95.0th: 59
        *99.0th: 71                                                             *99.0th: 67
        99.5th: 75                                                              99.5th: 72
        99.9th: 105                                                             99.9th: 170
        min=0, max=18560                                                        min=0, max=27031

perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes

     Total time: 6.013 [sec]                                                 Total time: 5.930 [sec]

       6.013963 usecs/op                                                       5.930724 usecs/op
         166279 ops/sec                                                          168613 ops/sec

Topology verification on Power9
Power9 / powernv / SMT4

  $ tail /proc/cpuinfo
  cpu             : POWER9, altivec supported
  clock           : 3600.000000MHz
  revision        : 2.2 (pvr 004e 1202)

  timebase        : 512000000
  platform        : PowerNV
  model           : 9006-22P
  machine         : PowerNV 9006-22P
  firmware        : OPAL
  MMU             : Radix

Baseline                                                                Baseline + Coregroup Support

  lscpu                                                                 lscpu
  ------                                                                ------
  Architecture:        ppc64le                                          Architecture:        ppc64le
  Byte Order:          Little Endian                                    Byte Order:          Little Endian
  CPU(s):              160                                              CPU(s):              160
  On-line CPU(s) list: 0-159                                            On-line CPU(s) list: 0-159
  Thread(s) per core:  4                                                Thread(s) per core:  4
  Core(s) per socket:  20                                               Core(s) per socket:  20
  Socket(s):           2                                                Socket(s):           2
  NUMA node(s):        2                                                NUMA node(s):        2
  Model:               2.2 (pvr 004e 1202)                              Model:               2.2 (pvr 004e 1202)
  Model name:          POWER9, altivec supported                        Model name:          POWER9, altivec supported
  CPU max MHz:         3800.0000                                        CPU max MHz:         3800.0000
  CPU min MHz:         2166.0000                                        CPU min MHz:         2166.0000
  L1d cache:           32K                                              L1d cache:           32K
  L1i cache:           32K                                              L1i cache:           32K
  L2 cache:            512K                                             L2 cache:            512K
  L3 cache:            10240K                                           L3 cache:            10240K
  NUMA node0 CPU(s):   0-79                                             NUMA node0 CPU(s):   0-79
  NUMA node8 CPU(s):   80-159                                           NUMA node8 CPU(s):   80-159

  grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
  -----------------------------------------------------                 -----------------------------------------------------
  /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT                   /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT
  /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE                 /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE
  /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE                   /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE
  /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA                  /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA

  grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags               grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
  ------------------------------------------------------                ------------------------------------------------------
  /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391                 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391
  /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327                 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327
  /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071                 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071
  /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801                /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801

Baseline

  head /proc/schedstat
  --------------------
  version 15
  timestamp 4295043536
  cpu0 0 0 0 0 0 0 9597119314 2408913694 11897
  domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  cpu1 0 0 0 0 0 0 4941435230 11106132 1583
  domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Baseline + Coregroup Support

  head /proc/schedstat
  --------------------
  version 15
  timestamp 4296311826
  cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483
  domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090
  domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

  Post sudo ppc64_cpu --smt=1                                           Post sudo ppc64_cpu --smt=1
  ---------------------                                                 ---------------------
  grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
  -----------------------------------------------------                 -----------------------------------------------------
  /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE                 /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE
  /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE                   /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE
  /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA                  /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA

  grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags               grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
  ------------------------------------------------------                ------------------------------------------------------
  /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327                 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327
  /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071                 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071
  /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801                /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801

Baseline:

  head /proc/schedstat
  --------------------
  version 15
  timestamp 4295046242
  cpu0 0 0 0 0 0 0 10978610020 2658997390 13068
  domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  cpu4 0 0 0 0 0 0 5408663896 95701034 7697
  domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Baseline + Coregroup Support

  head /proc/schedstat
  --------------------
  version 15
  timestamp 4296314905
  cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723
  domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776
  domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Similar verification was done on Power 8 (8 Node 256 CPU LPAR) and
Power 9 (2 node 128 Cpu LPAR) and they showed the topology before and
after the patch to be identical. If Interested, I could provide the
same.

On Power 9 (with device-tree enablement to show coregroups):

  $ tail /proc/cpuinfo
  processor     : 127
  cpu           : POWER9 (architected), altivec supported
  clock         : 3000.000000MHz
  revision      : 2.2 (pvr 004e 0202)

  timebase      : 512000000
  platform      : pSeries
  model         : IBM,9008-22L
  machine       : CHRP IBM,9008-22L
  MMU           : Hash

Before patchset:

  $ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
  SMT
  CACHE
  DIE
  NUMA

  $ head /proc/schedstat
  version 15
  timestamp 4318242208
  cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
  domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  cpu1 0 0 0 0 0 0 24177439200 413887604 75393
  domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After patchset:

  $ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
  SMT
  CACHE
  MC
  DIE
  NUMA

  $ head /proc/schedstat
  version 15
  timestamp 4318242208
  cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
  domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  cpu1 0 0 0 0 0 0 24177439200 413887604 75393
  domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2020-09-16 22:21:05 +10:00
Srikar Dronamraju
fa35e868f9 powerpc/smp: Implement cpu_to_coregroup_id
Lookup the coregroup id from the associativity array.

If unable to detect the coregroup id, fallback on the core id.
This way, ensure sched_domain degenerates and an extra sched domain is
not created.

Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
don't need to find the primary domain again.

If the device-tree mentions more than one coregroup, then kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
2020-09-16 22:13:32 +10:00
Srikar Dronamraju
72730bfc2a powerpc/smp: Create coregroup domain
Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain. Do note this patch is only creating stubs
for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
would be in a subsequent patch.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
2020-09-16 22:13:32 +10:00
Srikar Dronamraju
6e08630281 powerpc/smp: Allocate cpumask only after searching thread group
If allocated earlier and the search fails, then cpu_l1_cache_map cpumask
is unnecessarily cleared. However cpu_l1_cache_map can be allocated /
cleared after we search thread group.

Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask
allocated by zalloc_cpumask_var_node is never freed.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-9-srikar@linux.vnet.ibm.com
2020-09-16 22:13:32 +10:00
Srikar Dronamraju
f9f130ff2e powerpc/numa: Detect support for coregroup
Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups, i.e one sub-group
per DIE.

Currently there are no firmwares that are exposing this grouping. Hence
allow the basis for grouping to be abstract.  Once the firmware starts
using this grouping, code would be added to detect the type of grouping
and adjust the sd domain flags accordingly.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
2020-09-16 22:13:31 +10:00
Srikar Dronamraju
caa8e29da5 powerpc/smp: Optimize start_secondary
In start_secondary, even if shared_cache was already set, system does a
redundant match for cpumask. This redundant check can be removed by
checking if shared_cache is already set.

While here, localize the sibling_mask variable to within the if
condition.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-7-srikar@linux.vnet.ibm.com
2020-09-16 22:13:31 +10:00
Srikar Dronamraju
f6606cfdfb powerpc/smp: Dont assume l2-cache to be superset of sibling
Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200913171038.GB11808@linux.vnet.ibm.com
2020-09-16 22:13:31 +10:00
Srikar Dronamraju
3c6032a8fe powerpc/smp: Move topology fixups into a new function
Move topology fixup based on the platform attributes into its own
function which is called just before set_sched_topology.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-5-srikar@linux.vnet.ibm.com
2020-09-16 22:13:31 +10:00
Srikar Dronamraju
5e93f16ae4 powerpc/smp: Move powerpc_topology above
Just moving the powerpc_topology description above.
This will help in using functions in this file and avoid declarations.

No other functional changes

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-4-srikar@linux.vnet.ibm.com
2020-09-16 22:13:31 +10:00
Srikar Dronamraju
2ef0ca54d9 powerpc/smp: Merge Power9 topology with Power topology
A new sched_domain_topology_level was added just for Power9. However the
same can be achieved by merging powerpc_topology with power9_topology
and makes the code more simpler especially when adding a new sched
domain.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-3-srikar@linux.vnet.ibm.com
2020-09-16 22:13:31 +10:00
Srikar Dronamraju
d0fd24bbd2 powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
Fix a build warning in a non CONFIG_NEED_MULTIPLE_NODES
"error: _numa_cpu_lookup_table_ undeclared"

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-2-srikar@linux.vnet.ibm.com
2020-09-16 22:13:30 +10:00
Srikar Dronamraju
e75130f20b powerpc/numa: Offline memoryless cpuless node 0
Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot.  However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.

By marking, node 0 as offline, lets stop assuming that node 0 is
always online. If node 0 has CPU or memory that are online, node 0 will
again be set as online.

v5.8
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.8 + patch
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Example of a node with online CPUs/memory on node 0.
(Same o/p with and without patch)
numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 32482 MB
node 0 free: 22994 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  20  40  40
  1:  20  10  40  40
  2:  40  40  10  20
  3:  40  40  20  10

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.

2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump

3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
2020-09-16 22:05:20 +10:00
Srikar Dronamraju
6398eaa268 powerpc/numa: Prefer node id queried from vphn
Node id queried from the static device tree may not
be correct. For example: it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fallback on the device tree
based node id if vphn query fails.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
2020-09-16 22:05:19 +10:00
Srikar Dronamraju
a874f1005e powerpc/numa: Set numa_node for all possible cpus
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
2020-09-16 22:05:19 +10:00
Srikar Dronamraju
67df77845c powerpc/numa: Restrict possible nodes based on platform
As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
Abstraction Services (RTAS) Node" available at:
  https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf

... there are 2 device tree properties:

  "ibm,max-associativity-domains"
   which defines the maximum number of domains that the firmware i.e
   PowerVM can support.

and:

  "ibm,current-associativity-domains"
   which defines the maximum number of domains that the current
   platform can support.

The value of "ibm,max-associativity-domains" is always greater than or
equal to "ibm,current-associativity-domains" property. If the latter
property is not available, use "ibm,max-associativity-domain" as a
fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
is mentioned in page 833 / B.5.3 which is covered under under
"Appendix B. System Binding" section

Currently powerpc uses the "ibm,max-associativity-domains" property
while setting the possible number of nodes. This is currently set at
32. However the possible number of nodes for a platform may be
significantly less. Hence set the possible number of nodes based on
"ibm,current-associativity-domains" property.

Nathan Lynch had raised a valid concern that post LPM (Live Partition
Migration), a user could DLPAR add processors and memory after LPM
with "new" associativity properties:
  https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u

He also pointed out that "ibm,max-associativity-domains" has the same
contents on all currently available PowerVM systems, unlike
"ibm,current-associativity-domains" and hence may be better able to
handle the new NUMA associativity properties.

However with the recent commit dbce456280 ("powerpc/numa: Limit
possible nodes to within num_possible_nodes"), all new NUMA
associativity properties are capped to initially set nr_node_ids.
Hence this commit should be safe with any new DLPAR add post LPM.

  $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
  /proc/device-tree/rtas/ibm,current-associativity-domains
  		 00000005 00000001 00000002 00000002 00000002 00000010
  /proc/device-tree/rtas/ibm,max-associativity-domains
  		 00000005 00000001 00000008 00000020 00000020 00000100

  $ cat /sys/devices/system/node/possible ##Before patch
  0-31

  $ cat /sys/devices/system/node/possible ##After patch
  0-1

Note the maximum nodes this platform can support is only 2 but the
possible nodes is set to 32.

This is important because lot of kernel and user space code allocate
structures for all possible nodes leading to a lot of memory that is
allocated but not used.

I ran a simple experiment to create and destroy 100 memory cgroups on
boot on a 8 node machine (Power8 Alpine).

Before patch:
  free -k at boot
                total        used        free      shared  buff/cache   available
  Mem:      523498176     4106816   518820608       22272      570752   516606720
  Swap:       4194240           0     4194240

  free -k after creating 100 memory cgroups
                total        used        free      shared  buff/cache   available
  Mem:      523498176     4628416   518246464       22336      623296   516058688
  Swap:       4194240           0     4194240

  free -k after destroying 100 memory cgroups
                total        used        free      shared  buff/cache   available
  Mem:      523498176     4697408   518173760       22400      627008   515987904
  Swap:       4194240           0     4194240

After patch:
  free -k at boot
                total        used        free      shared  buff/cache   available
  Mem:      523498176     3969472   518933888       22272      594816   516731776
  Swap:       4194240           0     4194240

  free -k after creating 100 memory cgroups
                total        used        free      shared  buff/cache   available
  Mem:      523498176     4181888   518676096       22208      640192   516496448
  Swap:       4194240           0     4194240

  free -k after destroying 100 memory cgroups
                total        used        free      shared  buff/cache   available
  Mem:      523498176     4232320   518619904       22272      645952   516443264
  Swap:       4194240           0     4194240

Observations:
  Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
  Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Reformat change log a bit for readability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
2020-09-16 22:05:19 +10:00