linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 11:18:45 +07:00

Author	SHA1	Message	Date
Oliver O'Halloran	395ee2a2a1	powerpc/eeh: Move EEH initialisation to an arch initcall The initialisation of EEH mostly happens in a core_initcall_sync initcall, followed by registering a bus notifier later on in an arch_initcall. Anything involving initcall dependecies is mostly incomprehensible unless you've spent a while staring at code so here's the full sequence: ppc_md.setup_arch <-- pci_controllers are created here ...time passes... core_initcall <-- pci_dns are created from DT nodes core_initcall_sync <-- platforms call eeh_init() postcore_initcall <-- PCI bus type is registered postcore_initcall_sync arch_initcall <-- EEH pci_bus notifier registered subsys_initcall <-- PHBs are scanned here There's no real requirement to do the EEH setup at the core_initcall_sync level. It just needs to be done after pci_dn's are created and before we start scanning PHBs. Simplify the flow a bit by moving the platform EEH inititalisation to an arch_initcall so we can fold the bus notifier registration into eeh_init(). Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-5-oohall@gmail.com	2020-10-06 23:22:25 +11:00
Oliver O'Halloran	5d69e46a21	powerpc/eeh: Delete eeh_ops->init No longer used since the platforms perform their EEH initialisation before calling eeh_init(). Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-4-oohall@gmail.com	2020-10-06 23:22:25 +11:00
Oliver O'Halloran	1f8fa0cd6a	powerpc/pseries: Stop using eeh_ops->init() Fold pseries_eeh_init() into eeh_pseries_init() rather than having eeh_init() call it via eeh_ops->init(). It's simpler and it'll let us delete eeh_ops.init. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-3-oohall@gmail.com	2020-10-06 23:22:24 +11:00
Oliver O'Halloran	82a1ea21f1	powerpc/powernv: Stop using eeh_ops->init() Fold pnv_eeh_init() into eeh_powernv_init() rather than having eeh_init() call it via eeh_ops->init(). It's simpler and it'll let us delete eeh_ops.init. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-2-oohall@gmail.com	2020-10-06 23:22:24 +11:00
Oliver O'Halloran	d125aedb40	powerpc/eeh: Rework EEH initialisation Drop the EEH register / unregister ops thing and have the platform pass the ops structure into eeh_init() directly. This takes one initcall out of the EEH setup path and it means we're only doing EEH setup on the platforms which actually support it. It's also less code and generally easier to follow. No functional changes. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-1-oohall@gmail.com	2020-10-06 23:22:24 +11:00
Christoph Hellwig	874dc62f54	powerpc: switch 85xx defconfigs from legacy ide to libata Switch the 85xx defconfigs from the soon to be removed legacy ide driver to libata. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200924041310.520970-1-hch@lst.de	2020-10-06 23:22:24 +11:00
Daniel Axtens	5c5e46dad9	powerpc: PPC_SECURE_BOOT should not require PowerNV In commit `61f879d97c` ("powerpc/pseries: Detect secure and trusted boot state of the system.") we taught the kernel how to understand the secure-boot parameters used by a pseries guest. However, CONFIG_PPC_SECURE_BOOT still requires PowerNV. I didn't catch this because pseries_le_defconfig includes support for PowerNV and so everything still worked. Indeed, most configs will. Nonetheless, technically PPC_SECURE_BOOT doesn't require PowerNV any more. The secure variables support (PPC_SECVAR_SYSFS) doesn't do anything on pSeries yet, but I don't think it's worth adding a new condition - at some stage we'll want to add a backend for pSeries anyway. Fixes: `61f879d97c` ("powerpc/pseries: Detect secure and trusted boot state of the system.") Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200924014922.172914-1-dja@axtens.net	2020-10-06 23:22:24 +11:00
Wang Wensheng	4366337490	powerpc/papr_scm: Fix warnings about undeclared variable Build the kernel with 'make C=2': arch/powerpc/platforms/pseries/papr_scm.c:825:1: warning: symbol 'dev_attr_perf_stats' was not declared. Should it be static? Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918085951.44983-1-wangwensheng4@huawei.com	2020-10-06 23:22:24 +11:00
Nicholas Piggin	455575533c	powerpc/64: make restore_interrupts 64e only This is not used by 64s. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-5-npiggin@gmail.com	2020-10-06 23:22:24 +11:00
Nicholas Piggin	903dd1ff45	powerpc/64e: remove 64s specific interrupt soft-mask code Since the assembly soft-masking code was moved to 64e specific, there are some 64s specific interrupt types still there. Remove them. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-4-npiggin@gmail.com	2020-10-06 23:22:23 +11:00
Nicholas Piggin	012a9a97a8	powerpc/64e: remove PACA_IRQ_EE_EDGE This is not used anywhere. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-3-npiggin@gmail.com	2020-10-06 23:22:23 +11:00
Nicholas Piggin	2b48e96be2	powerpc/64: fix irq replay pt_regs->softe value Replayed interrupts get an "artificial" struct pt_regs constructed to pass to interrupt handler functions. This did not get the softe field set correctly, it's as though the interrupt has hit while irqs are disabled. It should be IRQS_ENABLED. This is possibly harmless, asynchronous handlers should not be testing if irqs were disabled, but it might be possible for example some code is shared with synchronous or NMI handlers, and it makes more sense if debug output looks at this. Fixes: `3282a3da25` ("powerpc/64: Implement soft interrupt replay in C") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-2-npiggin@gmail.com	2020-10-06 23:22:23 +11:00
Nicholas Piggin	903fd31d32	powerpc/64: fix irq replay missing preempt Prior to commit `3282a3da25` ("powerpc/64: Implement soft interrupt replay in C"), replayed interrupts returned by the regular interrupt exit code, which performs preemption in case an interrupt had set need_resched. This logic was missed by the conversion. Adding preempt_disable/enable around the interrupt replay and final irq enable will reschedule if needed. Fixes: `3282a3da25` ("powerpc/64: Implement soft interrupt replay in C") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-1-npiggin@gmail.com	2020-10-06 23:22:23 +11:00
Nicholas Piggin	cdb1ea0276	powerpc/pseries: add new branch prediction security bits for link stack The hypervisor interface has defined branch prediction security bits for handling the link stack. Wire them up. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200825075612.224656-1-npiggin@gmail.com	2020-10-06 23:22:23 +11:00
Nicholas Piggin	05504b4256	powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address The copy buffer is implemented as a real address in the nest which is translated from EA by copy, and used for memory access by paste. This requires that it be invalidated by TLB invalidation. TLBIE does invalidate the copy buffer, but TLBIEL does not. Add cp_abort to the tlbiel sequence. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Fixup whitespace and comment formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com	2020-10-06 23:22:23 +11:00
Nicholas Piggin	9983efa83b	powerpc: untangle cputable mce include Having cputable.h include mce.h means it pulls in a bunch of low level headers (e.g., synch.h) which then can't use CPU_FTR_ definitions. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916030234.4110379-1-npiggin@gmail.com	2020-10-06 23:22:22 +11:00
Mahesh Salgaonkar	aea948bb80	powerpc/powernv/elog: Fix race while processing OPAL error log event. Every error log reported by OPAL is exported to userspace through a sysfs interface and notified using kobject_uevent(). The userspace daemon (opal_errd) then reads the error log and acknowledges the error log is saved safely to disk. Once acknowledged the kernel removes the respective sysfs file entry causing respective resources to be released including kobject. However it's possible the userspace daemon may already be scanning elog entries when a new sysfs elog entry is created by the kernel. User daemon may read this new entry and ack it even before kernel can notify userspace about it through kobject_uevent() call. If that happens then we have a potential race between elog_ack_store->kobject_put() and kobject_uevent which can lead to use-after-free of a kernfs object resulting in a kernel crash. eg: BUG: Unable to handle kernel data access on read at 0x6b6b6b6b6b6b6bfb Faulting instruction address: 0xc0000000008ff2a0 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV CPU: 27 PID: 805 Comm: irq/29-opal-elo Not tainted 5.9.0-rc2-gcc-8.2.0-00214-g6f56a67bcbb5-dirty #363 ... NIP kobject_uevent_env+0xa0/0x910 LR elog_event+0x1f4/0x2d0 Call Trace: 0x5deadbeef0000122 (unreliable) elog_event+0x1f4/0x2d0 irq_thread_fn+0x4c/0xc0 irq_thread+0x1c0/0x2b0 kthread+0x1c4/0x1d0 ret_from_kernel_thread+0x5c/0x6c This patch fixes this race by protecting the sysfs file creation/notification by holding a reference count on kobject until we safely send kobject_uevent(). The function create_elog_obj() returns the elog object which if used by caller function will end up in use-after-free problem again. However, the return value of create_elog_obj() function isn't being used today and there is no need as well. Hence change it to return void to make this fix complete. Fixes: `774fea1a38` ("powerpc/powernv: Read OPAL error log and export it through sysfs") Cc: stable@vger.kernel.org # v3.15+ Reported-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [mpe: Rework the logic to use a single return, reword comments, add oops] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201006122051.190176-1-mpe@ellerman.id.au	2020-10-06 23:22:22 +11:00
Cédric Le Goater	ebbfeef0d8	powerpc/32: Declare stack_overflow_exception() prototype This fixes a compile error with W=1. CC arch/powerpc/kernel/traps.o ../arch/powerpc/kernel/traps.c:1663:6: error: no previous prototype for ‘stack_overflow_exception’ [-Werror=missing-prototypes] void stack_overflow_exception(struct pt_regs *regs) ^~~~~~~~~~~~~~~~~~~~~~~~ Fixes: `3978eb7851` ("powerpc/32: Add early stack overflow detection with VMAP stack.") Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200914211007.2285999-8-clg@kaod.org	2020-09-18 20:05:25 +10:00
Cédric Le Goater	2228f19cf9	powerpc/xive: Make debug routines static This fixes a compile error with W=1. CC arch/powerpc/sysdev/xive/common.o ../arch/powerpc/sysdev/xive/common.c:1568:6: error: no previous prototype for ‘xive_debug_show_cpu’ [-Werror=missing-prototypes] void xive_debug_show_cpu(struct seq_file m, int cpu) ^~~~~~~~~~~~~~~~~~~ ../arch/powerpc/sysdev/xive/common.c:1602:6: error: no previous prototype for ‘xive_debug_show_irq’ [-Werror=missing-prototypes] void xive_debug_show_irq(struct seq_file m, u32 hw_irq, struct irq_data *d) ^~~~~~~~~~~~~~~~~~~ Fixes: `930914b7d5` ("powerpc/xive: Add a debugfs file to dump internal XIVE state") Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200914211007.2285999-5-clg@kaod.org	2020-09-18 20:05:25 +10:00
Cédric Le Goater	5ab187e01a	powerpc/sstep: Remove empty if statement checking for invalid form The check should be performed by the caller. This fixes a compile error with W=1. ../arch/powerpc/lib/sstep.c: In function ‘mlsd_8lsd_ea’: ../arch/powerpc/lib/sstep.c:225:3: error: suggest braces around empty body in an ‘if’ statement [-Werror=empty-body] ; /* Invalid form. Should already be checked for by caller! */ ^ Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200914211007.2285999-4-clg@kaod.org	2020-09-18 20:05:24 +10:00
Cédric Le Goater	7b2aab5f22	powerpc/sysfs: Remove unused 'err' variable in sysfs_create_dscr_default() This fixes a compile error with W=1. arch/powerpc/kernel/sysfs.c: In function ‘sysfs_create_dscr_default’: arch/powerpc/kernel/sysfs.c:228:7: error: variable ‘err’ set but not used [-Werror=unused-but-set-variable] int err = 0; ^~~ cc1: all warnings being treated as errors Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200914211007.2285999-2-clg@kaod.org	2020-09-18 20:05:24 +10:00
Qinglang Miao	8ec5cb12cd	powerpc/powernv: fix wrong warning message in opalcore_config_init() The logic of the warn output is incorrect. The two args should be exchanged. Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916062129.190864-1-miaoqinglang@huawei.com	2020-09-18 19:59:45 +10:00
Qinglang Miao	1d42e07e9c	serial: pmac_zilog: use for_each_child_of_node() macro Use for_each_child_of_node() macro instead of open coding it. Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916062138.191188-1-miaoqinglang@huawei.com	2020-09-18 19:59:44 +10:00
Qinglang Miao	acff5e6c37	macintosh: smu_sensors: use for_each_child_of_node() macro Use for_each_child_of_node() macro instead of open coding it. Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916062125.190729-1-miaoqinglang@huawei.com	2020-09-18 19:59:44 +10:00
Qinglang Miao	9c826d31a7	drivers/macintosh/smu.c: use for_each_child_of_node() macro Use for_each_child_of_node() macro instead of open coding it. Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916062122.190586-1-miaoqinglang@huawei.com	2020-09-18 19:59:44 +10:00
Michael Ellerman	6c71cfcc01	powerpc/prom_init: Check display props exist before enabling btext It's possible to enable CONFIG_PPC_EARLY_DEBUG_BOOTX for a pseries kernel (maybe it shouldn't be), which is then booted with qemu/slof. But if you do that the kernel crashes in draw_byte(), with a DAR pointing somewhere near INT_MAX. Adding some debug to prom_init we see that we're not able to read the "address" property from OF, so we're just using whatever junk value was on the stack. So check the properties can be read properly from OF, if not we bail out before initialising btext, which avoids the crash. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Link: https://lore.kernel.org/r/20200821103407.3362149-1-mpe@ellerman.id.au	2020-09-18 19:59:44 +10:00
Michael Ellerman	39f8756145	powerpc/smp: Move ppc_md.cpu_die() to smp_ops.cpu_offline_self() We have smp_ops->cpu_die() and ppc_md.cpu_die(). One of them offlines the current CPU and one offlines another CPU, can you guess which is which? Also one is in smp_ops and one is in ppc_md? So rename ppc_md.cpu_die(), to cpu_offline_self(), because that's what it does. And move it into smp_ops where it belongs. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200819015634.1974478-3-mpe@ellerman.id.au	2020-09-18 19:59:43 +10:00
Michael Ellerman	bf3c1464db	powerpc/smp: Fold cpu_die() into its only caller Avoid the eternal confusion between cpu_die() and __cpu_die() by removing the former, folding it into its only caller. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200819015634.1974478-2-mpe@ellerman.id.au	2020-09-18 19:59:43 +10:00
Michael Ellerman	1ea21ba231	powerpc: Move arch_cpu_idle_dead() into smp.c arch_cpu_idle_dead() is in idle.c, which makes sense, but it's inside a CONFIG_HOTPLUG_CPU block. It would be more at home in smp.c, inside the existing CONFIG_HOTPLUG_CPU block. Note that CONFIG_HOTPLUG_CPU depends on CONFIG_SMP so even though smp.c is not built for SMP=n builds, that's fine. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200819015634.1974478-1-mpe@ellerman.id.au	2020-09-18 19:59:43 +10:00
Michael Ellerman	d10ebe79df	powerpc/perf: Add declarations to fix sparse warnings Sparse warns about all the init functions: symbol init_ppc970_pmu was not declared. Should it be static? symbol init_power5p_pmu was not declared. Should it be static? symbol init_power5_pmu was not declared. Should it be static? symbol init_power6_pmu was not declared. Should it be static? symbol init_power7_pmu was not declared. Should it be static? symbol init_power9_pmu was not declared. Should it be static? symbol init_power8_pmu was not declared. Should it be static? symbol init_generic_compat_pmu was not declared. Should it be static? They're already declared in internal.h, so just make sure all the C files include that directly or indirectly. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://lore.kernel.org/r/20200916115637.3100484-2-mpe@ellerman.id.au	2020-09-18 19:59:43 +10:00
Michael Ellerman	ef1edbba52	powerpc/mm/64s: Fix slb_setup_new_exec() sparse warning Sparse says: symbol slb_setup_new_exec was not declared. Should it be static? No, it should have a declaration in a header, add one. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916115637.3100484-1-mpe@ellerman.id.au	2020-09-18 19:59:43 +10:00
Liu Shixin	96543e7352	powerpc/pseries: convert to use DEFINE_SEQ_ATTRIBUTE macro Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916025026.3992835-1-liushixin2@huawei.com	2020-09-18 19:59:43 +10:00
Yang Yingliang	bda7673d64	powerpc/book3s64: fix link error with CONFIG_PPC_RADIX_MMU=n Fix link error when CONFIG_PPC_RADIX_MMU is disabled: powerpc64-linux-gnu-ld: arch/powerpc/platforms/pseries/lpar.o:(.toc+0x0): undefined reference to `mmu_pid_bits' Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200917020643.90375-1-yangyingliang@huawei.com	2020-09-18 19:59:43 +10:00
Michael Ellerman	0b30191b27	Merge branch 'topic/irqs-off-activate-mm' into next Merge Nick's series to add ARCH_WANT_IRQS_OFF_ACTIVATE_MM.	2020-09-18 18:14:06 +10:00
Michael Ellerman	d208e13c6a	powerpc/process: Fix uninitialised variable error Clang, and GCC with -Wmaybe-uninitialized, can't see that val is unused in get_fpexec_mode(): arch/powerpc/kernel/process.c:1940:7: error: variable 'val' is used uninitialized whenever 'if' condition is true if (cpu_has_feature(CPU_FTR_SPE)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ We know that CPU_FTR_SPE will only be true iff CONFIG_SPE is also true, but the compiler doesn't. Avoid it by initialising val to zero. Reported-by: kernel test robot <lkp@intel.com> Fixes: `532ed1900d` ("powerpc/process: Remove useless #ifdef CONFIG_SPE") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20200917024509.3253837-1-mpe@ellerman.id.au	2020-09-18 18:12:46 +10:00
Michael Ellerman	b5c8a2934e	Merge coregroup support into next From Srikar's cover letter, with some reformatting: Cleanup of existing powerpc topologies and add coregroup support on powerpc. Coregroup is a group of (subset of) cores of a DIE that share a resource. Summary of some of the testing done with coregroup patchset. It includes ebizzy, schbench, perf bench sched pipe and topology verification. On the left side are results from powerpc/next tree and on the right are the results with the patchset applied. Topological verification clearly shows that there is no change in topology with and without the patches on all the 3 class of systems that were tested. Power 9 PowerNV (2 Node/ 160 Cpu System) ---------------------------------------- Baseline Baseline + Coregroup Support N Min Max Median Avg Stddev N Min Max Median Avg Stddev 100 993884 1276090 1173476 1165914 54867.201 100 910470 1279820 1171095 1162091 67363.28 ^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better) schbench (latency hence lower is better) Latency percentiles (usec) Latency percentiles (usec) 50.0th: 455 50.0th: 454 75.0th: 533 75.0th: 543 90.0th: 683 90.0th: 701 95.0th: 743 95.0th: 737 99.0th: 815 99.0th: 805 99.5th: 839 99.5th: 835 99.9th: 913 99.9th: 893 min=0, max=1011 min=0, max=2833 perf bench sched pipe (lesser time and higher ops/sec is better) Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark: Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes Total time: 6.083 [sec] Total time: 6.303 [sec] 6.083576 usecs/op 6.303318 usecs/op 164377 ops/sec 158646 ops/sec Power 9 LPAR (2 Node/ 128 Cpu System) ------------------------------------- Baseline Baseline + Coregroup Support N Min Max Median Avg Stddev N Min Max Median Avg Stddev 100 1058029 1295393 1200414 1188306.7 56786.538 100 943264 1287619 1180522 1168473.2 64469.955 ^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better) schbench (latency hence lower is better) Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 34 50.0000th: 39 75.0000th: 46 75.0000th: 52 90.0000th: 53 90.0000th: 68 95.0000th: 56 95.0000th: 77 99.0000th: 61 99.0000th: 89 99.5000th: 63 99.5000th: 94 99.9000th: 81 99.9000th: 169 min=0, max=8405 min=0, max=23674 perf bench sched pipe (lesser time and higher ops/sec is better) Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark: Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes Total time: 8.768 [sec] Total time: 5.217 [sec] 8.768400 usecs/op 5.217625 usecs/op 114045 ops/sec 191658 ops/sec Power 8 LPAR (8 Node/ 256 Cpu System) ------------------------------------- Baseline Baseline + Coregroup Support N Min Max Median Avg Stddev N Min Max Median Avg Stddev 100 1267615 1965234 1707423 1689137.6 144363.29 100 1175357 1924262 1691104 1664792.1 145876.4 ^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better) schbench (latency hence lower is better) Latency percentiles (usec) Latency percentiles (usec) 50.0th: 37 50.0th: 36 75.0th: 51 75.0th: 48 90.0th: 59 90.0th: 55 95.0th: 63 95.0th: 59 99.0th: 71 99.0th: 67 99.5th: 75 99.5th: 72 99.9th: 105 99.9th: 170 min=0, max=18560 min=0, max=27031 perf bench sched pipe (lesser time and higher ops/sec is better) Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark: Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes Total time: 6.013 [sec] Total time: 5.930 [sec] 6.013963 usecs/op 5.930724 usecs/op 166279 ops/sec 168613 ops/sec Topology verification on Power9 Power9 / powernv / SMT4 $ tail /proc/cpuinfo cpu : POWER9, altivec supported clock : 3600.000000MHz revision : 2.2 (pvr 004e 1202) timebase : 512000000 platform : PowerNV model : 9006-22P machine : PowerNV 9006-22P firmware : OPAL MMU : Radix Baseline Baseline + Coregroup Support lscpu lscpu ------ ------ Architecture: ppc64le Architecture: ppc64le Byte Order: Little Endian Byte Order: Little Endian CPU(s): 160 CPU(s): 160 On-line CPU(s) list: 0-159 On-line CPU(s) list: 0-159 Thread(s) per core: 4 Thread(s) per core: 4 Core(s) per socket: 20 Core(s) per socket: 20 Socket(s): 2 Socket(s): 2 NUMA node(s): 2 NUMA node(s): 2 Model: 2.2 (pvr 004e 1202) Model: 2.2 (pvr 004e 1202) Model name: POWER9, altivec supported Model name: POWER9, altivec supported CPU max MHz: 3800.0000 CPU max MHz: 3800.0000 CPU min MHz: 2166.0000 CPU min MHz: 2166.0000 L1d cache: 32K L1d cache: 32K L1i cache: 32K L1i cache: 32K L2 cache: 512K L2 cache: 512K L3 cache: 10240K L3 cache: 10240K NUMA node0 CPU(s): 0-79 NUMA node0 CPU(s): 0-79 NUMA node8 CPU(s): 80-159 NUMA node8 CPU(s): 80-159 grep . /proc/sys/kernel/sched_domain/cpu0/domain/name grep . /proc/sys/kernel/sched_domain/cpu0/domain/name ----------------------------------------------------- ----------------------------------------------------- /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA grep . /proc/sys/kernel/sched_domain/cpu0/domain/flags grep . /proc/sys/kernel/sched_domain/cpu0/domain/flags ------------------------------------------------------ ------------------------------------------------------ /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801 /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801 Baseline head /proc/schedstat -------------------- version 15 timestamp 4295043536 cpu0 0 0 0 0 0 0 9597119314 2408913694 11897 domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cpu1 0 0 0 0 0 0 4941435230 11106132 1583 domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Baseline + Coregroup Support head /proc/schedstat -------------------- version 15 timestamp 4296311826 cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483 domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090 domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Post sudo ppc64_cpu --smt=1 Post sudo ppc64_cpu --smt=1 --------------------- --------------------- grep . /proc/sys/kernel/sched_domain/cpu0/domain/name grep . /proc/sys/kernel/sched_domain/cpu0/domain/name ----------------------------------------------------- ----------------------------------------------------- /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA grep . /proc/sys/kernel/sched_domain/cpu0/domain/flags grep . /proc/sys/kernel/sched_domain/cpu0/domain/flags ------------------------------------------------------ ------------------------------------------------------ /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801 Baseline: head /proc/schedstat -------------------- version 15 timestamp 4295046242 cpu0 0 0 0 0 0 0 10978610020 2658997390 13068 domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cpu4 0 0 0 0 0 0 5408663896 95701034 7697 domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Baseline + Coregroup Support head /proc/schedstat -------------------- version 15 timestamp 4296314905 cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723 domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776 domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Similar verification was done on Power 8 (8 Node 256 CPU LPAR) and Power 9 (2 node 128 Cpu LPAR) and they showed the topology before and after the patch to be identical. If Interested, I could provide the same. On Power 9 (with device-tree enablement to show coregroups): $ tail /proc/cpuinfo processor : 127 cpu : POWER9 (architected), altivec supported clock : 3000.000000MHz revision : 2.2 (pvr 004e 0202) timebase : 512000000 platform : pSeries model : IBM,9008-22L machine : CHRP IBM,9008-22L MMU : Hash Before patchset: $ cat /proc/sys/kernel/sched_domain/cpu0/domain/name SMT CACHE DIE NUMA $ head /proc/schedstat version 15 timestamp 4318242208 cpu0 0 0 0 0 0 0 28077107004 4773387362 78205 domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cpu1 0 0 0 0 0 0 24177439200 413887604 75393 domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 After patchset: $ cat /proc/sys/kernel/sched_domain/cpu0/domain/name SMT CACHE MC DIE NUMA $ head /proc/schedstat version 15 timestamp 4318242208 cpu0 0 0 0 0 0 0 28077107004 4773387362 78205 domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cpu1 0 0 0 0 0 0 24177439200 413887604 75393 domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0	2020-09-16 22:21:05 +10:00
Srikar Dronamraju	fa35e868f9	powerpc/smp: Implement cpu_to_coregroup_id Lookup the coregroup id from the associativity array. If unable to detect the coregroup id, fallback on the core id. This way, ensure sched_domain degenerates and an extra sched domain is not created. Ideally this function should have been implemented in arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we don't need to find the primary domain again. If the device-tree mentions more than one coregroup, then kernel implements only the last or the smallest coregroup, which currently corresponds to the penultimate domain in the device-tree. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com	2020-09-16 22:13:32 +10:00
Srikar Dronamraju	72730bfc2a	powerpc/smp: Create coregroup domain Add percpu coregroup maps and masks to create coregroup domain. If a coregroup doesn't exist, the coregroup domain will be degenerated in favour of SMT/CACHE domain. Do note this patch is only creating stubs for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation would be in a subsequent patch. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com	2020-09-16 22:13:32 +10:00
Srikar Dronamraju	6e08630281	powerpc/smp: Allocate cpumask only after searching thread group If allocated earlier and the search fails, then cpu_l1_cache_map cpumask is unnecessarily cleared. However cpu_l1_cache_map can be allocated / cleared after we search thread group. Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask allocated by zalloc_cpumask_var_node is never freed. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-9-srikar@linux.vnet.ibm.com	2020-09-16 22:13:32 +10:00
Srikar Dronamraju	f9f130ff2e	powerpc/numa: Detect support for coregroup Add support for grouping cores based on the device-tree classification. - The last domain in the associativity domains always refers to the core. - If primary reference domain happens to be the penultimate domain in the associativity domains device-tree property, then there are no coregroups. However if its not a penultimate domain, then there are coregroups. There can be more than one coregroup. For now we would be interested in the last or the smallest coregroups, i.e one sub-group per DIE. Currently there are no firmwares that are exposing this grouping. Hence allow the basis for grouping to be abstract. Once the firmware starts using this grouping, code would be added to detect the type of grouping and adjust the sd domain flags accordingly. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com	2020-09-16 22:13:31 +10:00
Srikar Dronamraju	caa8e29da5	powerpc/smp: Optimize start_secondary In start_secondary, even if shared_cache was already set, system does a redundant match for cpumask. This redundant check can be removed by checking if shared_cache is already set. While here, localize the sibling_mask variable to within the if condition. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-7-srikar@linux.vnet.ibm.com	2020-09-16 22:13:31 +10:00
Srikar Dronamraju	f6606cfdfb	powerpc/smp: Dont assume l2-cache to be superset of sibling Current code assumes that cpumask of cpus sharing a l2-cache mask will always be a superset of cpu_sibling_mask. Lets stop that assumption. cpu_l2_cache_mask is a superset of cpu_sibling_mask if and only if shared_caches is set. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200913171038.GB11808@linux.vnet.ibm.com	2020-09-16 22:13:31 +10:00
Srikar Dronamraju	3c6032a8fe	powerpc/smp: Move topology fixups into a new function Move topology fixup based on the platform attributes into its own function which is called just before set_sched_topology. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-5-srikar@linux.vnet.ibm.com	2020-09-16 22:13:31 +10:00
Srikar Dronamraju	5e93f16ae4	powerpc/smp: Move powerpc_topology above Just moving the powerpc_topology description above. This will help in using functions in this file and avoid declarations. No other functional changes Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-4-srikar@linux.vnet.ibm.com	2020-09-16 22:13:31 +10:00
Srikar Dronamraju	2ef0ca54d9	powerpc/smp: Merge Power9 topology with Power topology A new sched_domain_topology_level was added just for Power9. However the same can be achieved by merging powerpc_topology with power9_topology and makes the code more simpler especially when adding a new sched domain. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-3-srikar@linux.vnet.ibm.com	2020-09-16 22:13:31 +10:00
Srikar Dronamraju	d0fd24bbd2	powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES Fix a build warning in a non CONFIG_NEED_MULTIPLE_NODES "error: _numa_cpu_lookup_table_ undeclared" Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-2-srikar@linux.vnet.ibm.com	2020-09-16 22:13:30 +10:00
Srikar Dronamraju	e75130f20b	powerpc/numa: Offline memoryless cpuless node 0 Currently Linux kernel with CONFIG_NUMA on a system with multiple possible nodes, marks node 0 as online at boot. However in practice, there are systems which have node 0 as memoryless and cpuless. This can cause numa_balancing to be enabled on systems with only one node with memory and CPUs. The existence of this dummy node which is cpuless and memoryless node can confuse users/scripts looking at output of lscpu / numactl. By marking, node 0 as offline, lets stop assuming that node 0 is always online. If node 0 has CPU or memory that are online, node 0 will again be set as online. v5.8 available: 2 nodes (0,2) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: 0 1 2 3 4 5 6 7 node 2 size: 32625 MB node 2 free: 31490 MB node distances: node 0 2 0: 10 20 2: 20 10 proc and sys files ------------------ /sys/devices/system/node/online: 0,2 /proc/sys/kernel/numa_balancing: 1 /sys/devices/system/node/has_cpu: 2 /sys/devices/system/node/has_memory: 2 /sys/devices/system/node/has_normal_memory: 2 /sys/devices/system/node/possible: 0-31 v5.8 + patch ------------------ available: 1 nodes (2) node 2 cpus: 0 1 2 3 4 5 6 7 node 2 size: 32625 MB node 2 free: 31487 MB node distances: node 2 2: 10 proc and sys files ------------------ /sys/devices/system/node/online: 2 /proc/sys/kernel/numa_balancing: 0 /sys/devices/system/node/has_cpu: 2 /sys/devices/system/node/has_memory: 2 /sys/devices/system/node/has_normal_memory: 2 /sys/devices/system/node/possible: 0-31 Example of a node with online CPUs/memory on node 0. (Same o/p with and without patch) numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 0 size: 32482 MB node 0 free: 22994 MB node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 0 MB node 1 free: 0 MB node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 node 2 size: 0 MB node 2 free: 0 MB node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB node 3 free: 0 MB node distances: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 Note: On Powerpc, cpu_to_node of possible but not present cpus would previously return 0. Hence this commit depends on commit ("powerpc/numa: Set numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id queried from vphn"). Without the 2 commits, Powerpc system might crash. 1. User space applications like Numactl, lscpu, that parse the sysfs tend to believe there is an extra online node. This tends to confuse users and applications. Other user space applications start believing that system was not able to use all the resources (i.e missing resources) or the system was not setup correctly. 2. Also existence of dummy node also leads to inconsistent information. The number of online nodes is inconsistent with the information in the device-tree and resource-dump 3. When the dummy node is present, single node non-Numa systems end up showing up as NUMA systems and numa_balancing gets enabled. This will mean we take the hit from the unnecessary numa hinting faults. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com	2020-09-16 22:05:20 +10:00
Srikar Dronamraju	6398eaa268	powerpc/numa: Prefer node id queried from vphn Node id queried from the static device tree may not be correct. For example: it may always show 0 on a shared processor. Hence prefer the node id queried from vphn and fallback on the device tree based node id if vphn query fails. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com	2020-09-16 22:05:19 +10:00
Srikar Dronamraju	a874f1005e	powerpc/numa: Set numa_node for all possible cpus A Powerpc system with multiple possible nodes and with CONFIG_NUMA enabled always used to have a node 0, even if node 0 does not any cpus or memory attached to it. As per PAPR, node affinity of a cpu is only available once its present / online. For all cpus that are possible but not present, cpu_to_node() would point to node 0. To ensure a cpuless, memoryless dummy node is not online, powerpc need to make sure all possible but not present cpu_to_node are set to a proper node. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com	2020-09-16 22:05:19 +10:00
Srikar Dronamraju	67df77845c	powerpc/numa: Restrict possible nodes based on platform As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time Abstraction Services (RTAS) Node" available at: https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf ... there are 2 device tree properties: "ibm,max-associativity-domains" which defines the maximum number of domains that the firmware i.e PowerVM can support. and: "ibm,current-associativity-domains" which defines the maximum number of domains that the current platform can support. The value of "ibm,max-associativity-domains" is always greater than or equal to "ibm,current-associativity-domains" property. If the latter property is not available, use "ibm,max-associativity-domain" as a fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains" is mentioned in page 833 / B.5.3 which is covered under under "Appendix B. System Binding" section Currently powerpc uses the "ibm,max-associativity-domains" property while setting the possible number of nodes. This is currently set at 32. However the possible number of nodes for a platform may be significantly less. Hence set the possible number of nodes based on "ibm,current-associativity-domains" property. Nathan Lynch had raised a valid concern that post LPM (Live Partition Migration), a user could DLPAR add processors and memory after LPM with "new" associativity properties: https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u He also pointed out that "ibm,max-associativity-domains" has the same contents on all currently available PowerVM systems, unlike "ibm,current-associativity-domains" and hence may be better able to handle the new NUMA associativity properties. However with the recent commit `dbce456280` ("powerpc/numa: Limit possible nodes to within num_possible_nodes"), all new NUMA associativity properties are capped to initially set nr_node_ids. Hence this commit should be safe with any new DLPAR add post LPM. $ lsprop /proc/device-tree/rtas/ibm,associ-domains /proc/device-tree/rtas/ibm,current-associativity-domains 00000005 00000001 00000002 00000002 00000002 00000010 /proc/device-tree/rtas/ibm,max-associativity-domains 00000005 00000001 00000008 00000020 00000020 00000100 $ cat /sys/devices/system/node/possible ##Before patch 0-31 $ cat /sys/devices/system/node/possible ##After patch 0-1 Note the maximum nodes this platform can support is only 2 but the possible nodes is set to 32. This is important because lot of kernel and user space code allocate structures for all possible nodes leading to a lot of memory that is allocated but not used. I ran a simple experiment to create and destroy 100 memory cgroups on boot on a 8 node machine (Power8 Alpine). Before patch: free -k at boot total used free shared buff/cache available Mem: 523498176 4106816 518820608 22272 570752 516606720 Swap: 4194240 0 4194240 free -k after creating 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4628416 518246464 22336 623296 516058688 Swap: 4194240 0 4194240 free -k after destroying 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4697408 518173760 22400 627008 515987904 Swap: 4194240 0 4194240 After patch: free -k at boot total used free shared buff/cache available Mem: 523498176 3969472 518933888 22272 594816 516731776 Swap: 4194240 0 4194240 free -k after creating 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4181888 518676096 22208 640192 516496448 Swap: 4194240 0 4194240 free -k after destroying 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4232320 518619904 22272 645952 516443264 Swap: 4194240 0 4194240 Observations: Fixed kernel takes 137344 kb (4106816-3969472) less to boot. Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [mpe: Reformat change log a bit for readability] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com	2020-09-16 22:05:19 +10:00

1 2 3 4 5 ...

949505 Commits