linux_dsm_epyc7002/arch/x86/include/asm/topology.h

222 lines
6.1 KiB
C
Raw Normal View History

/*
* Written by: Matthew Dobson, IBM Corporation
*
* Copyright (C) 2002, IBM Corp.
*
* All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
* NON INFRINGEMENT. See the GNU General Public License for more
* details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*
* Send feedback to <colpatch@us.ibm.com>
*/
#ifndef _ASM_X86_TOPOLOGY_H
#define _ASM_X86_TOPOLOGY_H
/*
* to preserve the visibility of NUMA_NO_NODE definition,
* moved to there from here. May be used independent of
* CONFIG_NUMA.
*/
#include <linux/numa.h>
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
#ifdef CONFIG_NUMA
#include <linux/cpumask.h>
#include <asm/mpspec.h>
#include <asm/percpu.h>
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
/* Mappings between logical cpu number and node number */
DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map);
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
/*
* override generic percpu implementation of cpu_to_node
*/
extern int __cpu_to_node(int cpu);
#define cpu_to_node __cpu_to_node
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
extern int early_cpu_to_node(int cpu);
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
/* Same function but used if called before per_cpu areas are setup */
static inline int early_cpu_to_node(int cpu)
{
return early_per_cpu(x86_cpu_to_node_map, cpu);
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
}
#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
/* Mappings between node number and cpus on that node. */
extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
extern const struct cpumask *cpumask_of_node(int node);
#else
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
static inline const struct cpumask *cpumask_of_node(int node)
{
return node_to_cpumask_map[node];
}
#endif
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
extern void setup_node_to_cpumask_map(void);
#define pcibus_to_node(bus) __pcibus_to_node(bus)
extern int __node_distance(int, int);
#define node_distance(a, b) __node_distance(a, b)
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
#else /* !CONFIG_NUMA */
static inline int numa_node_id(void)
{
return 0;
}
numa: add generic percpu var numa_node_id() implementation Rework the generic version of the numa_node_id() function to use the new generic percpu variable infrastructure. Guard the new implementation with a new config option: CONFIG_USE_PERCPU_NUMA_NODE_ID. Archs which support this new implemention will default this option to 'y' when NUMA is configured. This config option could be removed if/when all archs switch over to the generic percpu implementation of numa_node_id(). Arch support involves: 1) converting any existing per cpu variable implementations to use this implementation. x86_64 is an instance of such an arch. 2) archs that don't use a per cpu variable for numa_node_id() will need to initialize the new per cpu variable "numa_node" as cpus are brought on-line. ia64 is an example. 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., when NUMA is configured. This is required because I have retained the old implementation by default to allow archs to be modified incrementally, as desired. Subsequent patches will convert x86_64 and ia64 to use this implemenation. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 04:44:56 +07:00
/*
* indicate override:
*/
#define numa_node_id numa_node_id
static inline int early_cpu_to_node(int cpu)
{
return 0;
}
x86: cleanup early per cpu variables/accesses v4 * Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is used by some per_cpu variables that are initialized and accessed before there are per_cpu areas allocated. ["Early" in respect to per_cpu variables is "earlier than the per_cpu areas have been setup".] This patchset adds these new macros: DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) EXPORT_EARLY_PER_CPU_SYMBOL(_name) DECLARE_EARLY_PER_CPU(_type, _name) early_per_cpu_ptr(_name) early_per_cpu_map(_name, _idx) early_per_cpu(_name, _cpu) The DEFINE macro defines the per_cpu variable as well as the early map and pointer. It also initializes the per_cpu variable and map elements to "_initvalue". The early_* macros provide access to the initial map (usually setup during system init) and the early pointer. This pointer is initialized to point to the early map but is then NULL'ed when the actual per_cpu areas are setup. After that the per_cpu variable is the correct access to the variable. The early_per_cpu() macro is not very efficient but does show how to access the variable if you have a function that can be called both "early" and "late". It tests the early ptr to be NULL, and if not then it's still valid. Otherwise, the per_cpu variable is used instead: #define early_per_cpu(_name, _cpu) \ (early_per_cpu_ptr(_name) ? \ early_per_cpu_ptr(_name)[_cpu] : \ per_cpu(_name, _cpu)) A better method is to actually check the pointer manually. In the case below, numa_set_node can be called both "early" and "late": void __cpuinit numa_set_node(int cpu, int node) { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); if (cpu_to_node_map) cpu_to_node_map[cpu] = node; else per_cpu(x86_cpu_to_node_map, cpu) = node; } * Add a flag "arch_provides_topology_pointers" that indicates pointers to topology cpumask_t maps are available. Otherwise, use the function returning the cpumask_t value. This is useful if cpumask_t set size is very large to avoid copying data on to/off of the stack. * The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while the non-debug case has been optimized a bit. * Remove an unreferenced compiler warning in drivers/base/topology.c * Clean up #ifdef in setup.c For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-13 02:21:12 +07:00
static inline void setup_node_to_cpumask_map(void) { }
#endif
#include <asm-generic/topology.h>
extern const struct cpumask *cpu_coregroup_mask(int cpu);
x86/topology: Create logical package id For per package oriented services we must be able to rely on the number of CPU packages to be within bounds. Create a tracking facility, which - calculates the number of possible packages depending on nr_cpu_ids after boot - makes sure that the package id is within the number of possible packages. If the apic id is outside we map it to a logical package id if there is enough space available. Provide interfaces for drivers to query the mapping and do translations from physcial to logical ids. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Harish Chegondi <harish.chegondi@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-23 05:19:15 +07:00
#define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id)
#define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id)
#define topology_logical_die_id(cpu) (cpu_data(cpu).logical_die_id)
#define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id)
#define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id)
#ifdef CONFIG_SMP
#define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu))
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
x86/topology: Create logical package id For per package oriented services we must be able to rely on the number of CPU packages to be within bounds. Create a tracking facility, which - calculates the number of possible packages depending on nr_cpu_ids after boot - makes sure that the package id is within the number of possible packages. If the apic id is outside we map it to a logical package id if there is enough space available. Provide interfaces for drivers to query the mapping and do translations from physcial to logical ids. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Harish Chegondi <harish.chegondi@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-23 05:19:15 +07:00
extern unsigned int __max_logical_packages;
#define topology_max_packages() (__max_logical_packages)
extern unsigned int __max_die_per_package;
static inline int topology_max_die_per_package(void)
{
return __max_die_per_package;
}
extern int __max_smt_threads;
static inline int topology_max_smt_threads(void)
{
return __max_smt_threads;
}
x86/topology: Create logical package id For per package oriented services we must be able to rely on the number of CPU packages to be within bounds. Create a tracking facility, which - calculates the number of possible packages depending on nr_cpu_ids after boot - makes sure that the package id is within the number of possible packages. If the apic id is outside we map it to a logical package id if there is enough space available. Provide interfaces for drivers to query the mapping and do translations from physcial to logical ids. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Harish Chegondi <harish.chegondi@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-23 05:19:15 +07:00
int topology_update_package_map(unsigned int apicid, unsigned int cpu);
int topology_update_die_map(unsigned int dieid, unsigned int cpu);
int topology_phys_to_logical_pkg(unsigned int pkg);
int topology_phys_to_logical_die(unsigned int die, unsigned int cpu);
bool topology_is_primary_thread(unsigned int cpu);
bool topology_smt_supported(void);
x86/topology: Create logical package id For per package oriented services we must be able to rely on the number of CPU packages to be within bounds. Create a tracking facility, which - calculates the number of possible packages depending on nr_cpu_ids after boot - makes sure that the package id is within the number of possible packages. If the apic id is outside we map it to a logical package id if there is enough space available. Provide interfaces for drivers to query the mapping and do translations from physcial to logical ids. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Harish Chegondi <harish.chegondi@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-23 05:19:15 +07:00
#else
#define topology_max_packages() (1)
static inline int
topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
static inline int
topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; }
x86/topology: Create logical package id For per package oriented services we must be able to rely on the number of CPU packages to be within bounds. Create a tracking facility, which - calculates the number of possible packages depending on nr_cpu_ids after boot - makes sure that the package id is within the number of possible packages. If the apic id is outside we map it to a logical package id if there is enough space available. Provide interfaces for drivers to query the mapping and do translations from physcial to logical ids. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Harish Chegondi <harish.chegondi@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-23 05:19:15 +07:00
static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
static inline int topology_phys_to_logical_die(unsigned int die,
unsigned int cpu) { return 0; }
static inline int topology_max_die_per_package(void) { return 1; }
static inline int topology_max_smt_threads(void) { return 1; }
static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
static inline bool topology_smt_supported(void) { return false; }
#endif
static inline void arch_fix_phys_package_id(int num, u32 slot)
{
}
struct pci_bus;
int x86_pci_root_bus_node(int bus);
void x86_pci_root_bus_resources(int bus, struct list_head *resources);
extern bool x86_topology_update;
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
#ifdef CONFIG_SCHED_MC_PRIO
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
#include <asm/percpu.h>
DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
extern unsigned int __read_mostly sysctl_sched_itmt_enabled;
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
/* Interface to set priority of a cpu */
void sched_set_itmt_core_prio(int prio, int core_cpu);
/* Interface to notify scheduler that system supports ITMT */
int sched_set_itmt_support(void);
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
/* Interface to notify scheduler that system revokes ITMT support */
void sched_clear_itmt_support(void);
#else /* CONFIG_SCHED_MC_PRIO */
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
#define sysctl_sched_itmt_enabled 0
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
{
}
static inline int sched_set_itmt_support(void)
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
{
return 0;
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
}
static inline void sched_clear_itmt_support(void)
{
}
#endif /* CONFIG_SCHED_MC_PRIO */
x86: Enable Intel Turbo Boost Max Technology 3.0 On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum turbo frequencies of some cores in a CPU package may be higher than for the other cores in the same package. In that case, better performance (and possibly lower energy consumption as well) can be achieved by making the scheduler prefer to run tasks on the CPUs with higher max turbo frequencies. To that end, set up a core priority metric to abstract the core preferences based on the maximum turbo frequency. In that metric, the cores with higher maximum turbo frequencies are higher-priority than the other cores in the same package and that causes the scheduler to favor them when making load-balancing decisions using the asymmertic packing approach. At the same time, the priority of SMT threads with a higher CPU number is reduced so as to avoid scheduling tasks on all of the threads that belong to a favored core before all of the other cores have been given a task to run. The priority metric will be initialized by the P-state driver with the help of the sched_set_itmt_core_prio() function. The P-state driver will also determine whether or not ITMT is supported by the platform and will call sched_set_itmt_support() to indicate that. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23 03:23:55 +07:00
x86, sched: Add support for frequency invariance Implement arch_scale_freq_capacity() for 'modern' x86. This function is used by the scheduler to correctly account usage in the face of DVFS. The present patch addresses Intel processors specifically and has positive performance and performance-per-watt implications for the schedutil cpufreq governor, bringing it closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework. Large performance gains are obtained when the machine is lightly loaded and no regression are observed at saturation. The benchmarks with the largest gains are kernel compilation, tbench (the networking version of dbench) and shell-intensive workloads. 1. FREQUENCY INVARIANCE: MOTIVATION * Without it, a task looks larger if the CPU runs slower 2. PECULIARITIES OF X86 * freq invariance accounting requires knowing the ratio freq_curr/freq_max 2.1 CURRENT FREQUENCY * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz") 2.2 MAX FREQUENCY * It varies with time (turbo). As an approximation, we set it to a constant, i.e. 4-cores turbo frequency. 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR * The invariant schedutil's formula has no feedback loop and reacts faster to utilization changes 4. KNOWN LIMITATIONS * In some cases tasks can't reach max util despite how hard they try 5. PERFORMANCE TESTING 5.1 MACHINES * Skylake, Broadwell, Haswell 5.2 SETUP * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and intel_pstate/powersave 5.3 BENCHMARK RESULTS 5.3.1 NEUTRAL BENCHMARKS * NAS Parallel Benchmark (HPC), hackbench 5.3.2 NON-NEUTRAL BENCHMARKS * tbench (10-30% better), kernbench (10-15% better), shell-intensive-scripts (30-50% better) * no regressions 5.3.3 SELECTION OF DETAILED RESULTS 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT * dbench (5% worse on one machine), kernbench (3% worse), tbench (5-10% better), shell-intensive-scripts (10-40% better) 6. MICROARCH'ES ADDRESSED HERE * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum etc have different MSRs semantic for querying turbo levels) 7. REFERENCES * MMTests performance testing framework, github.com/gormanm/mmtests +-------------------------------------------------------------------------+ | 1. FREQUENCY INVARIANCE: MOTIVATION +-------------------------------------------------------------------------+ For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false impression this CPU is almost at capacity, even though it can go faster [*]. In a nutshell, without frequency scale-invariance tasks look larger just because the CPU is running slower. [*] (footnote: this assumes a linear frequency/performance relation; which everybody knows to be false, but given realities its the best approximation we can make.) +-------------------------------------------------------------------------+ | 2. PECULIARITIES OF X86 +-------------------------------------------------------------------------+ Accounting for frequency changes in PELT signals requires the computation of the ratio freq_curr / freq_max. On x86 neither of those terms is readily available. 2.1 CURRENT FREQUENCY ==================== Since modern x86 has hardware control over the actual frequency we run at (because amongst other things, Turbo-Mode), we cannot simply use the frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs to compute the effective frequency over the recent past. Also, because reading MSRs is expensive, don't do so every time we need the value, but amortize the cost by doing it every tick. 2.2 MAX FREQUENCY ================= Obtaining freq_max is also non-trivial because at any time the hardware can provide a frequency boost to a selected subset of cores if the package has enough power to spare (eg: Turbo Boost). This means that the maximum frequency available to a given core changes with time. The approach taken in this change is to arbitrarily set freq_max to a constant value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most microarchitectures, after evaluating the following candidates: * 1-core (1C) turbo frequency (the fastest turbo state available) * around base frequency (a.k.a. max P-state) * something in between, such as 4C turbo To interpret these options, consider that this is the denominator in freq_curr/freq_max, and that ratio will be used to scale PELT signals such as util_avg and load_avg. A large denominator will undershoot (util_avg looks a bit smaller than it really is), viceversa with a smaller denominator PELT signals will tend to overshoot. Given that PELT drives frequency selection in the schedutil governor, we will have: freq_max set to | effect on DVFS --------------------+------------------ 1C turbo | power efficiency (lower freq choices) base freq | performance (higher util_avg, higher freq requests) 4C turbo | a bit of both 4C turbo proves to be a good compromise in a number of benchmarks (see below). +-------------------------------------------------------------------------+ | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR +-------------------------------------------------------------------------+ Once an architecture implements a frequency scale-invariant utilization (the PELT signal util_avg), schedutil switches its frequency selection formula from freq_next = 1.25 * freq_curr * util [non-invariant util signal] to freq_next = 1.25 * freq_max * util [invariant util signal] where, in the second formula, freq_max is set to the 1C turbo frequency (max turbo). The advantage of the second formula, whose usage we unlock with this patch, is that freq_next doesn't depend on the current frequency in an iterative fashion, but can jump to any frequency in a single update. This absence of feedback in the formula makes it quicker to react to utilization changes and more robust against pathological instabilities. Compare it to the update formula of intel_pstate/powersave: freq_next = 1.25 * freq_max * Busy% where again freq_max is 1C turbo and Busy% is the percentage of time not spent idling (calculated with delta_MPERF / delta_TSC); essentially the same as invariant schedutil, and largely responsible for intel_pstate/powersave good reputation. The non-invariant schedutil formula is derived from the invariant one by approximating util_inv with util_raw * freq_curr / freq_max, but this has limitations. Testing shows improved performances due to better frequency selections when the machine is lightly loaded, and essentially no change in behaviour at saturation / overutilization. +-------------------------------------------------------------------------+ | 4. KNOWN LIMITATIONS +-------------------------------------------------------------------------+ It's been shown that it is possible to create pathological scenarios where a CPU-bound task cannot reach max utilization, if the normalizing factor freq_max is fixed to a constant value (see [Lelli-2018]). If freq_max is set to 4C turbo as we do here, one needs to peg at least 5 cores in a package doing some busywork, and observe that none of those task will ever reach max util (1024) because they're all running at less than the 4C turbo frequency. While this concern still applies, we believe the performance benefit of frequency scale-invariant PELT signals outweights the cost of this limitation. [Lelli-2018] https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/ +-------------------------------------------------------------------------+ | 5. PERFORMANCE TESTING +-------------------------------------------------------------------------+ 5.1 MACHINES ============ We tested the patch on three machines, with Skylake, Broadwell and Haswell CPUs. The details are below, together with the available turbo ratios as reported by the appropriate MSRs. * 8x-SKYLAKE-UMA: Single socket E3-1240 v5, Skylake 4 cores/8 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 800 |******** BASE 3500 |*********************************** 4C 3700 |************************************* 3C 3800 |************************************** 2C 3900 |*************************************** 1C 3900 |*************************************** * 80x-BROADWELL-NUMA: Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2200 |********************** 8C 2900 |***************************** 7C 3000 |****************************** 6C 3100 |******************************* 5C 3200 |******************************** 4C 3300 |********************************* 3C 3400 |********************************** 2C 3600 |************************************ 1C 3600 |************************************ * 48x-HASWELL-NUMA Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2300 |*********************** 12C 2600 |************************** 11C 2600 |************************** 10C 2600 |************************** 9C 2600 |************************** 8C 2600 |************************** 7C 2600 |************************** 6C 2600 |************************** 5C 2700 |*************************** 4C 2800 |**************************** 3C 2900 |***************************** 2C 3100 |******************************* 1C 3100 |******************************* 5.2 SETUP ========= * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate driver in passive mode. * The rationale for choosing the various freq_max values to test have been to try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical on all machines), plus one more value closer to base_freq but still in the turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA). * In addition we've run all tests with intel_pstate/powersave for comparison. * The filesystem is always XFS, the userspace is openSUSE Leap 15.1. * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs with active intel_pstate on this machine use that. This gives, in terms of combinations tested on each machine: * 8x-SKYLAKE-UMA * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive * intel_pstate active + powersave + HWP * invariant schedutil, freq_max = 1C turbo * invariant schedutil, freq_max = 3C turbo * invariant schedutil, freq_max = 4C turbo * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA * [same as 8x-SKYLAKE-UMA, but no HWP capable] * invariant schedutil, freq_max = 8C turbo (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo") 5.3 BENCHMARK RESULTS ===================== 5.3.1 NEUTRAL BENCHMARKS ------------------------ Tests that didn't show any measurable difference in performance on any of the test machines between non-invariant schedutil and our patch are: * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any computational kernel * flexible I/O (FIO) * hackbench (using threads or processes, and using pipes or sockets) 5.3.2 NON-NEUTRAL BENCHMARKS ---------------------------- What follow are summary tables where each benchmark result is given a score. * A tilde (~) means a neutral result, i.e. no difference from baseline. * Scores are computed with the ratio result_new / result_baseline, so a tilde means a score of 1.00. * The results in the score ratio are the geometric means of results running the benchmark with different parameters (eg: for kernbench: using 1, 2, 4, ... number of processes; for pgbench: varying the number of clients, and so on). * The first three tables show higher-is-better kind of tests (i.e. measured in operations/second), the subsequent three show lower-is-better kind of tests (i.e. the workload is fixed and we measure elapsed time, think kernbench). * "gitsource" is a name we made up for the test consisting in running the entire unit tests suite of the Git SCM and measuring how long it takes. We take it as a typical example of shell-intensive serialized workload. * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other columns show invariant schedutil for different values of freq_max. 4C turbo is circled as it's the value we've chosen for the final implementation. 80x-BROADWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.14 ~ ~ | 1.11 | 1.14 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07 netperf-tcp ~ 1.03 ~ | 1.01 | 1.02 tbench4 1.57 1.18 1.22 | 1.30 | 1.56 +------+ 8x-SKYLAKE-UMA (comparison ratio; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.30 1.14 1.14 | 1.16 | +------+ 48x-HASWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.15 ~ ~ | 1.06 | 1.16 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02 netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01 tbench4 1.50 1.05 1.13 | 1.13 | 1.25 +------+ In the table above we see that active intel_pstate is slightly better than our 4C-turbo patch (both in reference to the baseline non-invariant schedutil) on read-only pgbench and much better on tbench. Both cases are notable in which it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant schedutil to get closer. If we ignore active intel_pstate and focus on the comparison with baseline alone, there are several instances of double-digit performance improvement. 80x-BROADWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 1.23 0.95 0.95 | 0.95 | 0.95 kernbench 0.93 0.83 0.83 | 0.83 | 0.82 gitsource 0.98 0.49 0.49 | 0.49 | 0.48 +------+ 8x-SKYLAKE-UMA (comparison ratio; lower is better) +------+ I_PSTATE/HWP 1C 3C | 4C | dbench4 ~ ~ ~ | ~ | kernbench ~ ~ ~ | ~ | gitsource 0.92 0.55 0.55 | 0.55 | +------+ 48x-HASWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 ~ ~ ~ | ~ | ~ kernbench 0.94 0.90 0.89 | 0.90 | 0.90 gitsource 0.97 0.69 0.69 | 0.69 | 0.69 +------+ dbench is not very remarkable here, unless we notice how poorly active intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus non-invariant schedutil. We repeated that run getting consistent results. Out of scope for the patch at hand, but deserving future investigation. Other than that, we previously ran this campaign with Linux v5.0 and saw the patch doing better on dbench a the time. We haven't checked closely and can only speculate at this point. On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in the detailed tables that the gains concentrate on low process counts (lightly loaded machines). The test we call "gitsource" (running the git unit test suite, a long-running single-threaded shell script) appears rather spectacular in this table (gains of 30-50% depending on the machine). It is to be noted, however, that gitsource has no adjustable parameters (such as the number of jobs in kernbench, which we average over in order to get a single-number summary score) and is exactly the kind of low-parallelism workload that benefits the most from this patch. When looking at the detailed tables of kernbench or tbench4, at low process or client counts one can see similar numbers. 5.3.3 SELECTION OF DETAILED RESULTS ----------------------------------- Machine : 48x-HASWELL-NUMA Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%) Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%) Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%) Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%) Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%) Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%) Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%) Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%) Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%) Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%) Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%) Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%) Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%) Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%) Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%) Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%) Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%) This is one of the cases where the patch still can't surpass active intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are visible up to 16 clients and the saturated scenario is the same as baseline. The scores in the summary table from the previous sections are ratios of geometric means of the results over different clients, as seen in this table. Machine : 80x-BROADWELL-NUMA Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%) Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%) Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%) Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%) Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%) Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%) Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%) Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%) Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%) Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%) Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%) Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%) Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%) Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%) Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%) The patch outperform active intel_pstate (and baseline) by a considerable margin; the summary table from the previous section says 4C turbo and active intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is 0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no noticeable difference with regard to the value of freq_max. Machine : 8x-SKYLAKE-UMA Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%) 5.2.0 3C-turbo 5.2.0 4C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%) In this test, which is of interest as representing shell-intensive (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms intel_pstate/powersave by a whopping 40% margin. 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT --------------------------------------------- The following table shows average power consumption in watt for each benchmark. Data comes from turbostat (package average), which in turn is read from the RAPL interface on CPUs. We know the patch affects CPU frequencies so it's reasonable to ignore other power consumers (such as memory or I/O). Also, we don't have a power meter available in the lab so RAPL is the best we have. turbostat sampled average power every 10 seconds for the entire duration of each benchmark. We took all those values and averaged them (i.e. with don't have detail on a per-parameter granularity, only on whole benchmarks). 80x-BROADWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 8C pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84 pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54 dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94 netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95 netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20 tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07 kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10 gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70 +--------+ 8x-SKYLAKE-UMA (power consumption, watts) +--------+ BASELINE I_PSTATE/HWP 1C 3C | 4C | pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 | pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 | dbench4 27.28 27.37 27.49 27.41 | 27.38 | netperf-udp 22.33 22.41 22.36 22.35 | 22.36 | netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 | tbench4 41.13 45.61 43.10 43.33 | 43.56 | kernbench 42.56 42.63 43.01 43.01 | 43.01 | gitsource 13.32 13.69 17.33 17.30 | 17.35 | +--------+ 48x-HASWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 12C pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86 pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31 dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79 netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52 netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95 tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22 kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04 gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23 +--------+ A lower power consumption isn't necessarily better, it depends on what is done with that energy. Here are tables with the ratio of performance-per-watt on each machine and benchmark. Higher is always better; a tilde (~) means a neutral ratio (i.e. 1.00). 80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08 pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97 dbench4 1.24 0.94 0.95 | 0.94 | 0.92 netperf-udp ~ 1.02 1.02 | ~ | 1.02 netperf-tcp ~ 1.02 ~ | ~ | 1.02 tbench4 1.26 1.10 1.06 | 1.12 | 1.26 kernbench 0.98 0.97 0.97 | 0.97 | 0.98 gitsource ~ 1.11 1.11 | 1.11 | 1.13 +------+ 8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw 0.95 0.97 0.96 | 0.96 | dbench4 ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.17 1.09 1.08 | 1.10 | kernbench ~ ~ ~ | ~ | gitsource 1.06 1.40 1.40 | 1.40 | +------+ 48x-HASWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11 pgbench-rw ~ 0.86 ~ | ~ | 0.86 dbench4 ~ 1.02 1.02 | 1.02 | ~ netperf-udp ~ 0.97 1.03 | 1.02 | ~ netperf-tcp 0.96 ~ ~ | ~ | ~ tbench4 1.24 ~ 1.06 | 1.05 | 1.11 kernbench 0.97 0.97 0.98 | 0.97 | 0.96 gitsource 1.03 1.33 1.32 | 1.32 | 1.33 +------+ These results are overall pleasing: in plenty of cases we observe performance-per-watt improvements. The few regressions (read/write pgbench and dbench on the Broadwell machine) are of small magnitude. kernbench loses a few percentage points (it has a 10-15% performance improvement, but apparently the increase in power consumption is larger than that). tbench4 and gitsource, which benefit the most from the patch, keep a positive score in this table which is a welcome surprise; that suggests that in those particular workloads the non-invariant schedutil (and active intel_pstate, too) makes some rather suboptimal frequency selections. +-------------------------------------------------------------------------+ | 6. MICROARCH'ES ADDRESSED HERE +-------------------------------------------------------------------------+ The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies respectively. This excludes the recent Xeon Scalable Performance processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently. Subsequent patches will address: * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus * Xeon Phi (Knights Landing, Knights Mill) * Atom Silvermont +-------------------------------------------------------------------------+ | 7. REFERENCES +-------------------------------------------------------------------------+ Tests have been run with the help of the MMTests performance testing framework, see github.com/gormanm/mmtests. The configuration file names for the benchmark used are: db-pgbench-timed-ro-small-xfs db-pgbench-timed-rw-small-xfs io-dbench4-async-xfs network-netperf-unbound network-tbench scheduler-unbound workload-kerndevel-xfs workload-shellscripts-xfs hpc-nas-c-class-mpi-full-xfs hpc-nas-c-class-omp-full All those benchmarks are generally available on the web: pgbench: https://www.postgresql.org/docs/10/pgbench.html netperf: https://hewlettpackard.github.io/netperf/ dbench/tbench: https://dbench.samba.org/ gitsource: git unit test suite, github.com/git/git NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Doug Smythies <dsmythies@telus.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
2020-01-22 22:16:12 +07:00
#ifdef CONFIG_SMP
#include <asm/cpufeature.h>
DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
DECLARE_PER_CPU(unsigned long, arch_freq_scale);
static inline long arch_scale_freq_capacity(int cpu)
{
return per_cpu(arch_freq_scale, cpu);
}
#define arch_scale_freq_capacity arch_scale_freq_capacity
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
extern void arch_set_max_freq_ratio(bool turbo_disabled);
#else
static inline void arch_set_max_freq_ratio(bool turbo_disabled)
{
}
x86, sched: Add support for frequency invariance Implement arch_scale_freq_capacity() for 'modern' x86. This function is used by the scheduler to correctly account usage in the face of DVFS. The present patch addresses Intel processors specifically and has positive performance and performance-per-watt implications for the schedutil cpufreq governor, bringing it closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework. Large performance gains are obtained when the machine is lightly loaded and no regression are observed at saturation. The benchmarks with the largest gains are kernel compilation, tbench (the networking version of dbench) and shell-intensive workloads. 1. FREQUENCY INVARIANCE: MOTIVATION * Without it, a task looks larger if the CPU runs slower 2. PECULIARITIES OF X86 * freq invariance accounting requires knowing the ratio freq_curr/freq_max 2.1 CURRENT FREQUENCY * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz") 2.2 MAX FREQUENCY * It varies with time (turbo). As an approximation, we set it to a constant, i.e. 4-cores turbo frequency. 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR * The invariant schedutil's formula has no feedback loop and reacts faster to utilization changes 4. KNOWN LIMITATIONS * In some cases tasks can't reach max util despite how hard they try 5. PERFORMANCE TESTING 5.1 MACHINES * Skylake, Broadwell, Haswell 5.2 SETUP * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and intel_pstate/powersave 5.3 BENCHMARK RESULTS 5.3.1 NEUTRAL BENCHMARKS * NAS Parallel Benchmark (HPC), hackbench 5.3.2 NON-NEUTRAL BENCHMARKS * tbench (10-30% better), kernbench (10-15% better), shell-intensive-scripts (30-50% better) * no regressions 5.3.3 SELECTION OF DETAILED RESULTS 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT * dbench (5% worse on one machine), kernbench (3% worse), tbench (5-10% better), shell-intensive-scripts (10-40% better) 6. MICROARCH'ES ADDRESSED HERE * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum etc have different MSRs semantic for querying turbo levels) 7. REFERENCES * MMTests performance testing framework, github.com/gormanm/mmtests +-------------------------------------------------------------------------+ | 1. FREQUENCY INVARIANCE: MOTIVATION +-------------------------------------------------------------------------+ For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false impression this CPU is almost at capacity, even though it can go faster [*]. In a nutshell, without frequency scale-invariance tasks look larger just because the CPU is running slower. [*] (footnote: this assumes a linear frequency/performance relation; which everybody knows to be false, but given realities its the best approximation we can make.) +-------------------------------------------------------------------------+ | 2. PECULIARITIES OF X86 +-------------------------------------------------------------------------+ Accounting for frequency changes in PELT signals requires the computation of the ratio freq_curr / freq_max. On x86 neither of those terms is readily available. 2.1 CURRENT FREQUENCY ==================== Since modern x86 has hardware control over the actual frequency we run at (because amongst other things, Turbo-Mode), we cannot simply use the frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs to compute the effective frequency over the recent past. Also, because reading MSRs is expensive, don't do so every time we need the value, but amortize the cost by doing it every tick. 2.2 MAX FREQUENCY ================= Obtaining freq_max is also non-trivial because at any time the hardware can provide a frequency boost to a selected subset of cores if the package has enough power to spare (eg: Turbo Boost). This means that the maximum frequency available to a given core changes with time. The approach taken in this change is to arbitrarily set freq_max to a constant value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most microarchitectures, after evaluating the following candidates: * 1-core (1C) turbo frequency (the fastest turbo state available) * around base frequency (a.k.a. max P-state) * something in between, such as 4C turbo To interpret these options, consider that this is the denominator in freq_curr/freq_max, and that ratio will be used to scale PELT signals such as util_avg and load_avg. A large denominator will undershoot (util_avg looks a bit smaller than it really is), viceversa with a smaller denominator PELT signals will tend to overshoot. Given that PELT drives frequency selection in the schedutil governor, we will have: freq_max set to | effect on DVFS --------------------+------------------ 1C turbo | power efficiency (lower freq choices) base freq | performance (higher util_avg, higher freq requests) 4C turbo | a bit of both 4C turbo proves to be a good compromise in a number of benchmarks (see below). +-------------------------------------------------------------------------+ | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR +-------------------------------------------------------------------------+ Once an architecture implements a frequency scale-invariant utilization (the PELT signal util_avg), schedutil switches its frequency selection formula from freq_next = 1.25 * freq_curr * util [non-invariant util signal] to freq_next = 1.25 * freq_max * util [invariant util signal] where, in the second formula, freq_max is set to the 1C turbo frequency (max turbo). The advantage of the second formula, whose usage we unlock with this patch, is that freq_next doesn't depend on the current frequency in an iterative fashion, but can jump to any frequency in a single update. This absence of feedback in the formula makes it quicker to react to utilization changes and more robust against pathological instabilities. Compare it to the update formula of intel_pstate/powersave: freq_next = 1.25 * freq_max * Busy% where again freq_max is 1C turbo and Busy% is the percentage of time not spent idling (calculated with delta_MPERF / delta_TSC); essentially the same as invariant schedutil, and largely responsible for intel_pstate/powersave good reputation. The non-invariant schedutil formula is derived from the invariant one by approximating util_inv with util_raw * freq_curr / freq_max, but this has limitations. Testing shows improved performances due to better frequency selections when the machine is lightly loaded, and essentially no change in behaviour at saturation / overutilization. +-------------------------------------------------------------------------+ | 4. KNOWN LIMITATIONS +-------------------------------------------------------------------------+ It's been shown that it is possible to create pathological scenarios where a CPU-bound task cannot reach max utilization, if the normalizing factor freq_max is fixed to a constant value (see [Lelli-2018]). If freq_max is set to 4C turbo as we do here, one needs to peg at least 5 cores in a package doing some busywork, and observe that none of those task will ever reach max util (1024) because they're all running at less than the 4C turbo frequency. While this concern still applies, we believe the performance benefit of frequency scale-invariant PELT signals outweights the cost of this limitation. [Lelli-2018] https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/ +-------------------------------------------------------------------------+ | 5. PERFORMANCE TESTING +-------------------------------------------------------------------------+ 5.1 MACHINES ============ We tested the patch on three machines, with Skylake, Broadwell and Haswell CPUs. The details are below, together with the available turbo ratios as reported by the appropriate MSRs. * 8x-SKYLAKE-UMA: Single socket E3-1240 v5, Skylake 4 cores/8 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 800 |******** BASE 3500 |*********************************** 4C 3700 |************************************* 3C 3800 |************************************** 2C 3900 |*************************************** 1C 3900 |*************************************** * 80x-BROADWELL-NUMA: Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2200 |********************** 8C 2900 |***************************** 7C 3000 |****************************** 6C 3100 |******************************* 5C 3200 |******************************** 4C 3300 |********************************* 3C 3400 |********************************** 2C 3600 |************************************ 1C 3600 |************************************ * 48x-HASWELL-NUMA Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2300 |*********************** 12C 2600 |************************** 11C 2600 |************************** 10C 2600 |************************** 9C 2600 |************************** 8C 2600 |************************** 7C 2600 |************************** 6C 2600 |************************** 5C 2700 |*************************** 4C 2800 |**************************** 3C 2900 |***************************** 2C 3100 |******************************* 1C 3100 |******************************* 5.2 SETUP ========= * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate driver in passive mode. * The rationale for choosing the various freq_max values to test have been to try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical on all machines), plus one more value closer to base_freq but still in the turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA). * In addition we've run all tests with intel_pstate/powersave for comparison. * The filesystem is always XFS, the userspace is openSUSE Leap 15.1. * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs with active intel_pstate on this machine use that. This gives, in terms of combinations tested on each machine: * 8x-SKYLAKE-UMA * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive * intel_pstate active + powersave + HWP * invariant schedutil, freq_max = 1C turbo * invariant schedutil, freq_max = 3C turbo * invariant schedutil, freq_max = 4C turbo * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA * [same as 8x-SKYLAKE-UMA, but no HWP capable] * invariant schedutil, freq_max = 8C turbo (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo") 5.3 BENCHMARK RESULTS ===================== 5.3.1 NEUTRAL BENCHMARKS ------------------------ Tests that didn't show any measurable difference in performance on any of the test machines between non-invariant schedutil and our patch are: * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any computational kernel * flexible I/O (FIO) * hackbench (using threads or processes, and using pipes or sockets) 5.3.2 NON-NEUTRAL BENCHMARKS ---------------------------- What follow are summary tables where each benchmark result is given a score. * A tilde (~) means a neutral result, i.e. no difference from baseline. * Scores are computed with the ratio result_new / result_baseline, so a tilde means a score of 1.00. * The results in the score ratio are the geometric means of results running the benchmark with different parameters (eg: for kernbench: using 1, 2, 4, ... number of processes; for pgbench: varying the number of clients, and so on). * The first three tables show higher-is-better kind of tests (i.e. measured in operations/second), the subsequent three show lower-is-better kind of tests (i.e. the workload is fixed and we measure elapsed time, think kernbench). * "gitsource" is a name we made up for the test consisting in running the entire unit tests suite of the Git SCM and measuring how long it takes. We take it as a typical example of shell-intensive serialized workload. * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other columns show invariant schedutil for different values of freq_max. 4C turbo is circled as it's the value we've chosen for the final implementation. 80x-BROADWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.14 ~ ~ | 1.11 | 1.14 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07 netperf-tcp ~ 1.03 ~ | 1.01 | 1.02 tbench4 1.57 1.18 1.22 | 1.30 | 1.56 +------+ 8x-SKYLAKE-UMA (comparison ratio; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.30 1.14 1.14 | 1.16 | +------+ 48x-HASWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.15 ~ ~ | 1.06 | 1.16 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02 netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01 tbench4 1.50 1.05 1.13 | 1.13 | 1.25 +------+ In the table above we see that active intel_pstate is slightly better than our 4C-turbo patch (both in reference to the baseline non-invariant schedutil) on read-only pgbench and much better on tbench. Both cases are notable in which it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant schedutil to get closer. If we ignore active intel_pstate and focus on the comparison with baseline alone, there are several instances of double-digit performance improvement. 80x-BROADWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 1.23 0.95 0.95 | 0.95 | 0.95 kernbench 0.93 0.83 0.83 | 0.83 | 0.82 gitsource 0.98 0.49 0.49 | 0.49 | 0.48 +------+ 8x-SKYLAKE-UMA (comparison ratio; lower is better) +------+ I_PSTATE/HWP 1C 3C | 4C | dbench4 ~ ~ ~ | ~ | kernbench ~ ~ ~ | ~ | gitsource 0.92 0.55 0.55 | 0.55 | +------+ 48x-HASWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 ~ ~ ~ | ~ | ~ kernbench 0.94 0.90 0.89 | 0.90 | 0.90 gitsource 0.97 0.69 0.69 | 0.69 | 0.69 +------+ dbench is not very remarkable here, unless we notice how poorly active intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus non-invariant schedutil. We repeated that run getting consistent results. Out of scope for the patch at hand, but deserving future investigation. Other than that, we previously ran this campaign with Linux v5.0 and saw the patch doing better on dbench a the time. We haven't checked closely and can only speculate at this point. On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in the detailed tables that the gains concentrate on low process counts (lightly loaded machines). The test we call "gitsource" (running the git unit test suite, a long-running single-threaded shell script) appears rather spectacular in this table (gains of 30-50% depending on the machine). It is to be noted, however, that gitsource has no adjustable parameters (such as the number of jobs in kernbench, which we average over in order to get a single-number summary score) and is exactly the kind of low-parallelism workload that benefits the most from this patch. When looking at the detailed tables of kernbench or tbench4, at low process or client counts one can see similar numbers. 5.3.3 SELECTION OF DETAILED RESULTS ----------------------------------- Machine : 48x-HASWELL-NUMA Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%) Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%) Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%) Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%) Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%) Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%) Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%) Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%) Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%) Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%) Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%) Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%) Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%) Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%) Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%) Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%) Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%) This is one of the cases where the patch still can't surpass active intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are visible up to 16 clients and the saturated scenario is the same as baseline. The scores in the summary table from the previous sections are ratios of geometric means of the results over different clients, as seen in this table. Machine : 80x-BROADWELL-NUMA Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%) Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%) Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%) Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%) Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%) Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%) Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%) Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%) Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%) Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%) Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%) Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%) Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%) Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%) Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%) The patch outperform active intel_pstate (and baseline) by a considerable margin; the summary table from the previous section says 4C turbo and active intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is 0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no noticeable difference with regard to the value of freq_max. Machine : 8x-SKYLAKE-UMA Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%) 5.2.0 3C-turbo 5.2.0 4C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%) In this test, which is of interest as representing shell-intensive (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms intel_pstate/powersave by a whopping 40% margin. 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT --------------------------------------------- The following table shows average power consumption in watt for each benchmark. Data comes from turbostat (package average), which in turn is read from the RAPL interface on CPUs. We know the patch affects CPU frequencies so it's reasonable to ignore other power consumers (such as memory or I/O). Also, we don't have a power meter available in the lab so RAPL is the best we have. turbostat sampled average power every 10 seconds for the entire duration of each benchmark. We took all those values and averaged them (i.e. with don't have detail on a per-parameter granularity, only on whole benchmarks). 80x-BROADWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 8C pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84 pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54 dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94 netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95 netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20 tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07 kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10 gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70 +--------+ 8x-SKYLAKE-UMA (power consumption, watts) +--------+ BASELINE I_PSTATE/HWP 1C 3C | 4C | pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 | pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 | dbench4 27.28 27.37 27.49 27.41 | 27.38 | netperf-udp 22.33 22.41 22.36 22.35 | 22.36 | netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 | tbench4 41.13 45.61 43.10 43.33 | 43.56 | kernbench 42.56 42.63 43.01 43.01 | 43.01 | gitsource 13.32 13.69 17.33 17.30 | 17.35 | +--------+ 48x-HASWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 12C pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86 pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31 dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79 netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52 netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95 tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22 kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04 gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23 +--------+ A lower power consumption isn't necessarily better, it depends on what is done with that energy. Here are tables with the ratio of performance-per-watt on each machine and benchmark. Higher is always better; a tilde (~) means a neutral ratio (i.e. 1.00). 80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08 pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97 dbench4 1.24 0.94 0.95 | 0.94 | 0.92 netperf-udp ~ 1.02 1.02 | ~ | 1.02 netperf-tcp ~ 1.02 ~ | ~ | 1.02 tbench4 1.26 1.10 1.06 | 1.12 | 1.26 kernbench 0.98 0.97 0.97 | 0.97 | 0.98 gitsource ~ 1.11 1.11 | 1.11 | 1.13 +------+ 8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw 0.95 0.97 0.96 | 0.96 | dbench4 ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.17 1.09 1.08 | 1.10 | kernbench ~ ~ ~ | ~ | gitsource 1.06 1.40 1.40 | 1.40 | +------+ 48x-HASWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11 pgbench-rw ~ 0.86 ~ | ~ | 0.86 dbench4 ~ 1.02 1.02 | 1.02 | ~ netperf-udp ~ 0.97 1.03 | 1.02 | ~ netperf-tcp 0.96 ~ ~ | ~ | ~ tbench4 1.24 ~ 1.06 | 1.05 | 1.11 kernbench 0.97 0.97 0.98 | 0.97 | 0.96 gitsource 1.03 1.33 1.32 | 1.32 | 1.33 +------+ These results are overall pleasing: in plenty of cases we observe performance-per-watt improvements. The few regressions (read/write pgbench and dbench on the Broadwell machine) are of small magnitude. kernbench loses a few percentage points (it has a 10-15% performance improvement, but apparently the increase in power consumption is larger than that). tbench4 and gitsource, which benefit the most from the patch, keep a positive score in this table which is a welcome surprise; that suggests that in those particular workloads the non-invariant schedutil (and active intel_pstate, too) makes some rather suboptimal frequency selections. +-------------------------------------------------------------------------+ | 6. MICROARCH'ES ADDRESSED HERE +-------------------------------------------------------------------------+ The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies respectively. This excludes the recent Xeon Scalable Performance processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently. Subsequent patches will address: * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus * Xeon Phi (Knights Landing, Knights Mill) * Atom Silvermont +-------------------------------------------------------------------------+ | 7. REFERENCES +-------------------------------------------------------------------------+ Tests have been run with the help of the MMTests performance testing framework, see github.com/gormanm/mmtests. The configuration file names for the benchmark used are: db-pgbench-timed-ro-small-xfs db-pgbench-timed-rw-small-xfs io-dbench4-async-xfs network-netperf-unbound network-tbench scheduler-unbound workload-kerndevel-xfs workload-shellscripts-xfs hpc-nas-c-class-mpi-full-xfs hpc-nas-c-class-omp-full All those benchmarks are generally available on the web: pgbench: https://www.postgresql.org/docs/10/pgbench.html netperf: https://hewlettpackard.github.io/netperf/ dbench/tbench: https://dbench.samba.org/ gitsource: git unit test suite, github.com/git/git NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Doug Smythies <dsmythies@telus.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
2020-01-22 22:16:12 +07:00
#endif
#endif /* _ASM_X86_TOPOLOGY_H */