linux_dsm_epyc7002/kernel/sched/cpupri.h

#ifndef _LINUX_CPUPRI_H
#define _LINUX_CPUPRI_H

#include <linux/sched.h>

#define CPUPRI_NR_PRIORITIES	(MAX_RT_PRIO + 2)

#define CPUPRI_INVALID -1
#define CPUPRI_IDLE     0
#define CPUPRI_NORMAL   1
/* values 2-101 are RT priorities 0-99 */

struct cpupri_vec {
	atomic_t	count;
	cpumask_var_t	mask;
};

struct cpupri {
	struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
	int *cpu_to_pri;
};

#ifdef CONFIG_SMP
int  cpupri_find(struct cpupri *cp,
		 struct task_struct *p, struct cpumask *lowest_mask);
void cpupri_set(struct cpupri *cp, int cpu, int pri);
int cpupri_init(struct cpupri *cp);
void cpupri_cleanup(struct cpupri *cp);
#endif

#endif /* _LINUX_CPUPRI_H */
sched: use a 2-d bitmap for searching lowest-pri CPU The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2008-05-13 02:21:01 +07:00			`#ifndef _LINUX_CPUPRI_H`
			`#define _LINUX_CPUPRI_H`

			`#include <linux/sched.h>`

sched: fix the cpuprio count really Peter pointed out that the last version of the "fix" was still one off under certain circumstances. Use BITS_TO_LONG instead to get an accurate result. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2008-06-05 15:28:00 +07:00			`#define CPUPRI_NR_PRIORITIES (MAX_RT_PRIO + 2)`
sched: use a 2-d bitmap for searching lowest-pri CPU The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2008-05-13 02:21:01 +07:00
			`#define CPUPRI_INVALID -1`
			`#define CPUPRI_IDLE 0`
			`#define CPUPRI_NORMAL 1`
			`/* values 2-101 are RT priorities 0-99 */`

			`struct cpupri_vec {`
sched/cpupri: Remove the vec->lock sched/cpupri: Remove the vec->lock The cpupri vec->lock has been showing up as a top contention lately. This is because of the RT push/pull logic takes an agressive approach for migrating RT tasks. The cpupri logic is in place to improve the performance of the push/pull when dealing with large number CPU machines. The problem though is a vec->lock is required, where a vec is a global per RT priority structure. That is, if there are lots of RT tasks at the same priority, every time they are added or removed from the RT queue, this global vec->lock is taken. Now that more kernel threads are becoming RT (RCU boost and threaded interrupts) this is becoming much more of an issue. There are two variables that are being synced by the vec->lock. The cpupri bitmask, and the vec->counter. The cpupri bitmask is one bit per priority. If a RT priority vec has a process queued, then the vec->count is > 0 and the cpupri bitmask is set for that RT priority. If the cpupri bitmask gets out of sync with the vec->counter, we could end up pushing a low proirity RT task to a high priority queue. That RT task that could have run immediately could be queued on a run queue with a higher priority task indefinitely. The solution is not to use the cpupri bitmask and just look at the vec->count directly when doing a pull. The cpupri bitmask is just a fast way to scan the RT priorities when a pull is made. Instead of using the bitmask, and just examine all RT priorities, and look at the vec->counts, we could eliminate the vec->lock. The scan of RT tasks is to find a run queue that we can push an RT task to, and we do not push to a high priority queue, thus the scan only needs to go from 1 to RT task->prio, and not all 100 RT priorities. The push algorithm, which does the scan of RT priorities (and scan of the bitmask) only happens when we have an overloaded RT run queue (more than one RT task queued). The grabbing of the vec->lock happens every time any RT task is queued or dequeued on the run queue for that priority. The slowing down of the scan by not using a bitmask is negligible by the speed up of removing the vec->lock contention, and replacing it with an atomic counter and memory barrier. To prove this, I wrote a patch that times both the loop and the code that grabs the vec->locks. I passed the patches to various people (and companies) to test and show the results. I let everyone choose their own load to test, giving different loads on the system, for various different setups. Here's some of the results: (snipping to a few CPUs to not make this change log huge, but the results were consistent across the entire system). System 1 (24 CPUs) Before patch: CPU: Name Count Max Min Average Total ---- ---- ----- --- --- ------- ----- [...] cpu 20: loop 3057 1.766 0.061 0.642 1963.170 vec 6782949 90.469 0.089 0.414 2811760.503 cpu 21: loop 2617 1.723 0.062 0.641 1679.074 vec 6782810 90.499 0.089 0.291 1978499.900 cpu 22: loop 2212 1.863 0.063 0.699 1547.160 vec 6767244 85.685 0.089 0.435 2949676.898 cpu 23: loop 2320 2.013 0.062 0.594 1380.265 vec 6781694 87.923 0.088 0.431 2928538.224 After patch: cpu 20: loop 2078 1.579 0.061 0.533 1108.006 vec 6164555 5.704 0.060 0.143 885185.809 cpu 21: loop 2268 1.712 0.065 0.575 1305.248 vec 6153376 5.558 0.060 0.187 1154960.469 cpu 22: loop 1542 1.639 0.095 0.533 823.249 vec 6156510 5.720 0.060 0.190 1172727.232 cpu 23: loop 1650 1.733 0.068 0.545 900.781 vec 6170784 5.533 0.060 0.167 1034287.953 All times are in microseconds. The 'loop' is the amount of time spent doing the loop across the priorities (before patch uses bitmask). the 'vec' is the amount of time in the code that requires grabbing the vec->lock. The second patch just does not have the vec lock, but encompasses the same code. Amazingly the loop code even went down on average. The vec code went from .5 down to .18, that's more than half the time spent! Note, more than one test was run, but they all had the same results. System 2 (64 CPUs) Before patch: CPU: Name Count Max Min Average Total ---- ---- ----- --- --- ------- ----- cpu 60: loop 0 0 0 0 0 vec 5410840 277.954 0.084 0.782 4232895.727 cpu 61: loop 0 0 0 0 0 vec 4915648 188.399 0.084 0.570 2803220.301 cpu 62: loop 0 0 0 0 0 vec 5356076 276.417 0.085 0.786 4214544.548 cpu 63: loop 0 0 0 0 0 vec 4891837 170.531 0.085 0.799 3910948.833 After patch: cpu 60: loop 0 0 0 0 0 vec 5365118 5.080 0.021 0.063 340490.267 cpu 61: loop 0 0 0 0 0 vec 4898590 1.757 0.019 0.071 347903.615 cpu 62: loop 0 0 0 0 0 vec 5737130 3.067 0.021 0.119 687108.734 cpu 63: loop 0 0 0 0 0 vec 4903228 1.822 0.021 0.071 348506.477 The test run during the measurement did not have any (very few, from other CPUs) RT tasks pushing. But this shows that it helped out tremendously with the contention, as the contention happens because the vec->lock is taken only on queuing at an RT priority, and different CPUs that queue tasks at the same priority will have contention. I tested on my own 4 CPU machine with the following results: Before patch: CPU: Name Count Max Min Average Total ---- ---- ----- --- --- ------- ----- cpu 0: loop 2377 1.489 0.158 0.588 1398.395 vec 4484 770.146 2.301 4.396 19711.755 cpu 1: loop 2169 1.962 0.160 0.576 1250.110 vec 4425 152.769 2.297 4.030 17834.228 cpu 2: loop 2324 1.749 0.155 0.559 1299.799 vec 4368 779.632 2.325 4.665 20379.268 cpu 3: loop 2325 1.629 0.157 0.561 1306.113 vec 4650 408.782 2.394 4.348 20222.577 After patch: CPU: Name Count Max Min Average Total ---- ---- ----- --- --- ------- ----- cpu 0: loop 2121 1.616 0.113 0.636 1349.189 vec 4303 1.151 0.225 0.421 1811.966 cpu 1: loop 2130 1.638 0.178 0.644 1372.927 vec 4627 1.379 0.235 0.428 1983.648 cpu 2: loop 2056 1.464 0.165 0.637 1310.141 vec 4471 1.311 0.217 0.433 1937.927 cpu 3: loop 2154 1.481 0.162 0.601 1295.083 vec 4236 1.253 0.230 0.425 1803.008 This was running my migrate.c code that can be found at: http://lwn.net/Articles/425763/ The migrate code does stress the RT tasks a bit. This shows that the loop did increase a little after the patch, but not by much. The vec code dropped dramatically. From 4.3us down to .42us. That's a 10x improvement! Tested-by: Mike Galbraith <mgalbraith@suse.de> Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com> Tested-by: Matthew Hank Sabins<msabins@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-08-03 03:36:12 +07:00			`atomic_t count;`
			`cpumask_var_t mask;`
sched: use a 2-d bitmap for searching lowest-pri CPU The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2008-05-13 02:21:01 +07:00			`};`

			`struct cpupri {`
			`struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];`
sched/cpupri: Replace NR_CPUS arrays Tejun reported that his resume was failing due to order-3 allocations from sched_domain building. Replace the NR_CPUS arrays in there with a dynamically allocated array. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-14 21:04:26 +07:00			`int *cpu_to_pri;`
sched: use a 2-d bitmap for searching lowest-pri CPU The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2008-05-13 02:21:01 +07:00			`};`

			`#ifdef CONFIG_SMP`
			`int cpupri_find(struct cpupri *cp,`
cpumask: remove cpumask_t from core Impact: cleanup struct cpumask is nicer, and we use it to make where we've made code safe for CONFIG_CPUMASK_OFFSTACK=y. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2009-03-31 11:05:16 +07:00			`struct task_struct p, struct cpumask lowest_mask);`
sched: use a 2-d bitmap for searching lowest-pri CPU The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2008-05-13 02:21:01 +07:00			`void cpupri_set(struct cpupri *cp, int cpu, int pri);`
sched: No need for bootmem special cases As of commit dcce284 ("mm: Extend gfp masking to the page allocator") and commit 7e85ee0 ("slab,slub: don't enable interrupts during early boot"), the slab allocator makes sure we don't attempt to sleep during boot. Therefore, remove bootmem special cases from the scheduler and use plain GFP_KERNEL instead. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> 2010-07-16 03:18:22 +07:00			`int cpupri_init(struct cpupri *cp);`
sched: convert struct cpupri_vec cpumask_var_t. Impact: stack usage reduction, (future) size reduction for large NR_CPUS. Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves space for small nr_cpu_ids but big CONFIG_NR_CPUS. The fact cpupro_init is called both before and after the slab is available makes for an ugly parameter unfortunately. We also use cpumask_any_and to get rid of a temporary in cpupri_find. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> 2008-11-24 23:05:13 +07:00			`void cpupri_cleanup(struct cpupri *cp);`
sched: use a 2-d bitmap for searching lowest-pri CPU The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2008-05-13 02:21:01 +07:00			`#endif`

			`#endif /* _LINUX_CPUPRI_H */`