2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2011-11-15 23:14:39 +07:00
|
|
|
* kernel/sched/core.c
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Kernel scheduler and related syscalls
|
|
|
|
*
|
|
|
|
* Copyright (C) 1991-2002 Linus Torvalds
|
|
|
|
*
|
|
|
|
* 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and
|
|
|
|
* make semaphores SMP safe
|
|
|
|
* 1998-11-19 Implemented schedule_timeout() and related stuff
|
|
|
|
* by Andrea Arcangeli
|
|
|
|
* 2002-01-04 New ultra-scalable O(1) scheduler by Ingo Molnar:
|
|
|
|
* hybrid priority-list and round-robin design with
|
|
|
|
* an array-switch method of distributing timeslices
|
|
|
|
* and per-CPU runqueues. Cleanups and useful suggestions
|
|
|
|
* by Davide Libenzi, preemptible kernel bits by Robert Love.
|
|
|
|
* 2003-09-03 Interactivity tuning by Con Kolivas.
|
|
|
|
* 2004-04-02 Scheduler domains code by Nick Piggin
|
2007-07-09 23:52:01 +07:00
|
|
|
* 2007-04-15 Work begun on replacing all interactivity tuning with a
|
|
|
|
* fair scheduling design by Con Kolivas.
|
|
|
|
* 2007-05-05 Load balancing (smp-nice) and other improvements
|
|
|
|
* by Peter Williams
|
|
|
|
* 2007-05-06 Interactivity improvements to CFS by Mike Galbraith
|
|
|
|
* 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri
|
2008-01-26 03:08:19 +07:00
|
|
|
* 2007-11-29 RT balancing improvements by Steven Rostedt, Gregory Haskins,
|
|
|
|
* Thomas Gleixner, Mike Kravetz
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
|
2016-03-10 05:08:18 +07:00
|
|
|
#include <linux/kasan.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/nmi.h>
|
|
|
|
#include <linux/init.h>
|
2007-07-09 23:52:00 +07:00
|
|
|
#include <linux/uaccess.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/highmem.h>
|
2016-04-26 23:39:06 +07:00
|
|
|
#include <linux/mmu_context.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/interrupt.h>
|
2006-01-12 03:17:46 +07:00
|
|
|
#include <linux/capability.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/completion.h>
|
|
|
|
#include <linux/kernel_stat.h>
|
2006-07-03 14:24:33 +07:00
|
|
|
#include <linux/debug_locks.h>
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has outgrown its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks go to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 17:02:48 +07:00
|
|
|
#include <linux/perf_event.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/security.h>
|
|
|
|
#include <linux/notifier.h>
|
|
|
|
#include <linux/profile.h>
|
2006-12-07 11:34:23 +07:00
|
|
|
#include <linux/freezer.h>
|
[PATCH] scheduler cache-hot-autodetect
From: Ingo Molnar <mingo@elte.hu>
This is the latest version of the scheduler cache-hot-auto-tune patch.
The first problem was that detection time scaled with O(N^2), which is
unacceptable on larger SMP and NUMA systems. To solve this:
- I've added a 'domain distance' function, which is used to cache
measurement results. Each distance is only measured once. This means
that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
distances 0 and 1, and on SMP distance 0 is measured. The code walks
the domain tree to determine the distance, so it automatically follows
whatever hierarchy an architecture sets up. This cuts down on the boot
time significantly and removes the O(N^2) limit. The only assumption
is that migration costs can be expressed as a function of domain
distance - this covers the overwhelming majority of existing systems,
and is a good guess even for more asymmetric systems.
[ People hacking systems that have asymmetries that break this
assumption (e.g. different CPU speeds) should experiment a bit with
the cpu_distance() function. Adding a ->migration_distance factor to
the domain structure would be one possible solution - but let's first
see the problem systems, if they exist at all. Let's not overdesign. ]
Another problem was that only a single cache-size was used for measuring
the cost of migration, and most architectures didn't set that variable
up. Furthermore, a single cache-size does not fit NUMA hierarchies with
L3 caches and does not fit HT setups, where different CPUs will often
have different 'effective cache sizes'. To solve this problem:
- Instead of relying on a single cache-size provided by the platform and
sticking to it, the code now auto-detects the 'effective migration
cost' between two measured CPUs, via iterating through a wide range of
cachesizes. The code searches for the maximum migration cost, which
occurs when the working set of the test-workload falls just below the
'effective cache size'. I.e. real-life optimized search is done for
the maximum migration cost, between two real CPUs.
This, amongst other things, has the positive effect that if e.g. two
CPUs share an L2/L3 cache, a different (and accurate) migration cost
will be found than between two CPUs on the same system that don't share
any caches.
(The reliable measurement of migration costs is tricky - see the source
for details.)
Furthermore I've added various boot-time options to override/tune
migration behavior.
Firstly, there's a blanket override for autodetection:
migration_cost=1000,2000,3000
will override the depth 0/1/2 values with 1msec/2msec/3msec values.
Secondly, there's a global factor that can be used to increase (or
decrease) the autodetected values:
migration_factor=120
will increase the autodetected values by 20%. This option is useful to
tune things in a workload-dependent way - e.g. if a workload is
cache-insensitive then CPU utilization can be maximized by specifying
migration_factor=0.
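As a rough illustration of the percentage semantics described above, the following sketch scales a cost by a factor where 100 means "unchanged". The helper name scale_migration_cost and the nanosecond unit are assumptions made for this example; this is not the code the patch adds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): scale an autodetected
 * migration cost by a percentage factor. 100 leaves the value
 * unchanged, 120 raises it by 20%, 0 reduces the cost to zero.
 */
static uint64_t scale_migration_cost(uint64_t cost_ns, unsigned int factor_percent)
{
	/* Refuse inputs whose product would overflow 64 bits. */
	assert(factor_percent == 0 || cost_ns <= UINT64_MAX / factor_percent);

	return cost_ns * factor_percent / 100;
}
```

Under this model, migration_factor=120 would turn a 1 msec autodetected cost into 1.2 msec, and migration_factor=0 would make every migration look free.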
I've tested the autodetection code quite extensively on x86, on 3
P3/Xeon/2MB, and the autodetected values look pretty good:
Dual Celeron (128K L2 cache):
---------------------
migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
---------------------
[00] [01]
[00]: - 1.7(1)
[01]: 1.7(1) -
---------------------
cacheflush times [2]: 0.0 (0) 1.7 (1784008)
---------------------
Here the slow memory subsystem dominates system performance, and even
though caches are small, the migration cost is 1.7 msecs.
Dual HT P4 (512K L2 cache):
---------------------
migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
---------------------
[00] [01] [02] [03]
[00]: - 0.4(1) 0.0(0) 0.4(1)
[01]: 0.4(1) - 0.4(1) 0.0(0)
[02]: 0.0(0) 0.4(1) - 0.4(1)
[03]: 0.4(1) 0.0(0) 0.4(1) -
---------------------
cacheflush times [2]: 0.0 (33900) 0.4 (448514)
---------------------
Here it can be seen that there is no migration cost between two HT
siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.
8-way P3/Xeon [2MB L2 cache]:
---------------------
migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
---------------------
[00] [01] [02] [03] [04] [05] [06] [07]
[00]: - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
[01]: 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
[02]: 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
[03]: 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1)
[04]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1)
[05]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1)
[06]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1)
[07]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) -
---------------------
cacheflush times [2]: 0.0 (0) 19.2 (19281756)
---------------------
This one has huge caches and a relatively slow memory subsystem - so the
migration cost is 19 msecs.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: <wilder@us.ibm.com>
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 16:05:30 +07:00
|
|
|
#include <linux/vmalloc.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/blkdev.h>
|
|
|
|
#include <linux/delay.h>
|
2007-10-19 13:40:14 +07:00
|
|
|
#include <linux/pid_namespace.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/smp.h>
|
|
|
|
#include <linux/threads.h>
|
|
|
|
#include <linux/timer.h>
|
|
|
|
#include <linux/rcupdate.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/cpuset.h>
|
|
|
|
#include <linux/percpu.h>
|
2008-10-06 16:23:43 +07:00
|
|
|
#include <linux/proc_fs.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/seq_file.h>
|
2007-07-26 18:40:43 +07:00
|
|
|
#include <linux/sysctl.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/syscalls.h>
|
|
|
|
#include <linux/times.h>
|
2006-10-01 13:28:59 +07:00
|
|
|
#include <linux/tsacct_kern.h>
|
2006-03-26 16:38:20 +07:00
|
|
|
#include <linux/kprobes.h>
|
2006-07-14 14:24:37 +07:00
|
|
|
#include <linux/delayacct.h>
|
2007-07-09 23:52:00 +07:00
|
|
|
#include <linux/unistd.h>
|
2007-09-21 14:19:54 +07:00
|
|
|
#include <linux/pagemap.h>
|
2008-01-26 03:08:29 +07:00
|
|
|
#include <linux/hrtimer.h>
|
2008-03-18 06:19:05 +07:00
|
|
|
#include <linux/tick.h>
|
2008-04-20 00:45:00 +07:00
|
|
|
#include <linux/ctype.h>
|
2008-05-13 02:20:42 +07:00
|
|
|
#include <linux/ftrace.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 15:04:11 +07:00
|
|
|
#include <linux/slab.h>
|
2011-10-27 04:14:16 +07:00
|
|
|
#include <linux/init_task.h>
|
2012-11-28 01:33:25 +07:00
|
|
|
#include <linux/context_tracking.h>
|
2014-04-08 05:39:20 +07:00
|
|
|
#include <linux/compiler.h>
|
2016-02-29 11:22:38 +07:00
|
|
|
#include <linux/frame.h>
|
sched/cputime: Mitigate performance regression in times()/clock_gettime()
Commit:
6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
allow a task to wake early. It addressed the problem by calling the scheduling
class's update_curr() when the cputimer starts.
Said change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.
This patch mitigates the performance loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.
Here is the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percent for 5-20
threads and more than 10% for 2 or >20 threads.
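The gain formula quoted above can be written down as a small helper. This is only a sketch; the function name gain_percent is made up for illustration and does not come from MMTests.

```c
#include <assert.h>

/*
 * Percentage gain as defined in the text: (before - after) / before.
 * A positive result means the patched kernel needed less time.
 */
static double gain_percent(double before_secs, double after_secs)
{
	assert(before_secs > 0.0);	/* avoid division by zero */

	return (before_secs - after_secs) / before_secs * 100.0;
}
```

The printed percentages were presumably computed from unrounded averages, so feeding the rounded seconds from the tables into this helper may give slightly different numbers.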
pound_clock_gettime:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.48 3.06 ( 11.83%)
5 3.33 3.25 ( 2.40%)
8 3.37 3.26 ( 3.30%)
12 3.32 3.37 ( -1.60%)
21 4.01 3.90 ( 2.74%)
30 3.63 3.36 ( 7.41%)
48 3.71 3.11 ( 16.27%)
79 3.75 3.16 ( 15.74%)
110 3.81 3.25 ( 14.80%)
128 3.88 3.31 ( 14.76%)
pound_times:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.65 3.25 ( 11.03%)
5 3.45 3.17 ( 7.92%)
8 3.52 3.22 ( 8.69%)
12 3.29 3.36 ( -2.04%)
21 4.07 3.92 ( 3.78%)
30 3.87 3.40 ( 12.17%)
48 3.79 3.16 ( 16.61%)
79 3.88 3.28 ( 15.42%)
110 3.90 3.38 ( 13.35%)
128 4.00 3.38 ( 15.45%)
pound_times and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettime(). The results above can be
reproduced by cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above will run "poundtime" measuring the kernel currently running on
the machine; once a new kernel is installed and the machine rebooted,
running again
$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)
will produce results to compare with. A comparison table will be output
with:
$ cd mmtests/work/log
$ ../../compare-kernels.sh
The table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clarity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be
struct sched_entity *curr = cfs_rq->curr;
and
delta_exec = now - curr->exec_start;
in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr is called in the
interested execution path.
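The idea of prefetching shortly before the data is read can be sketched in user space as follows. This is not the patch itself: struct entity and delta_exec are hypothetical names, and the sketch uses the GCC/Clang builtin __builtin_prefetch rather than the kernel's prefetch() helper from <linux/prefetch.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the entity whose exec_start is read. */
struct entity {
	uint64_t exec_start;
};

/*
 * Ask the CPU to start loading *curr early, so that the later read of
 * curr->exec_start is more likely to hit the cache. The prefetch is
 * only a hint and never changes the computed value.
 */
static uint64_t delta_exec(const struct entity *curr, uint64_t now)
{
	assert(curr != NULL);

	__builtin_prefetch(curr);	/* GCC/Clang builtin, read prefetch */

	/* ... unrelated work would go here while the line is fetched ... */

	return now - curr->exec_start;
}
```

Whether such a hint helps depends on how much independent work sits between the prefetch and the load, which is why the changelog backs the change with measurements.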
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):
threads cycles before cycles after gain
2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88%
5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85%
8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74%
12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74%
21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89%
30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92%
48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10%
79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33%
110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21%
128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99%
A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):
threads L1 misses/total*100 L1 misses/total*100 gain
before after
2 7.43 +-4.90% 7.36 +-4.70% 0.94%
5 13.09 +-4.74% 13.52 +-3.73% -3.28%
8 13.79 +-5.61% 12.90 +-3.27% 6.45%
12 11.57 +-2.44% 8.71 +-1.40% 24.72%
21 12.39 +-3.92% 9.97 +-1.84% 19.53%
30 13.91 +-2.53% 11.73 +-2.28% 15.67%
48 13.71 +-1.59% 12.32 +-1.97% 10.14%
79 14.44 +-0.66% 13.40 +-1.06% 7.20%
110 15.86 +-0.50% 14.46 +-0.59% 8.83%
128 16.51 +-0.32% 15.06 +-0.78% 8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.23 3.68 ( -64.56%) 3.48 (-55.48%)
5 2.83 3.78 ( -33.42%) 3.33 (-17.43%)
8 2.84 4.31 ( -52.12%) 3.37 (-18.76%)
12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%)
21 3.14 4.63 ( -47.36%) 4.01 (-27.71%)
30 3.28 5.75 ( -75.37%) 3.63 (-10.80%)
48 3.02 6.05 (-100.56%) 3.71 (-22.99%)
79 2.88 6.30 (-118.90%) 3.75 (-30.26%)
110 2.95 6.46 (-119.00%) 3.81 (-29.24%)
128 3.05 6.42 (-110.08%) 3.88 (-27.04%)
pound_times:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.27 3.73 ( -64.71%) 3.65 (-61.14%)
5 2.78 3.77 ( -35.56%) 3.45 (-23.98%)
8 2.79 4.41 ( -57.71%) 3.52 (-26.05%)
12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%)
21 3.10 4.61 ( -48.74%) 4.07 (-31.34%)
30 3.33 5.75 ( -72.53%) 3.87 (-16.01%)
48 2.96 6.06 (-105.04%) 3.79 (-28.10%)
79 2.88 6.24 (-116.83%) 3.88 (-34.81%)
110 2.98 6.37 (-114.08%) 3.90 (-31.12%)
128 3.10 6.35 (-104.61%) 4.00 (-28.87%)
The source code of the two benchmarks follows. To compile the two:
NR_THREADS=42
for FILE in pound_times pound_clock_gettime; do
gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lpthread -lrt
done
==== BEGIN pound_times.c ====
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>
struct tms start;
void *pound (void *threadid)
{
struct tms end;
int oldutime = 0;
int utime;
int i;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
times(&end);
utime = ((int)end.tms_utime - (int)start.tms_utime);
if (oldutime > utime) {
printf("utime decreased, was %d, now %d!\n", oldutime, utime);
}
oldutime = utime;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long i;
times(&start);
for (i = 0; i < NUM_THREADS; i++) {
pthread_create (&th[i], NULL, pound, (void *)i);
}
pthread_exit(NULL);
return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
#include <stdio.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>
void *pound (void *threadid)
{
struct timespec ts;
int rc, i;
unsigned long prev = 0, this = 0;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
if (rc < 0)
perror("clock_gettime");
this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
if (0 && this < prev)
printf("%lu ns timewarp at iteration %d\n", prev - this, i);
prev = this;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long rc, i;
pid_t pgid;
for (i = 0; i < NUM_THREADS; i++) {
rc = pthread_create(&th[i], NULL, pound, (void *)i);
if (rc != 0) /* pthread_create() returns a positive error number on failure */
fprintf(stderr, "pthread_create: error %ld\n", rc);
}
pthread_exit(NULL);
return 0;
}
==== END pound_clock_gettime.c ====
Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-05 15:21:56 +07:00
|
|
|
#include <linux/prefetch.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-03-29 00:30:03 +07:00
|
|
|
#include <asm/switch_to.h>
|
2007-05-08 14:32:57 +07:00
|
|
|
#include <asm/tlb.h>
|
2007-10-24 23:23:50 +07:00
|
|
|
#include <asm/irq_regs.h>
|
2012-01-11 14:58:16 +07:00
|
|
|
#include <asm/mutex.h>
|
2011-07-12 02:28:17 +07:00
|
|
|
#ifdef CONFIG_PARAVIRT
|
|
|
|
#include <asm/paravirt.h>
|
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
#include "sched.h"
|
2013-01-19 05:05:55 +07:00
|
|
|
#include "../workqueue_internal.h"
|
2012-04-20 20:05:45 +07:00
|
|
|
#include "../smpboot.h"
|
2008-05-13 02:21:01 +07:00
|
|
|
|
2009-04-10 20:36:00 +07:00
|
|
|
#define CREATE_TRACE_POINTS
|
2009-04-15 06:39:12 +07:00
|
|
|
#include <trace/events/sched.h>
|
2009-04-10 20:36:00 +07:00
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
DEFINE_MUTEX(sched_domains_mutex);
|
|
|
|
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
|
2010-06-08 16:40:42 +07:00
|
|
|
|
2010-12-09 20:15:34 +07:00
|
|
|
static void update_rq_clock_task(struct rq *rq, s64 delta);
|
2010-10-05 07:03:21 +07:00
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
void update_rq_clock(struct rq *rq)
|
2008-05-03 23:29:28 +07:00
|
|
|
{
|
2010-12-09 20:15:34 +07:00
|
|
|
s64 delta;
|
2010-10-05 07:03:21 +07:00
|
|
|
|
2015-01-05 17:18:11 +07:00
|
|
|
lockdep_assert_held(&rq->lock);
|
|
|
|
|
|
|
|
if (rq->clock_skip_update & RQCF_ACT_SKIP)
|
2010-12-08 17:05:42 +07:00
|
|
|
return;
|
2010-10-05 07:03:22 +07:00
|
|
|
|
2010-12-09 20:15:34 +07:00
|
|
|
delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
|
2014-06-24 12:49:40 +07:00
|
|
|
if (delta < 0)
|
|
|
|
return;
|
2010-12-09 20:15:34 +07:00
|
|
|
rq->clock += delta;
|
|
|
|
update_rq_clock_task(rq, delta);
|
2008-05-03 23:29:28 +07:00
|
|
|
}
|
|
|
|
|
2007-10-15 22:00:04 +07:00
|
|
|
/*
|
|
|
|
* Debugging: various feature bits
|
|
|
|
*/
|
2008-04-20 00:45:00 +07:00
|
|
|
|
|
|
|
#define SCHED_FEAT(name, enabled) \
|
|
|
|
(1UL << __SCHED_FEAT_##name) * enabled |
|
|
|
|
|
2007-10-15 22:00:04 +07:00
|
|
|
const_debug unsigned int sysctl_sched_features =
|
2011-11-15 23:14:39 +07:00
|
|
|
#include "features.h"
|
2008-04-20 00:45:00 +07:00
|
|
|
0;
|
|
|
|
|
|
|
|
#undef SCHED_FEAT
|
|
|
|
|
2007-11-10 04:39:39 +07:00
|
|
|
/*
|
|
|
|
* Number of tasks to iterate in a single balance run.
|
|
|
|
* Limited because this is done with IRQs disabled.
|
|
|
|
*/
|
|
|
|
const_debug unsigned int sysctl_sched_nr_migrate = 32;
|
|
|
|
|
2009-09-01 15:34:37 +07:00
|
|
|
/*
|
|
|
|
* period over which we average the RT time consumption, measured
|
|
|
|
* in ms.
|
|
|
|
*
|
|
|
|
* default: 1s
|
|
|
|
*/
|
|
|
|
const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
|
|
|
|
|
2008-01-26 03:08:29 +07:00
|
|
|
/*
|
2008-02-13 21:45:39 +07:00
|
|
|
* period over which we measure -rt task cpu usage in us.
|
2008-01-26 03:08:29 +07:00
|
|
|
* default: 1s
|
|
|
|
*/
|
2008-02-13 21:45:39 +07:00
|
|
|
unsigned int sysctl_sched_rt_period = 1000000;
|
2008-01-26 03:08:29 +07:00
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
__read_mostly int scheduler_running;
|
2008-02-13 20:02:36 +07:00
|
|
|
|
2008-02-13 21:45:39 +07:00
|
|
|
/*
|
|
|
|
* part of the period that we allow rt tasks to run in us.
|
|
|
|
* default: 0.95s
|
|
|
|
*/
|
|
|
|
int sysctl_sched_rt_runtime = 950000;
|
2008-01-26 03:08:29 +07:00
|
|
|
|
2015-03-09 23:12:07 +07:00
|
|
|
/* cpus with isolated domains */
|
|
|
|
cpumask_var_t cpu_isolated_map;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2006-12-10 17:20:00 +07:00
|
|
|
* this_rq_lock - lock this runqueue and disable interrupts.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2007-10-15 22:00:13 +07:00
|
|
|
static struct rq *this_rq_lock(void)
|
2005-04-17 05:20:36 +07:00
|
|
|
__acquires(rq->lock)
|
|
|
|
{
|
2006-07-03 14:25:42 +07:00
|
|
|
struct rq *rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
local_irq_disable();
|
|
|
|
rq = this_rq();
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_lock(&rq->lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
|
2016-04-28 21:16:33 +07:00
|
|
|
/*
|
|
|
|
* __task_rq_lock - lock the rq @p resides on.
|
|
|
|
*/
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
|
2016-04-28 21:16:33 +07:00
|
|
|
__acquires(rq->lock)
|
|
|
|
{
|
|
|
|
struct rq *rq;
|
|
|
|
|
|
|
|
lockdep_assert_held(&p->pi_lock);
|
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
rq = task_rq(p);
|
|
|
|
raw_spin_lock(&rq->lock);
|
|
|
|
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
|
2015-08-02 00:25:08 +07:00
|
|
|
rf->cookie = lockdep_pin_lock(&rq->lock);
|
2016-04-28 21:16:33 +07:00
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
raw_spin_unlock(&rq->lock);
|
|
|
|
|
|
|
|
while (unlikely(task_on_rq_migrating(p)))
|
|
|
|
cpu_relax();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
|
|
|
|
*/
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
|
2016-04-28 21:16:33 +07:00
|
|
|
__acquires(p->pi_lock)
|
|
|
|
__acquires(rq->lock)
|
|
|
|
{
|
|
|
|
struct rq *rq;
|
|
|
|
|
|
|
|
for (;;) {
|
2015-08-01 02:28:18 +07:00
|
|
|
raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
|
2016-04-28 21:16:33 +07:00
|
|
|
rq = task_rq(p);
|
|
|
|
raw_spin_lock(&rq->lock);
|
|
|
|
/*
|
|
|
|
* move_queued_task() task_rq_lock()
|
|
|
|
*
|
|
|
|
* ACQUIRE (rq->lock)
|
|
|
|
* [S] ->on_rq = MIGRATING [L] rq = task_rq()
|
|
|
|
* WMB (__set_task_cpu()) ACQUIRE (rq->lock);
|
|
|
|
* [S] ->cpu = new_cpu [L] task_rq()
|
|
|
|
* [L] ->on_rq
|
|
|
|
* RELEASE (rq->lock)
|
|
|
|
*
|
|
|
|
* If we observe the old cpu in task_rq_lock, the acquire of
|
|
|
|
* the old rq->lock will fully serialize against the stores.
|
|
|
|
*
|
|
|
|
* If we observe the new cpu in task_rq_lock, the acquire will
|
|
|
|
* pair with the WMB to ensure we must then also see migrating.
|
|
|
|
*/
|
|
|
|
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
|
2015-08-02 00:25:08 +07:00
|
|
|
rf->cookie = lockdep_pin_lock(&rq->lock);
|
2016-04-28 21:16:33 +07:00
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
raw_spin_unlock(&rq->lock);
|
2015-08-01 02:28:18 +07:00
|
|
|
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
|
2016-04-28 21:16:33 +07:00
|
|
|
|
|
|
|
while (unlikely(task_on_rq_migrating(p)))
|
|
|
|
cpu_relax();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-01-26 03:08:29 +07:00
|
|
|
#ifdef CONFIG_SCHED_HRTICK
|
|
|
|
/*
|
|
|
|
* Use HR-timers to deliver accurate preemption points.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static void hrtick_clear(struct rq *rq)
|
|
|
|
{
|
|
|
|
if (hrtimer_active(&rq->hrtick_timer))
|
|
|
|
hrtimer_cancel(&rq->hrtick_timer);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* High-resolution timer tick.
|
|
|
|
* Runs from hardirq context with interrupts disabled.
|
|
|
|
*/
|
|
|
|
static enum hrtimer_restart hrtick(struct hrtimer *timer)
|
|
|
|
{
|
|
|
|
struct rq *rq = container_of(timer, struct rq, hrtick_timer);
|
|
|
|
|
|
|
|
WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
|
|
|
|
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_lock(&rq->lock);
|
2008-05-03 23:29:28 +07:00
|
|
|
update_rq_clock(rq);
|
2008-01-26 03:08:29 +07:00
|
|
|
rq->curr->sched_class->task_tick(rq, rq->curr, 1);
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_unlock(&rq->lock);
|
2008-01-26 03:08:29 +07:00
|
|
|
|
|
|
|
return HRTIMER_NORESTART;
|
|
|
|
}
|
|
|
|
|
2008-05-11 07:25:33 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2013-06-28 16:18:53 +07:00
|
|
|
|
2015-04-15 04:09:05 +07:00
|
|
|
static void __hrtick_restart(struct rq *rq)
|
2013-06-28 16:18:53 +07:00
|
|
|
{
|
|
|
|
struct hrtimer *timer = &rq->hrtick_timer;
|
|
|
|
|
2015-04-15 04:09:05 +07:00
|
|
|
hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
|
2013-06-28 16:18:53 +07:00
|
|
|
}
|
|
|
|
|
2008-07-18 23:01:23 +07:00
|
|
|
/*
|
|
|
|
* called from hardirq (IPI) context
|
|
|
|
*/
|
|
|
|
static void __hrtick_start(void *arg)
|
2008-04-29 15:02:46 +07:00
|
|
|
{
|
2008-07-18 23:01:23 +07:00
|
|
|
struct rq *rq = arg;
|
2008-04-29 15:02:46 +07:00
|
|
|
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_lock(&rq->lock);
|
2013-06-28 16:18:53 +07:00
|
|
|
__hrtick_restart(rq);
|
2008-07-18 23:01:23 +07:00
|
|
|
rq->hrtick_csd_pending = 0;
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_unlock(&rq->lock);
|
2008-04-29 15:02:46 +07:00
|
|
|
}
|
|
|
|
|
2008-07-18 23:01:23 +07:00
|
|
|
/*
|
|
|
|
* Called to set the hrtick timer state.
|
|
|
|
*
|
|
|
|
* called with rq->lock held and irqs disabled
|
|
|
|
*/
|
2011-10-25 15:00:11 +07:00
|
|
|
void hrtick_start(struct rq *rq, u64 delay)
|
2008-04-29 15:02:46 +07:00
|
|
|
{
|
2008-07-18 23:01:23 +07:00
|
|
|
struct hrtimer *timer = &rq->hrtick_timer;
|
2014-08-26 10:15:41 +07:00
|
|
|
ktime_t time;
|
|
|
|
s64 delta;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't schedule slices shorter than 10000ns, that just
|
|
|
|
* doesn't make sense and can cause timer DoS.
|
|
|
|
*/
|
|
|
|
delta = max_t(s64, delay, 10000LL);
|
|
|
|
time = ktime_add_ns(timer->base->get_time(), delta);
|
2008-04-29 15:02:46 +07:00
|
|
|
|
2008-09-02 05:02:30 +07:00
|
|
|
hrtimer_set_expires(timer, time);
|
2008-07-18 23:01:23 +07:00
|
|
|
|
|
|
|
if (rq == this_rq()) {
|
2013-06-28 16:18:53 +07:00
|
|
|
__hrtick_restart(rq);
|
2008-07-18 23:01:23 +07:00
|
|
|
} else if (!rq->hrtick_csd_pending) {
|
2014-02-24 22:40:02 +07:00
|
|
|
smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
|
2008-07-18 23:01:23 +07:00
|
|
|
rq->hrtick_csd_pending = 1;
|
|
|
|
}
|
2008-04-29 15:02:46 +07:00
|
|
|
}
|
|
|
|
|
2008-07-18 23:01:23 +07:00
|
|
|
#else
|
|
|
|
/*
|
|
|
|
* Called to set the hrtick timer state.
|
|
|
|
*
|
|
|
|
* called with rq->lock held and irqs disabled
|
|
|
|
*/
|
2011-10-25 15:00:11 +07:00
|
|
|
void hrtick_start(struct rq *rq, u64 delay)
|
2008-07-18 23:01:23 +07:00
|
|
|
{
|
2014-11-26 07:44:06 +07:00
|
|
|
/*
|
|
|
|
* Don't schedule slices shorter than 10000ns, that just
|
|
|
|
* doesn't make sense. Rely on vruntime for fairness.
|
|
|
|
*/
|
|
|
|
delay = max_t(u64, delay, 10000LL);
|
2015-04-15 04:09:05 +07:00
|
|
|
hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
|
|
|
|
HRTIMER_MODE_REL_PINNED);
|
2008-07-18 23:01:23 +07:00
|
|
|
}
|
|
|
|
#endif /* CONFIG_SMP */
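Both variants of hrtick_start() above refuse to arm the timer for less than 10000 ns. The following is a minimal userspace sketch of only that clamp, mirroring the signed comparison of the SMP variant. The names `demo_hrtick_clamp_ns` and `DEMO_HRTICK_MIN_NS` are hypothetical and do not exist in the kernel.

```c
#include <stdint.h>

/* Hypothetical name for the 10000 ns floor used by hrtick_start(). */
#define DEMO_HRTICK_MIN_NS 10000LL

/*
 * Sketch of "delta = max_t(s64, delay, 10000LL)": the delay is compared
 * as a signed 64-bit value and never allowed below the floor.
 */
static int64_t demo_hrtick_clamp_ns(uint64_t delay)
{
	int64_t d = (int64_t)delay;

	return d > DEMO_HRTICK_MIN_NS ? d : DEMO_HRTICK_MIN_NS;
}
```

A caller would add the clamped value to the current time to get the absolute expiry, as the SMP variant does with ktime_add_ns().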
|
2008-01-26 03:08:29 +07:00
|
|
|
|
2008-07-18 23:01:23 +07:00
|
|
|
static void init_rq_hrtick(struct rq *rq)
|
2008-01-26 03:08:29 +07:00
|
|
|
{
|
2008-07-18 23:01:23 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
rq->hrtick_csd_pending = 0;
|
2008-01-26 03:08:29 +07:00
|
|
|
|
2008-07-18 23:01:23 +07:00
|
|
|
rq->hrtick_csd.flags = 0;
|
|
|
|
rq->hrtick_csd.func = __hrtick_start;
|
|
|
|
rq->hrtick_csd.info = rq;
|
|
|
|
#endif
|
2008-01-26 03:08:29 +07:00
|
|
|
|
2008-07-18 23:01:23 +07:00
|
|
|
hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
|
|
|
|
rq->hrtick_timer.function = hrtick;
|
2008-01-26 03:08:29 +07:00
|
|
|
}
|
2008-09-23 04:55:46 +07:00
|
|
|
#else /* CONFIG_SCHED_HRTICK */
|
2008-01-26 03:08:29 +07:00
|
|
|
static inline void hrtick_clear(struct rq *rq)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void init_rq_hrtick(struct rq *rq)
|
|
|
|
{
|
|
|
|
}
|
2008-09-23 04:55:46 +07:00
|
|
|
#endif /* CONFIG_SCHED_HRTICK */
|
2008-01-26 03:08:29 +07:00
|
|
|
|
2016-03-24 21:38:01 +07:00
|
|
|
/*
|
|
|
|
* cmpxchg based fetch_or, macro so it works for different integer types
|
|
|
|
*/
|
|
|
|
#define fetch_or(ptr, mask) \
|
|
|
|
({ \
|
|
|
|
typeof(ptr) _ptr = (ptr); \
|
|
|
|
typeof(mask) _mask = (mask); \
|
|
|
|
typeof(*_ptr) _old, _val = *_ptr; \
|
|
|
|
\
|
|
|
|
for (;;) { \
|
|
|
|
_old = cmpxchg(_ptr, _val, _val | _mask); \
|
|
|
|
if (_old == _val) \
|
|
|
|
break; \
|
|
|
|
_val = _old; \
|
|
|
|
} \
|
|
|
|
_old; \
|
|
|
|
})
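The fetch_or() macro above retries a cmpxchg until the OR lands on an unchanged value, then yields the value seen just before the update. A rough userspace analogue, assuming C11 atomics in place of the kernel's cmpxchg(), might look like the sketch below. `demo_fetch_or` is a hypothetical name, and C11 already offers atomic_fetch_or() for the same job; the loop is spelled out only to show the retry pattern.

```c
#include <stdatomic.h>

/* Hypothetical userspace analogue of the cmpxchg-based fetch_or() macro. */
static unsigned long demo_fetch_or(_Atomic unsigned long *ptr,
				   unsigned long mask)
{
	unsigned long val = atomic_load(ptr);

	/*
	 * On failure atomic_compare_exchange_weak() stores the current
	 * contents of *ptr into val, so each retry uses fresh data.
	 */
	while (!atomic_compare_exchange_weak(ptr, &val, val | mask))
		;

	return val;	/* value observed immediately before the OR */
}
```

Returning the old value is what lets a caller set one bit and test another in a single atomic step.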
|
|
|
|
|
2014-06-05 00:31:18 +07:00
|
|
|
#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
|
2014-04-09 20:35:08 +07:00
|
|
|
/*
|
|
|
|
* Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
|
|
|
|
* this avoids any races wrt polling state changes and thereby avoids
|
|
|
|
* spurious IPIs.
|
|
|
|
*/
|
|
|
|
static bool set_nr_and_not_polling(struct task_struct *p)
|
|
|
|
{
|
|
|
|
struct thread_info *ti = task_thread_info(p);
|
|
|
|
return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
|
|
|
|
}
|
2014-06-05 00:31:18 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
|
|
|
|
*
|
|
|
|
* If this returns true, then the idle task promises to call
|
|
|
|
* sched_ttwu_pending() and reschedule soon.
|
|
|
|
*/
|
|
|
|
static bool set_nr_if_polling(struct task_struct *p)
|
|
|
|
{
|
|
|
|
struct thread_info *ti = task_thread_info(p);
|
2015-04-29 03:00:20 +07:00
|
|
|
typeof(ti->flags) old, val = READ_ONCE(ti->flags);
|
2014-06-05 00:31:18 +07:00
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
if (!(val & _TIF_POLLING_NRFLAG))
|
|
|
|
return false;
|
|
|
|
if (val & _TIF_NEED_RESCHED)
|
|
|
|
return true;
|
|
|
|
old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
|
|
|
|
if (old == val)
|
|
|
|
break;
|
|
|
|
val = old;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2014-04-09 20:35:08 +07:00
|
|
|
#else
|
|
|
|
static bool set_nr_and_not_polling(struct task_struct *p)
|
|
|
|
{
|
|
|
|
set_tsk_need_resched(p);
|
|
|
|
return true;
|
|
|
|
}
|
2014-06-05 00:31:18 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
static bool set_nr_if_polling(struct task_struct *p)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
2014-04-09 20:35:08 +07:00
|
|
|
#endif
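The two helpers above differ in when they set the reschedule bit: set_nr_and_not_polling() always sets it and reports whether an IPI is still needed, while set_nr_if_polling() sets it only if the target is polling. A userspace sketch of both decisions follows, assuming C11 atomics and two made-up flag bits. `DEMO_NEED_RESCHED`, `DEMO_POLLING` and the `demo_*` functions are hypothetical stand-ins, not the kernel's TIF_* definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in flag bits; the real TIF_* values are architecture specific. */
#define DEMO_NEED_RESCHED (1UL << 0)
#define DEMO_POLLING      (1UL << 1)

/* Always set NEED_RESCHED; true means the target was not polling. */
static bool demo_set_nr_and_not_polling(_Atomic unsigned long *flags)
{
	unsigned long old = atomic_fetch_or(flags, DEMO_NEED_RESCHED);

	return !(old & DEMO_POLLING);
}

/* Set NEED_RESCHED only while POLLING is set; true means no IPI needed. */
static bool demo_set_nr_if_polling(_Atomic unsigned long *flags)
{
	unsigned long val = atomic_load(flags);

	for (;;) {
		if (!(val & DEMO_POLLING))
			return false;
		if (val & DEMO_NEED_RESCHED)
			return true;
		/* On failure val is refreshed and the checks run again. */
		if (atomic_compare_exchange_weak(flags, &val,
						 val | DEMO_NEED_RESCHED))
			return true;
	}
}
```

In this sketch the conditional variant leaves the flags untouched when the target is not polling, which matches the early `return false` in the code above.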
|
|
|
|
|
2015-05-01 22:27:50 +07:00
|
|
|
void wake_q_add(struct wake_q_head *head, struct task_struct *task)
|
|
|
|
{
|
|
|
|
struct wake_q_node *node = &task->wake_q;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Atomically grab the task, if ->wake_q is !nil already it means
|
|
|
|
* it's already queued (either by us or someone else) and will get the
|
|
|
|
* wakeup due to that.
|
|
|
|
*
|
|
|
|
* This cmpxchg() implies a full barrier, which pairs with the write
|
2016-05-09 10:58:10 +07:00
|
|
|
* barrier implied by the wakeup in wake_up_q().
|
2015-05-01 22:27:50 +07:00
|
|
|
*/
|
|
|
|
if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
|
|
|
|
return;
|
|
|
|
|
|
|
|
get_task_struct(task);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The head is context local, there can be no concurrency.
|
|
|
|
*/
|
|
|
|
*head->lastp = node;
|
|
|
|
head->lastp = &node->next;
|
|
|
|
}
|
|
|
|
|
|
|
|
void wake_up_q(struct wake_q_head *head)
|
|
|
|
{
|
|
|
|
struct wake_q_node *node = head->first;
|
|
|
|
|
|
|
|
while (node != WAKE_Q_TAIL) {
|
|
|
|
struct task_struct *task;
|
|
|
|
|
|
|
|
task = container_of(node, struct task_struct, wake_q);
|
|
|
|
BUG_ON(!task);
|
|
|
|
/* task can safely be re-inserted now */
|
|
|
|
node = node->next;
|
|
|
|
task->wake_q.next = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* wake_up_process() implies a wmb() to pair with the queueing
|
|
|
|
* in wake_q_add() so as not to miss wakeups.
|
|
|
|
*/
|
|
|
|
wake_up_process(task);
|
|
|
|
put_task_struct(task);
|
|
|
|
}
|
|
|
|
}
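wake_q_add() and wake_up_q() above build and drain a singly linked list that ends in a sentinel rather than NULL, so a NULL ->next can mean "not queued anywhere". The single-threaded sketch below shows only that list discipline. It assumes a head initialised with first pointing at the sentinel and lastp pointing at first, and it replaces the cmpxchg() with a plain test. All `demo_*` names are hypothetical.

```c
#include <stddef.h>

struct demo_node {
	struct demo_node *next;	/* NULL means "not on any queue" */
	int woken;
};

/* Sentinel marking the end of the list, distinct from NULL. */
#define DEMO_TAIL ((struct demo_node *)0x01)

struct demo_head {
	struct demo_node *first;
	struct demo_node **lastp;
};

static void demo_head_init(struct demo_head *h)
{
	h->first = DEMO_TAIL;
	h->lastp = &h->first;
}

/* Returns 1 if the node was queued, 0 if it was already on a queue. */
static int demo_add(struct demo_head *h, struct demo_node *n)
{
	if (n->next != NULL)	/* the kernel claims the node with cmpxchg() */
		return 0;
	n->next = DEMO_TAIL;

	*h->lastp = n;		/* append at the tail */
	h->lastp = &n->next;
	return 1;
}

/* Drain the list; returns how many nodes were "woken". */
static int demo_wake_all(struct demo_head *h)
{
	struct demo_node *n = h->first;
	int count = 0;

	while (n != DEMO_TAIL) {
		struct demo_node *cur = n;

		n = n->next;
		cur->next = NULL;	/* node may be queued again from here on */
		cur->woken = 1;		/* stands in for wake_up_process() */
		count++;
	}
	return count;
}
```

The sketch drops the reference counting (get_task_struct()/put_task_struct()) that keeps the task alive between queueing and wakeup.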
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
2014-06-29 03:03:57 +07:00
|
|
|
* resched_curr - mark rq's current task 'to be rescheduled now'.
|
2007-07-09 23:51:59 +07:00
|
|
|
*
|
|
|
|
* On UP this means the setting of the need_resched flag, on SMP it
|
|
|
|
* might also involve a cross-CPU call to trigger the scheduler on
|
|
|
|
* the target CPU.
|
|
|
|
*/
|
2014-06-29 03:03:57 +07:00
|
|
|
void resched_curr(struct rq *rq)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
2014-06-29 03:03:57 +07:00
|
|
|
struct task_struct *curr = rq->curr;
|
2007-07-09 23:51:59 +07:00
|
|
|
int cpu;
|
|
|
|
|
2014-06-29 03:03:57 +07:00
|
|
|
lockdep_assert_held(&rq->lock);
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2014-06-29 03:03:57 +07:00
|
|
|
if (test_tsk_need_resched(curr))
|
2007-07-09 23:51:59 +07:00
|
|
|
return;
|
|
|
|
|
2014-06-29 03:03:57 +07:00
|
|
|
cpu = cpu_of(rq);
|
2014-04-09 20:35:08 +07:00
|
|
|
|
2013-08-14 19:55:31 +07:00
|
|
|
if (cpu == smp_processor_id()) {
|
2014-06-29 03:03:57 +07:00
|
|
|
set_tsk_need_resched(curr);
|
2013-08-14 19:55:31 +07:00
|
|
|
set_preempt_need_resched();
|
2007-07-09 23:51:59 +07:00
|
|
|
return;
|
2013-08-14 19:55:31 +07:00
|
|
|
}
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2014-06-29 03:03:57 +07:00
|
|
|
if (set_nr_and_not_polling(curr))
|
2007-07-09 23:51:59 +07:00
|
|
|
smp_send_reschedule(cpu);
|
2014-06-05 00:31:15 +07:00
|
|
|
else
|
|
|
|
trace_sched_wake_idle_without_ipi(cpu);
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
void resched_cpu(int cpu)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
unsigned long flags;
|
|
|
|
|
2009-11-17 20:28:38 +07:00
|
|
|
if (!raw_spin_trylock_irqsave(&rq->lock, flags))
|
2007-07-09 23:51:59 +07:00
|
|
|
return;
|
2014-06-29 03:03:57 +07:00
|
|
|
resched_curr(rq);
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_unlock_irqrestore(&rq->lock, flags);
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
2008-03-22 15:20:24 +07:00
|
|
|
|
2013-09-17 14:30:55 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2011-08-11 04:21:01 +07:00
|
|
|
#ifdef CONFIG_NO_HZ_COMMON
|
2010-05-22 07:09:41 +07:00
|
|
|
/*
|
|
|
|
* In the semi idle case, use the nearest busy cpu for migrating timers
|
|
|
|
* from an idle cpu. This is good for power-savings.
|
|
|
|
*
|
|
|
|
* We don't do similar optimization for completely idle system, as
|
|
|
|
* selecting an idle cpu will add more delays to the timers than intended
|
|
|
|
* (as that CPU's timer base may not be up to date wrt jiffies etc).
|
|
|
|
*/
|
2015-05-27 05:50:33 +07:00
|
|
|
int get_nohz_timer_target(void)
|
2010-05-22 07:09:41 +07:00
|
|
|
{
|
2015-05-27 05:50:33 +07:00
|
|
|
int i, cpu = smp_processor_id();
|
2010-05-22 07:09:41 +07:00
|
|
|
struct sched_domain *sd;
|
|
|
|
|
2015-09-01 21:50:59 +07:00
|
|
|
if (!idle_cpu(cpu) && is_housekeeping_cpu(cpu))
|
2014-03-18 17:56:07 +07:00
|
|
|
return cpu;
|
|
|
|
|
2011-04-18 16:24:34 +07:00
|
|
|
rcu_read_lock();
|
2010-05-22 07:09:41 +07:00
|
|
|
for_each_domain(cpu, sd) {
|
2011-04-18 16:24:34 +07:00
|
|
|
for_each_cpu(i, sched_domain_span(sd)) {
|
sched/nohz: Fix affine unpinned timers mess
The following commit:
9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")
intended to affine unpinned timers to housekeepers:
unpinned timers(full dynticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(full dynticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(housekeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself)
However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
intention to:
unpinned timers(full dynticks, idle) => any housekeepers(no matter cpu topology)
unpinned timers(full dynticks, busy) => any housekeepers(no matter cpu topology)
unpinned timers(housekeepers, idle) => any busy cpus(otherwise, fallback to any housekeepers)
This patch fixes it by checking if there are busy housekeepers nearby,
otherwise it falls back to any housekeepers/itself. After the patch:
unpinned timers(full dynticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(full dynticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(housekeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Fixed the changelog. ]
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 'commit 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")'
Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-04 13:45:34 +07:00
|
|
|
if (cpu == i)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!idle_cpu(i) && is_housekeeping_cpu(i)) {
|
2011-04-18 16:24:34 +07:00
|
|
|
cpu = i;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
}
|
2010-05-22 07:09:41 +07:00
|
|
|
}
|
2015-09-01 21:50:59 +07:00
|
|
|
|
|
|
|
if (!is_housekeeping_cpu(cpu))
|
|
|
|
cpu = housekeeping_any_cpu();
|
2011-04-18 16:24:34 +07:00
|
|
|
unlock:
|
|
|
|
rcu_read_unlock();
|
2010-05-22 07:09:41 +07:00
|
|
|
return cpu;
|
|
|
|
}
|
2008-03-22 15:20:24 +07:00
|
|
|
/*
|
|
|
|
* When add_timer_on() enqueues a timer into the timer wheel of an
|
|
|
|
* idle CPU then this timer might expire before the next timer event
|
|
|
|
* which is scheduled to wake up that CPU. In case of a completely
|
|
|
|
* idle system the next event might even be infinite time into the
|
|
|
|
* future. wake_up_idle_cpu() ensures that the CPU is woken up and
|
|
|
|
* leaves the inner idle loop so the newly added timer is taken into
|
|
|
|
* account when the CPU goes back to idle and evaluates the timer
|
|
|
|
* wheel for the next timer event.
|
|
|
|
*/
|
2011-08-11 04:21:01 +07:00
|
|
|
static void wake_up_idle_cpu(int cpu)
|
2008-03-22 15:20:24 +07:00
|
|
|
{
|
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
|
|
|
|
if (cpu == smp_processor_id())
|
|
|
|
return;
|
|
|
|
|
2014-06-05 00:31:17 +07:00
|
|
|
if (set_nr_and_not_polling(rq->idle))
|
2008-03-22 15:20:24 +07:00
|
|
|
smp_send_reschedule(cpu);
|
2014-06-05 00:31:15 +07:00
|
|
|
else
|
|
|
|
trace_sched_wake_idle_without_ipi(cpu);
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
2013-04-12 21:45:34 +07:00
|
|
|
static bool wake_up_full_nohz_cpu(int cpu)
|
2011-08-11 04:21:01 +07:00
|
|
|
{
|
2014-06-04 21:20:21 +07:00
|
|
|
/*
|
|
|
|
* We just need the target to call irq_exit() and re-evaluate
|
|
|
|
* the next tick. The nohz full kick at least implies that.
|
|
|
|
* If needed we can still optimize that later with an
|
|
|
|
* empty IRQ.
|
|
|
|
*/
|
2013-04-12 21:45:34 +07:00
|
|
|
if (tick_nohz_full_cpu(cpu)) {
|
2011-08-11 04:21:01 +07:00
|
|
|
if (cpu != smp_processor_id() ||
|
|
|
|
tick_nohz_tick_stopped())
|
2014-06-04 21:20:21 +07:00
|
|
|
tick_nohz_full_kick_cpu(cpu);
|
2011-08-11 04:21:01 +07:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
void wake_up_nohz_cpu(int cpu)
|
|
|
|
{
|
2013-04-12 21:45:34 +07:00
|
|
|
if (!wake_up_full_nohz_cpu(cpu))
|
2011-08-11 04:21:01 +07:00
|
|
|
wake_up_idle_cpu(cpu);
|
|
|
|
}
|
|
|
|
|
2011-10-04 05:09:00 +07:00
|
|
|
static inline bool got_nohz_idle_kick(void)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
2011-12-02 08:07:32 +07:00
|
|
|
int cpu = smp_processor_id();
|
2013-06-05 15:13:11 +07:00
|
|
|
|
|
|
|
if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (idle_cpu(cpu) && !need_resched())
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can't run Idle Load Balance on this CPU for this time so we
|
|
|
|
* cancel it and clear NOHZ_BALANCE_KICK
|
|
|
|
*/
|
|
|
|
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
|
|
|
|
return false;
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
2011-08-11 04:21:01 +07:00
|
|
|
#else /* CONFIG_NO_HZ_COMMON */
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2011-10-04 05:09:00 +07:00
|
|
|
static inline bool got_nohz_idle_kick(void)
|
2010-11-16 06:47:00 +07:00
|
|
|
{
|
2011-10-04 05:09:00 +07:00
|
|
|
return false;
|
2010-11-16 06:47:00 +07:00
|
|
|
}
|
|
|
|
|
2011-08-11 04:21:01 +07:00
|
|
|
#endif /* CONFIG_NO_HZ_COMMON */
|
2007-12-03 02:04:49 +07:00
|
|
|
|
2013-04-20 20:15:35 +07:00
|
|
|
#ifdef CONFIG_NO_HZ_FULL
|
2015-07-18 03:25:49 +07:00
|
|
|
bool sched_can_stop_tick(struct rq *rq)
|
2013-04-20 20:15:35 +07:00
|
|
|
{
|
2015-07-18 03:25:49 +07:00
|
|
|
int fifo_nr_running;
|
|
|
|
|
|
|
|
/* Deadline tasks, even if single, need the tick */
|
|
|
|
if (rq->dl.dl_nr_running)
|
|
|
|
return false;
|
|
|
|
|
2015-02-17 03:23:49 +07:00
|
|
|
/*
|
2016-04-21 23:03:15 +07:00
|
|
|
* If there is more than one RR task, we need the tick to effect the
|
|
|
|
* actual RR behaviour.
|
2015-02-17 03:23:49 +07:00
|
|
|
*/
|
2015-07-18 03:25:49 +07:00
|
|
|
if (rq->rt.rr_nr_running) {
|
|
|
|
if (rq->rt.rr_nr_running == 1)
|
|
|
|
return true;
|
|
|
|
else
|
|
|
|
return false;
|
2015-02-17 03:23:49 +07:00
|
|
|
}
|
|
|
|
|
2016-04-21 23:03:15 +07:00
|
|
|
/*
|
|
|
|
* If there are no RR tasks, but FIFO tasks, we can skip the tick, no
|
|
|
|
* forced preemption between FIFO tasks.
|
|
|
|
*/
|
|
|
|
fifo_nr_running = rq->rt.rt_nr_running - rq->rt.rr_nr_running;
|
|
|
|
if (fifo_nr_running)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
|
|
|
|
* if there's more than one we need the tick for involuntary
|
|
|
|
* preemption.
|
|
|
|
*/
|
|
|
|
if (rq->nr_running > 1)
|
2014-06-24 15:34:12 +07:00
|
|
|
return false;
|
2013-04-20 20:15:35 +07:00
|
|
|
|
2014-06-24 15:34:12 +07:00
|
|
|
return true;
|
2013-04-20 20:15:35 +07:00
|
|
|
}
|
|
|
|
#endif /* CONFIG_NO_HZ_FULL */
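sched_can_stop_tick() above is a pure decision over a few per-runqueue counters, checked in class order: deadline, round-robin, FIFO, then everything else. A standalone sketch of the same decision table is given below. The struct `demo_rq_counts` and the function name are hypothetical, and the fields only imitate the counters read from struct rq.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters the real function reads. */
struct demo_rq_counts {
	unsigned int dl_nr_running;	/* deadline tasks */
	unsigned int rt_nr_running;	/* all RT tasks (RR + FIFO) */
	unsigned int rr_nr_running;	/* round-robin tasks only */
	unsigned int nr_running;	/* every runnable task */
};

static bool demo_can_stop_tick(const struct demo_rq_counts *c)
{
	/* Deadline tasks always need the tick. */
	if (c->dl_nr_running)
		return false;

	/* Round-robin needs the tick only to rotate between several tasks. */
	if (c->rr_nr_running)
		return c->rr_nr_running == 1;

	/* FIFO tasks are never preempted by the tick. */
	if (c->rt_nr_running - c->rr_nr_running)
		return true;

	/* Only fair-class tasks are left: more than one needs the tick. */
	return c->nr_running <= 1;
}
```

Note that once a single RR task is present the answer is true regardless of how many fair-class tasks are queued, which follows from the early return in the code above.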
|
2007-12-03 02:04:49 +07:00
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
void sched_avg_update(struct rq *rq)
|
2008-04-20 00:45:00 +07:00
|
|
|
{
|
2009-09-01 15:34:37 +07:00
|
|
|
s64 period = sched_avg_period();
|
|
|
|
|
2013-04-12 06:51:02 +07:00
|
|
|
while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
|
2010-05-25 02:11:43 +07:00
|
|
|
/*
|
|
|
|
* Inline assembly required to prevent the compiler
|
|
|
|
* optimising this loop into a divmod call.
|
|
|
|
* See __iter_div_u64_rem() for another example of this.
|
|
|
|
*/
|
|
|
|
asm("" : "+rm" (rq->age_stamp));
|
2009-09-01 15:34:37 +07:00
|
|
|
rq->age_stamp += period;
|
|
|
|
rq->rt_avg /= 2;
|
|
|
|
}
|
2008-04-20 00:45:00 +07:00
|
|
|
}
|
|
|
|
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_SMP */
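sched_avg_update() above ages rt_avg by halving it once for every full period that has elapsed since age_stamp. The sketch below reproduces that loop in userspace with the clock and period passed in as parameters. `demo_avg` and `demo_avg_update` are hypothetical names, the period is assumed to be positive, and the empty asm statement that stops the compiler from turning the loop into a division is left out.

```c
#include <stdint.h>

struct demo_avg {
	uint64_t age_stamp;	/* time of the last completed period */
	uint64_t rt_avg;	/* value being decayed */
};

/* Halve rt_avg once per elapsed period; period must be positive. */
static void demo_avg_update(struct demo_avg *a, uint64_t now, int64_t period)
{
	while ((int64_t)(now - a->age_stamp) > period) {
		a->age_stamp += period;
		a->rt_avg /= 2;
	}
}
```

The signed cast keeps the comparison well behaved if `now` is slightly behind `age_stamp`, in which case the loop body does not run.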
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2011-07-21 23:43:29 +07:00
|
|
|
#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
|
|
|
|
(defined(CONFIG_SMP) || defined(CONFIG_CFS_BANDWIDTH)))
|
2008-06-27 18:41:14 +07:00
|
|
|
/*
|
2011-07-21 23:43:35 +07:00
|
|
|
* Iterate task_group tree rooted at *from, calling @down when first entering a
|
|
|
|
* node and @up when leaving it for the final time.
|
|
|
|
*
|
|
|
|
* Caller must hold rcu_lock or sufficient equivalent.
|
2008-06-27 18:41:14 +07:00
|
|
|
*/
|
2011-10-25 15:00:11 +07:00
|
|
|
int walk_tg_tree_from(struct task_group *from,
|
2011-07-21 23:43:35 +07:00
|
|
|
tg_visitor down, tg_visitor up, void *data)
|
2008-06-27 18:41:14 +07:00
|
|
|
{
|
|
|
|
struct task_group *parent, *child;
|
2008-08-19 17:33:05 +07:00
|
|
|
int ret;
|
2008-06-27 18:41:14 +07:00
|
|
|
|
2011-07-21 23:43:35 +07:00
|
|
|
parent = from;
|
|
|
|
|
2008-06-27 18:41:14 +07:00
|
|
|
down:
|
2008-08-19 17:33:05 +07:00
|
|
|
ret = (*down)(parent, data);
|
|
|
|
if (ret)
|
2011-07-21 23:43:35 +07:00
|
|
|
goto out;
|
2008-06-27 18:41:14 +07:00
|
|
|
list_for_each_entry_rcu(child, &parent->children, siblings) {
|
|
|
|
parent = child;
|
|
|
|
goto down;
|
|
|
|
|
|
|
|
up:
|
|
|
|
continue;
|
|
|
|
}
|
2008-08-19 17:33:05 +07:00
|
|
|
ret = (*up)(parent, data);
|
2011-07-21 23:43:35 +07:00
|
|
|
if (ret || parent == from)
|
|
|
|
goto out;
|
2008-06-27 18:41:14 +07:00
|
|
|
|
|
|
|
child = parent;
|
|
|
|
parent = parent->parent;
|
|
|
|
if (parent)
|
|
|
|
goto up;
|
2011-07-21 23:43:35 +07:00
|
|
|
out:
|
2008-08-19 17:33:05 +07:00
|
|
|
return ret;
|
2008-06-27 18:41:14 +07:00
|
|
|
}
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
int tg_nop(struct task_group *tg, void *data)
|
2008-08-19 17:33:05 +07:00
|
|
|
{
|
2011-08-01 16:03:28 +07:00
|
|
|
return 0;
|
2008-08-19 17:33:05 +07:00
|
|
|
}
|
2008-04-20 00:45:00 +07:00
|
|
|
#endif
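walk_tg_tree_from() above visits a tree without recursion: @down runs when a node is first entered, @up when it is left for the last time, and a non-zero return from either stops the walk. The sketch below expresses the same visiting order on a hypothetical first-child/next-sibling tree instead of the kernel's RCU-protected children list, and uses nested loops instead of jumping into the list iteration. Every `demo_*` name is an assumption made for illustration.

```c
#include <stddef.h>

struct demo_tg {
	struct demo_tg *parent;
	struct demo_tg *first_child;
	struct demo_tg *next_sibling;
	int id;
};

typedef int (*demo_visitor)(struct demo_tg *tg, void *data);

/* Walk the subtree rooted at @from; stop early on a non-zero return. */
static int demo_walk_from(struct demo_tg *from, demo_visitor down,
			  demo_visitor up, void *data)
{
	struct demo_tg *node = from;
	int ret;

	for (;;) {
		ret = down(node, data);		/* first entry into node */
		if (ret)
			return ret;

		if (node->first_child) {
			node = node->first_child;
			continue;
		}

		/* Subtree done: climb until a sibling exists or we hit @from. */
		for (;;) {
			ret = up(node, data);	/* leaving node for good */
			if (ret || node == from)
				return ret;
			if (node->next_sibling) {
				node = node->next_sibling;
				break;
			}
			node = node->parent;
		}
	}
}

/* Recording visitors used to observe the order: +id on down, -id on up. */
struct demo_trace {
	int ids[16];
	int n;
};

static int demo_record_down(struct demo_tg *tg, void *data)
{
	struct demo_trace *t = data;

	t->ids[t->n++] = tg->id;
	return 0;
}

static int demo_record_up(struct demo_tg *tg, void *data)
{
	struct demo_trace *t = data;

	t->ids[t->n++] = -tg->id;
	return 0;
}
```

For a root 1 with children 2 and 3, where 2 has a child 4, the recorded order is 1, 2, 4, -4, -2, 3, -3, -1. Passing a no-op visitor such as tg_nop() for one direction gives a pure pre-order or post-order walk.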
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
static void set_load_weight(struct task_struct *p)
|
|
|
|
{
|
2011-05-19 00:09:38 +07:00
|
|
|
int prio = p->static_prio - MAX_RT_PRIO;
|
|
|
|
struct load_weight *load = &p->se.load;
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
|
|
|
* SCHED_IDLE tasks get minimal weight:
|
|
|
|
*/
|
2015-09-09 22:00:41 +07:00
|
|
|
if (idle_policy(p->policy)) {
|
sched: Increase SCHED_LOAD_SCALE resolution
Introduce SCHED_LOAD_RESOLUTION, which is added to
SCHED_LOAD_SHIFT and increases the resolution of
SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all
sched entities by a factor of 1024. With this extra resolution,
we can handle deeper cgroup hierarchies and the scheduler can do
better shares distribution and load balancing on larger
systems (especially for low weight task groups).
This does not change the existing user interface, the scaled
weights are only used internally. We do not modify
prio_to_weight values or inverses, but use the original weights
when calculating the inverse which is used to scale execution
time delta in calc_delta_mine(). This ensures we do not lose
accuracy when accounting time to the sched entities. Thanks to
Nikunj Dadhania for fixing a bug in c_d_m() that broke fairness.
Below is some analysis of the performance costs/improvements of
this patch.
1. Micro-arch performance costs:
Experiment was to run Ingo's pipe_test_100k 200 times with the
task pinned to one cpu. I measured instruction, cycles and
stalled-cycles for the runs. See:
http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389
for more info.
-tip (baseline):
Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):
964,991,769 instructions # 0.82 insns per cycle
# 0.33 stalled cycles per insn
# ( +- 0.05% )
1,171,186,635 cycles # 0.000 GHz ( +- 0.08% )
306,373,664 stalled-cycles-backend # 26.16% backend cycles idle ( +- 0.28% )
314,933,621 stalled-cycles-frontend # 26.89% frontend cycles idle ( +- 0.34% )
1.122405684 seconds time elapsed ( +- 0.05% )
-tip+patches:
Performance counter stats for './load-scale/pipe-test-100k' (200 runs):
963,624,821 instructions # 0.82 insns per cycle
# 0.33 stalled cycles per insn
# ( +- 0.04% )
1,175,215,649 cycles # 0.000 GHz ( +- 0.08% )
315,321,126 stalled-cycles-backend # 26.83% backend cycles idle ( +- 0.28% )
316,835,873 stalled-cycles-frontend # 26.96% frontend cycles idle ( +- 0.29% )
1.122238659 seconds time elapsed ( +- 0.06% )
With this patch, instructions decrease by ~0.10% and cycles
increase by 0.27%. This doesn't look statistically significant.
The number of stalled cycles in the backend increased from
26.16% to 26.83%. This can be attributed to the shifts we do in
c_d_m() and other places. The fraction of stalled cycles in the
frontend remains about the same, at 26.96% compared to 26.89% in -tip.
2. Balancing low-weight task groups
Test setup: run 50 tasks with random sleep/busy times (biased
around 100ms) in a low weight container (with cpu.shares = 2).
Measure %idle as reported by mpstat over a 10s window.
-tip (baseline):
06:47:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle intr/s
06:47:49 PM all 94.32 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.62 15888.00
06:47:50 PM all 94.57 0.00 0.62 0.00 0.00 0.00 0.00 0.00 4.81 16180.00
06:47:51 PM all 94.69 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.25 15966.00
06:47:52 PM all 95.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.19 16053.00
06:47:53 PM all 94.88 0.06 0.00 0.00 0.00 0.00 0.00 0.00 5.06 15984.00
06:47:54 PM all 93.31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.69 15806.00
06:47:55 PM all 94.19 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.75 15896.00
06:47:56 PM all 92.87 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.13 15716.00
06:47:57 PM all 94.88 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.12 15982.00
06:47:58 PM all 95.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.56 16075.00
Average: all 94.49 0.01 0.08 0.00 0.00 0.00 0.00 0.00 5.42 15954.60
-tip+patches:
06:47:03 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle intr/s
06:47:04 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16630.00
06:47:05 PM all 99.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.31 16580.20
06:47:06 PM all 99.69 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.25 16596.00
06:47:07 PM all 99.20 0.00 0.74 0.00 0.00 0.06 0.00 0.00 0.00 17838.61
06:47:08 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16540.00
06:47:09 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16575.00
06:47:10 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16614.00
06:47:11 PM all 99.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 16588.00
06:47:12 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16593.00
06:47:13 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16551.00
Average: all 99.84 0.00 0.09 0.00 0.00 0.01 0.00 0.00 0.06 16711.58
We see an improvement in idle% on the system (drops from 5.42%
on -tip to 0.06% with the patches).
Signed-off-by: Nikhil Rao <ncrao@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1305754668-18792-1-git-send-email-ncrao@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-19 04:37:48 +07:00
|
|
|
load->weight = scale_load(WEIGHT_IDLEPRIO);
|
2011-05-19 00:09:38 +07:00
|
|
|
load->inv_weight = WMULT_IDLEPRIO;
|
2007-07-09 23:51:59 +07:00
|
|
|
return;
|
|
|
|
}
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2015-11-30 11:59:43 +07:00
|
|
|
load->weight = scale_load(sched_prio_to_weight[prio]);
|
|
|
|
load->inv_weight = sched_prio_to_wmult[prio];
|
2007-07-09 23:51:59 +07:00
|
|
|
}
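set_load_weight() above stores each weight together with a precomputed inverse so that later code can multiply and shift instead of dividing. The sketch below illustrates that idea under the assumption that the inverse is 2^32 divided by the weight, using a nice-0 weight of 1024 and an idle weight of 3 as example inputs. The `demo_*` helpers are hypothetical and are not the kernel's own calculation routines.

```c
#include <stdint.h>

/* Assumed relationship: inv_weight = 2^32 / weight (weight must be >= 2). */
static uint32_t demo_inv_weight(uint32_t weight)
{
	return (uint32_t)((1ULL << 32) / weight);
}

/* Approximate delta / weight with a multiply and a shift. */
static uint64_t demo_scale(uint64_t delta, uint32_t inv_weight)
{
	return (delta * inv_weight) >> 32;
}
```

With a weight of 1024 the inverse is 4194304, and scaling a delta of 1000000 gives 976, the same as integer division by 1024. Very large deltas would overflow the 64-bit product in this simplified form.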
|
|
|
|
|
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
Mike Meyer reported the following bug:
> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected. Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
> a. thread is running.
> b. thread moves between cgroups, changes scheduling class or priority.
> c. thread sleeps OR
> d. thread involuntarily gives up cpu.
>
> a. implies:
>
> thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
> 1. dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
> delta = 0
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> 2. enqueue_task(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* thread is still on cpu at this point. */
> thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
> dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
>
> /* delta is execution time not run_delay. */
> delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> Since thread was running between enqueue_task(rq, thread) and
> dequeue_task(rq, thread), the delta above is really execution
> time and not run_delay.
>
> d. results in:
>
> __sched_info_switch(thread, next_thread)
>
> sched_info_depart(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* last_queued not updated due to being non-zero */
> return
>
> Since thread was running between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread), the execution time
> between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread) now will become
> associated with run_delay due to when last_queued was last updated.
>
This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.
By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.
It even shrinks the code (due to inlining {en,de}queue_task()):
$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
text data bss dec hex filename
64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o
64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig
Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-30 22:44:13 +07:00
|
|
|
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
|
2008-06-28 03:30:00 +07:00
|
|
|
{
|
2010-03-11 23:16:20 +07:00
|
|
|
update_rq_clock(rq);
|
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
Mike Meyer reported the following bug:
> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected. Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
> a. thread is running.
> b. thread moves beween cgroups, changes scheduling class or priority.
> c. thread sleeps OR
> d. thread involuntarily gives up cpu.
>
> a. implies:
>
> thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
> 1. dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
> delta = 0
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> 2. enqueue_task(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* thread is still on cpu at this point. */
> thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
> dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
>
> /* delta is execution time not run_delay. */
> delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> Since thread was running between enqueue_task(rq, thread) and
> dequeue_task(rq, thread), the delta above is really execution
> time and not run_delay.
>
> d. results in:
>
> __sched_info_switch(thread, next_thread)
>
> sched_info_depart(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* last_queued not updated due to being non-zero */
> return
>
> Since thread was running between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread), the execution time
> between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread) now will become
> associated with run_delay due to when last_queued was last updated.
>
This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.
By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.
It even shrinks the code (due to inlining {en,de}queue_task()):
$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
text data bss dec hex filename
64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o
64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig
Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
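The leak described in the report can be reproduced outside the kernel with a small model. The following is a user-space sketch, not kernel code: `struct task_model`, `model_enqueue()`, `model_dequeue()` and `MODEL_SAVE_RESTORE` are hypothetical stand-ins for `sched_info`, `{en,de}queue_task()` and the `DEQUEUE_SAVE`/`ENQUEUE_RESTORE` flags, and the timestamps are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for DEQUEUE_SAVE / ENQUEUE_RESTORE. */
#define MODEL_SAVE_RESTORE 0x1

/* Minimal model of the two sched_info fields the report discusses. */
struct task_model {
	uint64_t last_queued;	/* 0 means "no wait is being timed" */
	uint64_t run_delay;	/* accumulated time spent waiting to run */
};

/* Models sched_info_queued(): stamp only if no stamp is pending. */
static void model_enqueue(struct task_model *t, uint64_t now, int flags)
{
	if (flags & MODEL_SAVE_RESTORE)
		return;		/* the fix: leave sched_info state alone */
	if (!t->last_queued)
		t->last_queued = now;
}

/* Models sched_info_dequeued(): charge the wait and clear the stamp. */
static void model_dequeue(struct task_model *t, uint64_t now, int flags)
{
	uint64_t delta = 0;

	if (flags & MODEL_SAVE_RESTORE)
		return;
	if (t->last_queued)
		delta = now - t->last_queued;
	t->last_queued = 0;
	t->run_delay += delta;
}

/*
 * A running task changes priority at t=100 (dequeue + enqueue while it
 * is still on the CPU) and goes to sleep at t=250. Returns the
 * run_delay that ends up charged to the task.
 */
uint64_t model_prio_change_then_sleep(int flags)
{
	struct task_model t = { 0, 0 };

	model_dequeue(&t, 100, flags);	/* step b.1 */
	model_enqueue(&t, 100, flags);	/* step b.2 */
	model_dequeue(&t, 250, 0);	/* step c: ordinary sleep */
	return t.run_delay;
}
```

Calling `model_prio_change_then_sleep(0)` charges the 150 units the task spent executing to `run_delay`, while passing `MODEL_SAVE_RESTORE` leaves `run_delay` at 0, which is the behaviour the patch aims for.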
2015-09-30 22:44:13 +07:00
|
|
|
if (!(flags & ENQUEUE_RESTORE))
|
|
|
|
sched_info_queued(rq, p);
|
2010-03-24 22:38:48 +07:00
|
|
|
p->sched_class->enqueue_task(rq, p, flags);
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
2015-09-30 22:44:13 +07:00
|
|
|
static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
2010-03-11 23:16:20 +07:00
|
|
|
update_rq_clock(rq);
|
2015-09-30 22:44:13 +07:00
|
|
|
if (!(flags & DEQUEUE_SAVE))
|
|
|
|
sched_info_dequeued(rq, p);
|
2010-03-24 22:38:48 +07:00
|
|
|
p->sched_class->dequeue_task(rq, p, flags);
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
void activate_task(struct rq *rq, struct task_struct *p, int flags)
|
2009-12-17 23:00:43 +07:00
|
|
|
{
|
|
|
|
if (task_contributes_to_load(p))
|
|
|
|
rq->nr_uninterruptible--;
|
|
|
|
|
2010-03-24 22:38:48 +07:00
|
|
|
enqueue_task(rq, p, flags);
|
2009-12-17 23:00:43 +07:00
|
|
|
}
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
|
2009-12-17 23:00:43 +07:00
|
|
|
{
|
|
|
|
if (task_contributes_to_load(p))
|
|
|
|
rq->nr_uninterruptible++;
|
|
|
|
|
2010-03-24 22:38:48 +07:00
|
|
|
dequeue_task(rq, p, flags);
|
2009-12-17 23:00:43 +07:00
|
|
|
}
|
|
|
|
|
2010-12-09 20:15:34 +07:00
|
|
|
static void update_rq_clock_task(struct rq *rq, s64 delta)
|
2010-10-05 07:03:22 +07:00
|
|
|
{
|
2011-07-12 02:28:18 +07:00
|
|
|
/*
|
|
|
|
* In theory, the compiler should just see 0 here, and optimize out the call
|
|
|
|
* to sched_rt_avg_update. But I don't trust it...
|
|
|
|
*/
|
|
|
|
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
|
|
|
|
s64 steal = 0, irq_delta = 0;
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
|
2010-12-09 20:15:34 +07:00
|
|
|
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
|
2010-12-09 20:15:34 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since irq_time is only updated on {soft,}irq_exit, we might run into
|
|
|
|
* this case when a previous update_rq_clock() happened inside a
|
|
|
|
* {soft,}irq region.
|
|
|
|
*
|
|
|
|
* When this happens, we stop ->clock_task and only update the
|
|
|
|
* prev_irq_time stamp to account for the part that fit, so that a next
|
|
|
|
* update will consume the rest. This ensures ->clock_task is
|
|
|
|
* monotonic.
|
|
|
|
*
|
|
|
|
* It does however cause some slight misattribution of {soft,}irq
|
|
|
|
* time, a more accurate solution would be to update the irq_time using
|
|
|
|
* the current rq->clock timestamp, except that would require using
|
|
|
|
* atomic ops.
|
|
|
|
*/
|
|
|
|
if (irq_delta > delta)
|
|
|
|
irq_delta = delta;
|
|
|
|
|
|
|
|
rq->prev_irq_time += irq_delta;
|
|
|
|
delta -= irq_delta;
|
2011-07-12 02:28:18 +07:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
|
2012-02-24 14:31:31 +07:00
|
|
|
if (static_key_false((&paravirt_steal_rq_enabled))) {
|
2011-07-12 02:28:18 +07:00
|
|
|
steal = paravirt_steal_clock(cpu_of(rq));
|
|
|
|
steal -= rq->prev_steal_time_rq;
|
|
|
|
|
|
|
|
if (unlikely(steal > delta))
|
|
|
|
steal = delta;
|
|
|
|
|
|
|
|
rq->prev_steal_time_rq += steal;
|
|
|
|
delta -= steal;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2010-12-09 20:15:34 +07:00
|
|
|
rq->clock_task += delta;
|
|
|
|
|
2011-07-12 02:28:18 +07:00
|
|
|
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
|
2014-05-28 00:50:41 +07:00
|
|
|
if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
|
2011-07-12 02:28:18 +07:00
|
|
|
sched_rt_avg_update(rq, irq_delta + steal);
|
|
|
|
#endif
|
2010-10-05 07:03:22 +07:00
|
|
|
}
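The clamping of `irq_delta` in update_rq_clock_task() is easiest to see with numbers. Below is a minimal user-space sketch of just the IRQ part; `struct rq_model` and `model_update_clock_task()` are hypothetical names, and `irq_time_now` stands in for the value `irq_time_read()` would return.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the fields update_rq_clock_task() touches. */
struct rq_model {
	int64_t clock_task;	/* time made available to tasks */
	int64_t prev_irq_time;	/* irq time already accounted for */
};

/*
 * Advance clock_task by delta minus the irq time that accrued since the
 * previous update. The irq share is clamped to delta so that clock_task
 * never moves backwards; any excess stays pending in the difference
 * between irq_time_now and prev_irq_time and is consumed by a later
 * update. Returns the irq time consumed by this update.
 */
int64_t model_update_clock_task(struct rq_model *rq, int64_t delta,
				int64_t irq_time_now)
{
	int64_t irq_delta = irq_time_now - rq->prev_irq_time;

	if (irq_delta > delta)
		irq_delta = delta;

	rq->prev_irq_time += irq_delta;
	rq->clock_task += delta - irq_delta;
	return irq_delta;
}
```

With 25 units of pending irq time and a first update of `delta = 10`, only 10 units are consumed and `clock_task` stands still; a following update with `delta = 20` consumes the remaining 15 and advances `clock_task` by 5.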
|
|
|
|
|
2010-09-22 18:53:15 +07:00
|
|
|
void sched_set_stop_task(int cpu, struct task_struct *stop)
|
|
|
|
{
|
|
|
|
struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
|
|
|
|
struct task_struct *old_stop = cpu_rq(cpu)->stop;
|
|
|
|
|
|
|
|
if (stop) {
|
|
|
|
/*
|
|
|
|
* Make it appear like a SCHED_FIFO task, it's something
|
|
|
|
* userspace knows about and won't get confused about.
|
|
|
|
*
|
|
|
|
* Also, it will make PI more or less work without too
|
|
|
|
* much confusion -- but then, stop work should not
|
|
|
|
* rely on PI working anyway.
|
|
|
|
*/
|
|
|
|
sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);
|
|
|
|
|
|
|
|
stop->sched_class = &stop_sched_class;
|
|
|
|
}
|
|
|
|
|
|
|
|
cpu_rq(cpu)->stop = stop;
|
|
|
|
|
|
|
|
if (old_stop) {
|
|
|
|
/*
|
|
|
|
* Reset it back to a normal scheduling class so that
|
|
|
|
* it can die in pieces.
|
|
|
|
*/
|
|
|
|
old_stop->sched_class = &rt_sched_class;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
2007-07-09 23:51:59 +07:00
|
|
|
* __normal_prio - return the priority that is based on the static prio
|
2007-07-09 23:51:59 +07:00
|
|
|
*/
|
|
|
|
static inline int __normal_prio(struct task_struct *p)
|
|
|
|
{
|
2007-07-09 23:51:59 +07:00
|
|
|
return p->static_prio;
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
2006-06-27 16:54:51 +07:00
|
|
|
/*
|
|
|
|
* Calculate the expected normal priority: i.e. priority
|
|
|
|
* without taking RT-inheritance into account. Might be
|
|
|
|
* boosted by interactivity modifiers. Changes upon fork,
|
|
|
|
* setprio syscalls, and whenever the interactivity
|
|
|
|
* estimator recalculates.
|
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
static inline int normal_prio(struct task_struct *p)
|
2006-06-27 16:54:51 +07:00
|
|
|
{
|
|
|
|
int prio;
|
|
|
|
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime in every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with
the computational model sketched above can also effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
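The EDF selection rule and the budget accounting described in this commit message can be illustrated with a short user-space sketch. It is a simplified model, not the code in the scheduling class: `struct dl_model` and the `model_*` helpers are hypothetical, time is a plain integer, and the CBS part only tracks budget exhaustion without modelling replenishment timers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task parameters, named after the terms in the text. */
struct dl_model {
	uint64_t runtime;	/* budget granted to each instance */
	uint64_t rel_deadline;	/* relative deadline */
	uint64_t abs_deadline;	/* activation time + rel_deadline */
	uint64_t budget;	/* runtime left in the current instance */
};

/* A new instance activates at time now: refill budget, set the deadline. */
void model_dl_activate(struct dl_model *t, uint64_t now)
{
	t->abs_deadline = now + t->rel_deadline;
	t->budget = t->runtime;
}

/* EDF: index of the task with the smallest absolute deadline (n >= 1). */
size_t model_edf_pick(const struct dl_model *tasks, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (tasks[i].abs_deadline < tasks[best].abs_deadline)
			best = i;
	return best;
}

/*
 * CBS-style accounting, simplified: charge ran units of execution.
 * Returns 1 once the budget is exhausted, meaning the task may not run
 * again until its budget is replenished.
 */
int model_cbs_charge(struct dl_model *t, uint64_t ran)
{
	t->budget = ran >= t->budget ? 0 : t->budget - ran;
	return t->budget == 0;
}
```

For two tasks with (runtime, relative deadline) of (3, 20) activated at time 0 and (2, 10) activated at time 5, the absolute deadlines are 20 and 15, so the second task is picked first even though it arrived later.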
2013-11-28 17:14:43 +07:00
|
|
|
if (task_has_dl_policy(p))
|
|
|
|
prio = MAX_DL_PRIO-1;
|
|
|
|
else if (task_has_rt_policy(p))
|
2006-06-27 16:54:51 +07:00
|
|
|
prio = MAX_RT_PRIO-1 - p->rt_priority;
|
|
|
|
else
|
|
|
|
prio = __normal_prio(p);
|
|
|
|
return prio;
|
|
|
|
}
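The mapping computed by normal_prio() can be checked with a few concrete values. This is a stand-alone sketch: `enum model_policy` and `model_normal_prio()` are hypothetical, and the constants assume `MAX_DL_PRIO` is 0 and `MAX_RT_PRIO` is 100, as in the kernel headers of this era.

```c
#include <assert.h>

/* Assumed values of MAX_DL_PRIO and MAX_RT_PRIO. */
#define MODEL_MAX_DL_PRIO 0
#define MODEL_MAX_RT_PRIO 100

/* Hypothetical replacement for the task_has_*_policy() checks. */
enum model_policy { MODEL_NORMAL, MODEL_RT, MODEL_DEADLINE };

/* Lower return values mean higher scheduling priority. */
int model_normal_prio(enum model_policy policy, int rt_priority,
		      int static_prio)
{
	if (policy == MODEL_DEADLINE)
		return MODEL_MAX_DL_PRIO - 1;
	if (policy == MODEL_RT)
		return MODEL_MAX_RT_PRIO - 1 - rt_priority;
	return static_prio;
}
```

Under these assumptions a deadline task maps to -1, an RT task with `rt_priority` 99 maps to 0, and a normal task simply keeps its static priority (for example 120).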
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate the current priority, i.e. the priority
|
|
|
|
* taken into account by the scheduler. This value might
|
|
|
|
* be boosted by RT tasks, or might be boosted by
|
|
|
|
* interactivity modifiers. Will be RT if the task got
|
|
|
|
* RT-boosted. If not then it returns p->normal_prio.
|
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
static int effective_prio(struct task_struct *p)
|
2006-06-27 16:54:51 +07:00
|
|
|
{
|
|
|
|
p->normal_prio = normal_prio(p);
|
|
|
|
/*
|
|
|
|
* If we are RT tasks or we were boosted to RT priority,
|
|
|
|
* keep the priority unchanged. Otherwise, update priority
|
|
|
|
* to the normal priority:
|
|
|
|
*/
|
|
|
|
if (!rt_prio(p->prio))
|
|
|
|
return p->normal_prio;
|
|
|
|
return p->prio;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/**
|
|
|
|
* task_curr - is this task currently executing on a CPU?
|
|
|
|
* @p: the task in question.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: 1 if the task is currently executing. 0 otherwise.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
inline int task_curr(const struct task_struct *p)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return cpu_curr(task_cpu(p)) == p;
|
|
|
|
}
|
|
|
|
|
2014-10-27 21:40:52 +07:00
|
|
|
/*
|
2015-06-11 19:46:39 +07:00
|
|
|
* switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
|
|
|
|
* use the balance_callback list if you want balancing.
|
|
|
|
*
|
|
|
|
* this means any call to check_class_changed() must be followed by a call to
|
|
|
|
* balance_callback().
|
2014-10-27 21:40:52 +07:00
|
|
|
*/
|
2008-01-26 03:08:22 +07:00
|
|
|
static inline void check_class_changed(struct rq *rq, struct task_struct *p,
|
|
|
|
const struct sched_class *prev_class,
|
2011-01-17 23:03:27 +07:00
|
|
|
int oldprio)
|
2008-01-26 03:08:22 +07:00
|
|
|
{
|
|
|
|
if (prev_class != p->sched_class) {
|
|
|
|
if (prev_class->switched_from)
|
2011-01-17 23:03:27 +07:00
|
|
|
prev_class->switched_from(rq, p);
|
2015-06-11 19:46:39 +07:00
|
|
|
|
2011-01-17 23:03:27 +07:00
|
|
|
p->sched_class->switched_to(rq, p);
|
sched/deadline: Add SCHED_DEADLINE inheritance logic
Some method to deal with rt-mutexes and make sched_dl interact with
the current PI code is needed, raising anything but trivial issues that
need (in our view) to be solved with some restructuring of
the PI code (i.e., going toward a proxy execution-ish implementation).
This is under development; in the meantime, as a temporary solution,
what this commit does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and its
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for those replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with the earliest deadline.
Acting this way, we provide some kind of boosting to the lock owner,
still by using the existing (actually, slightly modified by the previous
commit) PI architecture.
We would stress the fact that this is only a surely needed, anything but
clean solution to the problem. In the end it's only a way to restart
discussion within the community. So, as always, comments, ideas, rants,
etc. are welcome! :-)
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:44 +07:00
|
|
|
} else if (oldprio != p->prio || dl_task(p))
|
2011-01-17 23:03:27 +07:00
|
|
|
p->sched_class->prio_changed(rq, p, oldprio);
|
2008-01-26 03:08:22 +07:00
|
|
|
}
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
|
2010-10-31 18:37:04 +07:00
|
|
|
{
|
|
|
|
const struct sched_class *class;
|
|
|
|
|
|
|
|
if (p->sched_class == rq->curr->sched_class) {
|
|
|
|
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
|
|
|
|
} else {
|
|
|
|
for_each_class(class) {
|
|
|
|
if (class == rq->curr->sched_class)
|
|
|
|
break;
|
|
|
|
if (class == p->sched_class) {
|
2014-06-29 03:03:57 +07:00
|
|
|
resched_curr(rq);
|
2010-10-31 18:37:04 +07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A queue event has occurred, and we're going to schedule. In
|
|
|
|
* this case, we can save a useless back to back clock update.
|
|
|
|
*/
|
2014-08-20 16:47:32 +07:00
|
|
|
if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
|
2015-01-05 17:18:11 +07:00
|
|
|
rq_clock_skip_update(rq, true);
|
2010-10-31 18:37:04 +07:00
|
|
|
}
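The cross-class branch of check_preempt_curr() relies only on the order in which for_each_class() visits the scheduling classes. A minimal sketch of that walk follows; `enum model_class` and `model_cross_class_preempt()` are hypothetical, and the order stop, deadline, RT, fair, idle is assumed to match the class list of this kernel version.

```c
#include <assert.h>

/* Hypothetical class ranks, highest priority first. */
enum model_class {
	MODEL_STOP,
	MODEL_DL,
	MODEL_RT,
	MODEL_FAIR,
	MODEL_IDLE,
	MODEL_NR_CLASSES
};

/*
 * Returns 1 if a task of class woken should force a reschedule of a
 * running task of class curr, 0 if not, and -1 if both share a class,
 * in which case the decision belongs to that class's own
 * check_preempt_curr() hook.
 */
int model_cross_class_preempt(enum model_class curr, enum model_class woken)
{
	if (woken == curr)
		return -1;

	/* Walk from the highest class down; whoever is met first wins. */
	for (int c = MODEL_STOP; c < MODEL_NR_CLASSES; c++) {
		if (c == (int)curr)
			return 0;
		if (c == (int)woken)
			return 1;
	}
	return 0;
}
```

An RT task waking up while a fair task runs yields 1 (reschedule), whereas a fair task waking up while an RT task runs yields 0.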
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2015-06-11 19:46:50 +07:00
|
|
|
/*
|
|
|
|
* This is how migration works:
|
|
|
|
*
|
|
|
|
* 1) we invoke migration_cpu_stop() on the target CPU using
|
|
|
|
* stop_one_cpu().
|
|
|
|
* 2) stopper starts to run (implicitly forcing the migrated thread
|
|
|
|
* off the CPU)
|
|
|
|
* 3) it checks whether the migrated task is still in the wrong runqueue.
|
|
|
|
* 4) if it's in the wrong runqueue then the migration thread removes
|
|
|
|
* it and puts it into the right queue.
|
|
|
|
* 5) stopper completes and stop_one_cpu() returns and the migration
|
|
|
|
* is done.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* move_queued_task - move a queued task to new rq.
|
|
|
|
*
|
|
|
|
* Returns (locked) new rq. Old rq's lock is released.
|
|
|
|
*/
|
2015-06-11 19:46:51 +07:00
|
|
|
static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new_cpu)
|
2015-06-11 19:46:50 +07:00
|
|
|
{
|
|
|
|
lockdep_assert_held(&rq->lock);
|
|
|
|
|
|
|
|
p->on_rq = TASK_ON_RQ_MIGRATING;
|
2015-11-13 10:38:54 +07:00
|
|
|
dequeue_task(rq, p, 0);
|
2015-06-11 19:46:50 +07:00
|
|
|
set_task_cpu(p, new_cpu);
|
|
|
|
raw_spin_unlock(&rq->lock);
|
|
|
|
|
|
|
|
rq = cpu_rq(new_cpu);
|
|
|
|
|
|
|
|
raw_spin_lock(&rq->lock);
|
|
|
|
BUG_ON(task_cpu(p) != new_cpu);
|
|
|
|
enqueue_task(rq, p, 0);
|
2015-11-13 10:38:54 +07:00
|
|
|
p->on_rq = TASK_ON_RQ_QUEUED;
|
2015-06-11 19:46:50 +07:00
|
|
|
check_preempt_curr(rq, p, 0);
|
|
|
|
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct migration_arg {
|
|
|
|
struct task_struct *task;
|
|
|
|
int dest_cpu;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Move (not current) task off this cpu, onto dest cpu. We're doing
|
|
|
|
* this because either it can't run here any more (set_cpus_allowed()
|
|
|
|
* away from this CPU, or CPU going down), or because we're
|
|
|
|
* attempting to rebalance this task on exec (sched_exec).
|
|
|
|
*
|
|
|
|
* So we race with normal scheduler movements, but that's OK, as long
|
|
|
|
* as the task is no longer on this CPU.
|
|
|
|
*/
|
2015-06-11 19:46:51 +07:00
|
|
|
static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int dest_cpu)
|
2015-06-11 19:46:50 +07:00
|
|
|
{
|
|
|
|
if (unlikely(!cpu_active(dest_cpu)))
|
2015-06-11 19:46:51 +07:00
|
|
|
return rq;
|
2015-06-11 19:46:50 +07:00
|
|
|
|
|
|
|
/* Affinity changed (again). */
|
|
|
|
if (!cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
|
2015-06-11 19:46:51 +07:00
|
|
|
return rq;
|
2015-06-11 19:46:50 +07:00
|
|
|
|
2015-06-11 19:46:51 +07:00
|
|
|
rq = move_queued_task(rq, p, dest_cpu);
|
|
|
|
|
|
|
|
return rq;
|
2015-06-11 19:46:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* migration_cpu_stop - this will be executed by a highprio stopper thread
|
|
|
|
* and performs thread migration by bumping thread off CPU then
|
|
|
|
* 'pushing' onto another runqueue.
|
|
|
|
*/
|
|
|
|
static int migration_cpu_stop(void *data)
|
|
|
|
{
|
|
|
|
struct migration_arg *arg = data;
|
2015-06-11 19:46:51 +07:00
|
|
|
struct task_struct *p = arg->task;
|
|
|
|
struct rq *rq = this_rq();
|
2015-06-11 19:46:50 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The original target cpu might have gone down and we might
|
|
|
|
* be on another cpu but it doesn't matter.
|
|
|
|
*/
|
|
|
|
local_irq_disable();
|
|
|
|
/*
|
|
|
|
* We need to explicitly wake pending tasks before running
|
|
|
|
* __migrate_task() such that we will not miss enforcing cpus_allowed
|
|
|
|
* during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
|
|
|
|
*/
|
|
|
|
sched_ttwu_pending();
|
2015-06-11 19:46:51 +07:00
|
|
|
|
|
|
|
raw_spin_lock(&p->pi_lock);
|
|
|
|
raw_spin_lock(&rq->lock);
|
|
|
|
/*
|
|
|
|
* If task_rq(p) != rq, it cannot be migrated here, because we're
|
|
|
|
* holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
|
|
|
|
* we're holding p->pi_lock.
|
|
|
|
*/
|
|
|
|
if (task_rq(p) == rq && task_on_rq_queued(p))
|
|
|
|
rq = __migrate_task(rq, p, arg->dest_cpu);
|
|
|
|
raw_spin_unlock(&rq->lock);
|
|
|
|
raw_spin_unlock(&p->pi_lock);
|
|
|
|
|
2015-06-11 19:46:50 +07:00
|
|
|
local_irq_enable();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-05-15 22:43:35 +07:00
|
|
|
/*
|
|
|
|
* sched_class::set_cpus_allowed must do the below, but is not required to
|
|
|
|
* actually call this function.
|
|
|
|
*/
|
|
|
|
void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
|
2015-06-11 19:46:50 +07:00
|
|
|
{
|
|
|
|
cpumask_copy(&p->cpus_allowed, new_mask);
|
|
|
|
p->nr_cpus_allowed = cpumask_weight(new_mask);
|
|
|
|
}
|
|
|
|
|
2015-05-15 22:43:35 +07:00
|
|
|
void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
|
|
|
|
{
|
2015-05-15 22:43:36 +07:00
|
|
|
struct rq *rq = task_rq(p);
|
|
|
|
bool queued, running;
|
|
|
|
|
2015-05-15 22:43:35 +07:00
|
|
|
lockdep_assert_held(&p->pi_lock);
|
2015-05-15 22:43:36 +07:00
|
|
|
|
|
|
|
queued = task_on_rq_queued(p);
|
|
|
|
running = task_current(rq, p);
|
|
|
|
|
|
|
|
if (queued) {
|
|
|
|
/*
|
|
|
|
* Because __kthread_bind() calls this on blocked tasks without
|
|
|
|
* holding rq->lock.
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&rq->lock);
|
2015-09-30 22:44:13 +07:00
|
|
|
dequeue_task(rq, p, DEQUEUE_SAVE);
|
2015-05-15 22:43:36 +07:00
|
|
|
}
|
|
|
|
if (running)
|
|
|
|
put_prev_task(rq, p);
|
|
|
|
|
2015-05-15 22:43:35 +07:00
|
|
|
p->sched_class->set_cpus_allowed(p, new_mask);
|
2015-05-15 22:43:36 +07:00
|
|
|
|
|
|
|
if (running)
|
|
|
|
p->sched_class->set_curr_task(rq);
|
|
|
|
if (queued)
|
2015-09-30 22:44:13 +07:00
|
|
|
enqueue_task(rq, p, ENQUEUE_RESTORE);
|
2015-05-15 22:43:35 +07:00
|
|
|
}
|
|
|
|
|
2015-06-11 19:46:50 +07:00
|
|
|
/*
|
|
|
|
* Change a given task's CPU affinity. Migrate the thread to a
|
|
|
|
* proper CPU and schedule it away if the CPU it's executing on
|
|
|
|
* is removed from the allowed bitmask.
|
|
|
|
*
|
|
|
|
* NOTE: the caller must have a valid reference to the task, the
|
|
|
|
* task must not exit() & deallocate itself prematurely. The
|
|
|
|
* call is not atomic; no spinlocks may be held.
|
|
|
|
*/
|
2015-05-15 22:43:34 +07:00
|
|
|
static int __set_cpus_allowed_ptr(struct task_struct *p,
|
|
|
|
const struct cpumask *new_mask, bool check)
|
2015-06-11 19:46:50 +07:00
|
|
|
{
|
2016-03-10 18:54:08 +07:00
|
|
|
const struct cpumask *cpu_valid_mask = cpu_active_mask;
|
2015-06-11 19:46:50 +07:00
|
|
|
unsigned int dest_cpu;
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
|
|
|
struct rq *rq;
|
2015-06-11 19:46:50 +07:00
|
|
|
int ret = 0;
|
|
|
|
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = task_rq_lock(p, &rf);
|
2015-06-11 19:46:50 +07:00
|
|
|
|
2016-03-10 18:54:08 +07:00
|
|
|
if (p->flags & PF_KTHREAD) {
|
|
|
|
/*
|
|
|
|
* Kernel threads are allowed on online && !active CPUs
|
|
|
|
*/
|
|
|
|
cpu_valid_mask = cpu_online_mask;
|
|
|
|
}
|
|
|
|
|
2015-05-15 22:43:34 +07:00
|
|
|
/*
|
|
|
|
* Must re-check here, to close a race against __kthread_bind(),
|
|
|
|
* sched_setaffinity() is not guaranteed to observe the flag.
|
|
|
|
*/
|
|
|
|
if (check && (p->flags & PF_NO_SETAFFINITY)) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2015-06-11 19:46:50 +07:00
|
|
|
if (cpumask_equal(&p->cpus_allowed, new_mask))
|
|
|
|
goto out;
|
|
|
|
|
2016-03-10 18:54:08 +07:00
|
|
|
if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
|
2015-06-11 19:46:50 +07:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
do_set_cpus_allowed(p, new_mask);
|
|
|
|
|
2016-03-10 18:54:08 +07:00
|
|
|
if (p->flags & PF_KTHREAD) {
|
|
|
|
/*
|
|
|
|
* For kernel threads that do indeed end up on online &&
|
|
|
|
* !active we want to ensure they are strict per-cpu threads.
|
|
|
|
*/
|
|
|
|
WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
|
|
|
|
!cpumask_intersects(new_mask, cpu_active_mask) &&
|
|
|
|
p->nr_cpus_allowed != 1);
|
|
|
|
}
|
|
|
|
|
2015-06-11 19:46:50 +07:00
|
|
|
/* Can the task run on the task's current CPU? If so, we're done */
|
|
|
|
if (cpumask_test_cpu(task_cpu(p), new_mask))
|
|
|
|
goto out;
|
|
|
|
|
2016-03-10 18:54:08 +07:00
|
|
|
dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
|
2015-06-11 19:46:50 +07:00
|
|
|
if (task_running(rq, p) || p->state == TASK_WAKING) {
|
|
|
|
struct migration_arg arg = { p, dest_cpu };
|
|
|
|
/* Need help from migration thread: drop lock and wait. */
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2015-06-11 19:46:50 +07:00
|
|
|
stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
|
|
|
|
tlb_migrate_finish(p->mm);
|
|
|
|
return 0;
|
2015-06-11 19:46:54 +07:00
|
|
|
} else if (task_on_rq_queued(p)) {
|
|
|
|
/*
|
|
|
|
* OK, since we're going to drop the lock immediately
|
|
|
|
* afterwards anyway.
|
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, rf.cookie);
|
2015-06-11 19:46:51 +07:00
|
|
|
rq = move_queued_task(rq, p, dest_cpu);
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_repin_lock(&rq->lock, rf.cookie);
|
2015-06-11 19:46:54 +07:00
|
|
|
}
|
2015-06-11 19:46:50 +07:00
|
|
|
out:
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2015-06-11 19:46:50 +07:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
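The mask validation performed by __set_cpus_allowed_ptr() can be looked at in isolation. What follows is a minimal userspace sketch, not kernel code. It assumes at most 64 CPUs stored in a uint64_t, and toy_pick_dest_cpu() and its parameters are hypothetical names introduced only for illustration. It mirrors two steps of the function: kernel threads are validated against the online mask while all other tasks are validated against the active mask, and a request whose mask does not intersect the valid mask is rejected with -EINVAL. Where the kernel calls cpumask_any_and(), the sketch simply takes the lowest set bit.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the destination choice: one bit per CPU, bit n == CPU n.
 * Hypothetical helper, not part of the kernel API.
 */
static int toy_pick_dest_cpu(uint64_t new_mask, uint64_t active_mask,
			     uint64_t online_mask, bool is_kthread)
{
	/* Kernel threads may use online && !active CPUs; other tasks may not. */
	uint64_t valid_mask = is_kthread ? online_mask : active_mask;
	uint64_t candidates = new_mask & valid_mask;
	int cpu;

	/* No overlap between the requested and the valid CPUs: reject. */
	if (!candidates)
		return -EINVAL;

	/* Lowest set bit; candidates is nonzero, so this stops below 64. */
	for (cpu = 0; !(candidates & ((uint64_t)1 << cpu)); cpu++)
		;

	return cpu;
}
```

With CPUs 0-1 active and CPUs 0-2 online, a request for CPU 2 alone is refused for an ordinary task but accepted for a kernel thread, which is the online && !active case the comment in the function describes.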
2015-05-15 22:43:34 +07:00
|
|
|
|
|
|
|
int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
|
|
|
|
{
|
|
|
|
return __set_cpus_allowed_ptr(p, new_mask, false);
|
|
|
|
}
|
2015-06-11 19:46:50 +07:00
|
|
|
EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
|
2007-07-09 23:51:58 +07:00
|
|
|
{
|
2009-12-17 00:04:36 +07:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
|
|
|
/*
|
|
|
|
* We should never call set_task_cpu() on a blocked task,
|
|
|
|
* ttwu() will sort out the placement.
|
|
|
|
*/
|
2009-12-17 19:16:31 +07:00
|
|
|
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
|
2014-10-09 01:33:48 +07:00
|
|
|
!p->on_rq);
|
2011-04-05 22:23:51 +07:00
|
|
|
|
2015-11-13 10:38:54 +07:00
|
|
|
/*
|
|
|
|
* Migrating fair class task must have p->on_rq = TASK_ON_RQ_MIGRATING,
|
|
|
|
* because schedstat_wait_{start,end} rebase migrating task's wait_start
|
|
|
|
* time relying on p->on_rq.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(p->state == TASK_RUNNING &&
|
|
|
|
p->sched_class == &fair_sched_class &&
|
|
|
|
(p->on_rq && !task_on_rq_migrating(p)));
|
|
|
|
|
2011-04-05 22:23:51 +07:00
|
|
|
#ifdef CONFIG_LOCKDEP
|
2011-06-03 22:37:07 +07:00
|
|
|
/*
|
|
|
|
* The caller should hold either p->pi_lock or rq->lock, when changing
|
|
|
|
* a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
|
|
|
|
*
|
|
|
|
* sched_move_task() holds both and thus holding either pins the cgroup,
|
2012-06-22 18:36:05 +07:00
|
|
|
* see task_group().
|
2011-06-03 22:37:07 +07:00
|
|
|
*
|
|
|
|
* Furthermore, all task_rq users should acquire both locks, see
|
|
|
|
* task_rq_lock().
|
|
|
|
*/
|
2011-04-05 22:23:51 +07:00
|
|
|
WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
|
|
|
|
lockdep_is_held(&task_rq(p)->lock)));
|
|
|
|
#endif
|
2009-12-17 00:04:36 +07:00
|
|
|
#endif
|
|
|
|
|
2009-05-05 15:49:59 +07:00
|
|
|
trace_sched_migrate_task(p, new_cpu);
|
2008-12-10 14:08:22 +07:00
|
|
|
|
2009-12-22 21:43:19 +07:00
|
|
|
if (task_cpu(p) != new_cpu) {
|
2012-10-04 18:18:30 +07:00
|
|
|
if (p->sched_class->migrate_task_rq)
|
2015-09-23 13:55:59 +07:00
|
|
|
p->sched_class->migrate_task_rq(p);
|
2009-12-22 21:43:19 +07:00
|
|
|
p->se.nr_migrations++;
|
2015-04-18 01:05:30 +07:00
|
|
|
perf_event_task_migrate(p);
|
2009-12-22 21:43:19 +07:00
|
|
|
}
|
2007-07-09 23:51:59 +07:00
|
|
|
|
|
|
|
__set_task_cpu(p, new_cpu);
|
2007-07-09 23:51:58 +07:00
|
|
|
}
|
|
|
|
|
2013-10-07 17:29:16 +07:00
|
|
|
static void __migrate_swap_task(struct task_struct *p, int cpu)
|
|
|
|
{
|
2014-08-20 16:47:32 +07:00
|
|
|
if (task_on_rq_queued(p)) {
|
2013-10-07 17:29:16 +07:00
|
|
|
struct rq *src_rq, *dst_rq;
|
|
|
|
|
|
|
|
src_rq = task_rq(p);
|
|
|
|
dst_rq = cpu_rq(cpu);
|
|
|
|
|
2015-11-13 10:38:54 +07:00
|
|
|
p->on_rq = TASK_ON_RQ_MIGRATING;
|
2013-10-07 17:29:16 +07:00
|
|
|
deactivate_task(src_rq, p, 0);
|
|
|
|
set_task_cpu(p, cpu);
|
|
|
|
activate_task(dst_rq, p, 0);
|
2015-11-13 10:38:54 +07:00
|
|
|
p->on_rq = TASK_ON_RQ_QUEUED;
|
2013-10-07 17:29:16 +07:00
|
|
|
check_preempt_curr(dst_rq, p, 0);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Task isn't running anymore; make it appear like we migrated
|
|
|
|
* it before it went to sleep. This means on wakeup we make the
|
|
|
|
* previous cpu our target instead of where it really is.
|
|
|
|
*/
|
|
|
|
p->wake_cpu = cpu;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
struct migration_swap_arg {
|
|
|
|
struct task_struct *src_task, *dst_task;
|
|
|
|
int src_cpu, dst_cpu;
|
|
|
|
};
|
|
|
|
|
|
|
|
static int migrate_swap_stop(void *data)
|
|
|
|
{
|
|
|
|
struct migration_swap_arg *arg = data;
|
|
|
|
struct rq *src_rq, *dst_rq;
|
|
|
|
int ret = -EAGAIN;
|
|
|
|
|
2015-10-09 23:36:29 +07:00
|
|
|
if (!cpu_active(arg->src_cpu) || !cpu_active(arg->dst_cpu))
|
|
|
|
return -EAGAIN;
|
|
|
|
|
2013-10-07 17:29:16 +07:00
|
|
|
src_rq = cpu_rq(arg->src_cpu);
|
|
|
|
dst_rq = cpu_rq(arg->dst_cpu);
|
|
|
|
|
sched: Fix race in migrate_swap_stop()
There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
places with task T, on CPU B.
Task P:
- call migrate_swap
Task T:
- go to sleep, removing itself from the runqueue
Task P:
- double lock the runqueues on CPU A & B
Task T:
- get woken up, place itself on the runqueue of CPU C
Task P:
- see that task T is on a runqueue, and pretend to remove it
from the runqueue on CPU B
Now CPUs B & C both have corrupted scheduler data structures.
This patch fixes it, by holding the pi_lock for both of the tasks
involved in the migrate swap. This prevents task T from waking up,
and placing itself onto another runqueue, until after migrate_swap
has released all locks.
This means that, when migrate_swap checks, task T will be either
on the runqueue where it was originally seen, or not on any
runqueue at all. Migrate_swap deals correctly with both of those cases.
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: hannes@cmpxchg.org
Cc: aarcange@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: tglx@linutronix.de
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-11 01:17:22 +07:00
|
|
|
double_raw_lock(&arg->src_task->pi_lock,
|
|
|
|
&arg->dst_task->pi_lock);
|
2013-10-07 17:29:16 +07:00
|
|
|
double_rq_lock(src_rq, dst_rq);
|
2015-10-09 23:36:29 +07:00
|
|
|
|
2013-10-07 17:29:16 +07:00
|
|
|
if (task_cpu(arg->dst_task) != arg->dst_cpu)
|
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
if (task_cpu(arg->src_task) != arg->src_cpu)
|
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
if (!cpumask_test_cpu(arg->dst_cpu, tsk_cpus_allowed(arg->src_task)))
|
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
if (!cpumask_test_cpu(arg->src_cpu, tsk_cpus_allowed(arg->dst_task)))
|
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
__migrate_swap_task(arg->src_task, arg->dst_cpu);
|
|
|
|
__migrate_swap_task(arg->dst_task, arg->src_cpu);
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
unlock:
|
|
|
|
double_rq_unlock(src_rq, dst_rq);
|
2013-10-11 01:17:22 +07:00
|
|
|
raw_spin_unlock(&arg->dst_task->pi_lock);
|
|
|
|
raw_spin_unlock(&arg->src_task->pi_lock);
|
2013-10-07 17:29:16 +07:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
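migrate_swap_stop() takes two locks of the same kind (the two pi_locks) before touching either task, which raises the question of acquisition order. A common way to avoid an ABBA deadlock in that situation is to always take the lower-addressed lock first. The sketch below assumes that approach and shows it in userspace with C11 atomics. The toy_lock type and every toy_* function are hypothetical stand-ins, not the kernel's raw_spinlock_t or double_raw_lock().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace spinlock; stands in for a raw spinlock. */
struct toy_lock {
	atomic_flag held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;
}

static void toy_lock_release(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Take both locks, lower address first, so every caller agrees on the order. */
static void toy_double_lock(struct toy_lock *a, struct toy_lock *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_lock *tmp = a;

		a = b;
		b = tmp;
	}

	toy_lock_acquire(a);
	if (b != a)	/* assumption of this sketch: tolerate the same lock twice */
		toy_lock_acquire(b);
}

static void toy_double_unlock(struct toy_lock *a, struct toy_lock *b)
{
	toy_lock_release(a);
	if (b != a)
		toy_lock_release(b);
}
```

Because the order depends only on the addresses, two callers that pass the same pair in opposite argument order still acquire the locks in the same sequence.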
|
|
|
|
|
|
|
/*
|
|
|
|
* Cross migrate two tasks
|
|
|
|
*/
|
|
|
|
int migrate_swap(struct task_struct *cur, struct task_struct *p)
|
|
|
|
{
|
|
|
|
struct migration_swap_arg arg;
|
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
arg = (struct migration_swap_arg){
|
|
|
|
.src_task = cur,
|
|
|
|
.src_cpu = task_cpu(cur),
|
|
|
|
.dst_task = p,
|
|
|
|
.dst_cpu = task_cpu(p),
|
|
|
|
};
|
|
|
|
|
|
|
|
if (arg.src_cpu == arg.dst_cpu)
|
|
|
|
goto out;
|
|
|
|
|
2013-10-11 19:38:20 +07:00
|
|
|
/*
|
|
|
|
* These three tests are all lockless; this is OK since all of them
|
|
|
|
* will be re-checked with proper locks held further down the line.
|
|
|
|
*/
|
2013-10-07 17:29:16 +07:00
|
|
|
if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
|
|
|
|
goto out;
|
|
|
|
|
2014-01-22 06:51:03 +07:00
|
|
|
trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu);
|
2013-10-07 17:29:16 +07:00
|
|
|
ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* wait_task_inactive - wait for a thread to unschedule.
|
|
|
|
*
|
2008-07-26 09:45:58 +07:00
|
|
|
* If @match_state is nonzero, it's the @p->state value just checked and
|
|
|
|
* not expected to change. If it changes, i.e. @p might have woken up,
|
|
|
|
* then return zero. When we succeed in waiting for @p to be off its CPU,
|
|
|
|
* we return a positive number (its total switch count). If a second call
|
|
|
|
* a short while later returns the same number, the caller can be sure that
|
|
|
|
* @p has remained unscheduled the whole time.
|
|
|
|
*
|
2005-04-17 05:20:36 +07:00
|
|
|
* The caller must ensure that the task *will* unschedule sometime soon,
|
|
|
|
* else this function might spin for a *long* time. This function can't
|
|
|
|
* be called with interrupts off, or it may introduce deadlock with
|
|
|
|
* smp_call_function() if an IPI is sent by the same process we are
|
|
|
|
* waiting to become inactive.
|
|
|
|
*/
|
2008-07-26 09:45:58 +07:00
|
|
|
unsigned long wait_task_inactive(struct task_struct *p, long match_state)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-08-20 16:47:32 +07:00
|
|
|
int running, queued;
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
2008-07-26 09:45:58 +07:00
|
|
|
unsigned long ncsw;
|
2006-07-03 14:25:42 +07:00
|
|
|
struct rq *rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-15 22:00:14 +07:00
|
|
|
for (;;) {
|
|
|
|
/*
|
|
|
|
* We do the initial early heuristics without holding
|
|
|
|
* any task-queue locks at all. We'll only try to get
|
|
|
|
* the runqueue lock when things look like they will
|
|
|
|
* work out!
|
|
|
|
*/
|
|
|
|
rq = task_rq(p);
|
Fix possible runqueue lock starvation in wait_task_inactive()
Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.
He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it won't do while spinning in an
interrupt handler).
This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section. If there were races in the optimistic wait, or a preemption
event scheduled the process away, we simply re-synchronize, and start
over.
So the code now looks like this:
    repeat:
        /* Unlocked, optimistic looping! */
        rq = task_rq(p);
        while (task_running(rq, p))
            cpu_relax();
        /* Get the *real* values */
        rq = task_rq_lock(p, &flags);
        running = task_running(rq, p);
        array = p->array;
        task_rq_unlock(rq, &flags);
        /* Check them.. */
        if (unlikely(running)) {
            cpu_relax();
            goto repeat;
        }
        /* Preempted away? Yield if so.. */
        if (unlikely(array)) {
            yield();
            goto repeat;
        }
Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care. Both the runqueue used, and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.
So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_. And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.
(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).
Thanks to Miklos for all the testing and tracking it down.
Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 23:34:40 +07:00
|
|
|
|
2007-10-15 22:00:14 +07:00
|
|
|
/*
|
|
|
|
* If the task is actively running on another CPU
|
|
|
|
* still, just relax and busy-wait without holding
|
|
|
|
* any locks.
|
|
|
|
*
|
|
|
|
* NOTE! Since we don't hold any locks, it's not
|
|
|
|
* even sure that "rq" stays as the right runqueue!
|
|
|
|
* But we don't care, since "task_running()" will
|
|
|
|
* return false if the runqueue has changed and p
|
|
|
|
* is actually now running somewhere else!
|
|
|
|
*/
|
2008-07-26 09:45:58 +07:00
|
|
|
while (task_running(rq, p)) {
|
|
|
|
if (match_state && unlikely(p->state != match_state))
|
|
|
|
return 0;
|
2007-10-15 22:00:14 +07:00
|
|
|
cpu_relax();
|
2008-07-26 09:45:58 +07:00
|
|
|
}
|
2007-06-18 23:34:40 +07:00
|
|
|
|
2007-10-15 22:00:14 +07:00
|
|
|
/*
|
|
|
|
* Ok, time to look more closely! We need the rq
|
|
|
|
* lock now, to be *sure*. If we're wrong, we'll
|
|
|
|
* just go back and repeat.
|
|
|
|
*/
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = task_rq_lock(p, &rf);
|
2010-05-05 01:36:56 +07:00
|
|
|
trace_sched_wait_task(p);
|
2007-10-15 22:00:14 +07:00
|
|
|
running = task_running(rq, p);
|
2014-08-20 16:47:32 +07:00
|
|
|
queued = task_on_rq_queued(p);
|
2008-07-26 09:45:58 +07:00
|
|
|
ncsw = 0;
|
2008-08-21 06:54:44 +07:00
|
|
|
if (!match_state || p->state == match_state)
|
2008-08-21 06:54:44 +07:00
|
|
|
ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2007-06-18 23:34:40 +07:00
|
|
|
|
2008-07-26 09:45:58 +07:00
|
|
|
/*
|
|
|
|
* If it changed from the expected state, bail out now.
|
|
|
|
*/
|
|
|
|
if (unlikely(!ncsw))
|
|
|
|
break;
|
|
|
|
|
2007-10-15 22:00:14 +07:00
|
|
|
/*
|
|
|
|
* Was it really running after all now that we
|
|
|
|
* checked with the proper locks actually held?
|
|
|
|
*
|
|
|
|
* Oops. Go back and try again..
|
|
|
|
*/
|
|
|
|
if (unlikely(running)) {
|
|
|
|
cpu_relax();
|
|
|
|
continue;
|
|
|
|
}
|
2007-06-18 23:34:40 +07:00
|
|
|
|
2007-10-15 22:00:14 +07:00
|
|
|
/*
|
|
|
|
* It's not enough that it's not actively running,
|
|
|
|
* it must be off the runqueue _entirely_, and not
|
|
|
|
* preempted!
|
|
|
|
*
|
2009-03-17 02:58:09 +07:00
|
|
|
* So if it was still runnable (but just not actively
|
2007-10-15 22:00:14 +07:00
|
|
|
* running right now), it's preempted, and we should
|
|
|
|
* yield - it could be a while.
|
|
|
|
*/
|
2014-08-20 16:47:32 +07:00
|
|
|
if (unlikely(queued)) {
|
2011-02-24 06:52:21 +07:00
|
|
|
ktime_t to = ktime_set(0, NSEC_PER_SEC/HZ);
|
|
|
|
|
|
|
|
set_current_state(TASK_UNINTERRUPTIBLE);
|
|
|
|
schedule_hrtimeout(&to, HRTIMER_MODE_REL);
|
2007-10-15 22:00:14 +07:00
|
|
|
continue;
|
|
|
|
}
|
2007-06-18 23:34:40 +07:00
|
|
|
|
2007-10-15 22:00:14 +07:00
|
|
|
/*
|
|
|
|
* Ahh, all good. It wasn't running, and it wasn't
|
|
|
|
* runnable, which means that it will never become
|
|
|
|
* running in the future either. We're all done!
|
|
|
|
*/
|
|
|
|
break;
|
|
|
|
}
|
2008-07-26 09:45:58 +07:00
|
|
|
|
|
|
|
return ncsw;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
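The decisions wait_task_inactive() makes after sampling the task under the runqueue lock can be summarized as a small pure function. The sketch below is a userspace illustration under stated assumptions: struct toy_snapshot, enum toy_action and the toy_* functions are hypothetical names, and the snapshot fields stand in for the values of running, queued, the match_state comparison and p->nvcsw. It also shows why the switch count is OR-ed with LONG_MIN: forcing the top bit on makes a successful result nonzero even when the count itself is zero, so zero stays reserved for "the state changed".

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical snapshot of what is sampled while the rq lock is held. */
struct toy_snapshot {
	bool running;		/* currently executing on a CPU */
	bool queued;		/* on a runqueue, i.e. runnable */
	bool state_matches;	/* match_state was zero or equals the task state */
	unsigned long nvcsw;	/* voluntary context switch count */
};

enum toy_action {
	TOY_BAIL,	/* state changed: give up and return 0 */
	TOY_SPIN,	/* still running: relax and retry */
	TOY_SLEEP,	/* preempted but runnable: sleep a tick, then retry */
	TOY_DONE,	/* off the CPU and off the runqueue */
};

/* Zero on a state mismatch, otherwise the count with the top bit forced on. */
static unsigned long toy_encode_ncsw(const struct toy_snapshot *s)
{
	if (!s->state_matches)
		return 0;
	return s->nvcsw | (unsigned long)LONG_MIN;
}

/* Same order of checks as the loop body: mismatch, running, queued, done. */
static enum toy_action toy_next_action(const struct toy_snapshot *s)
{
	if (!toy_encode_ncsw(s))
		return TOY_BAIL;
	if (s->running)
		return TOY_SPIN;
	if (s->queued)
		return TOY_SLEEP;
	return TOY_DONE;
}
```

A caller that gets the same nonzero value from two calls a short while apart knows the count did not change in between, which is the guarantee the function's header comment describes.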
|
|
|
|
|
|
|
/***
|
|
|
|
* kick_process - kick a running thread to enter/exit the kernel
|
|
|
|
* @p: the to-be-kicked thread
|
|
|
|
*
|
|
|
|
* Cause a process which is running on another CPU to enter
|
|
|
|
* kernel-mode, without any delay. (to get signals handled.)
|
|
|
|
*
|
2011-03-31 08:57:33 +07:00
|
|
|
* NOTE: this function doesn't have to take the runqueue lock,
|
2005-04-17 05:20:36 +07:00
|
|
|
* because all it wants to ensure is that the remote task enters
|
|
|
|
* the kernel. If the IPI races and the task has been migrated
|
|
|
|
* to another CPU then no harm is done and the purpose has been
|
|
|
|
* achieved as well.
|
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
void kick_process(struct task_struct *p)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
cpu = task_cpu(p);
|
|
|
|
if ((cpu != smp_processor_id()) && task_curr(p))
|
|
|
|
smp_send_reschedule(cpu);
|
|
|
|
preempt_enable();
|
|
|
|
}
|
2009-06-13 11:27:00 +07:00
|
|
|
EXPORT_SYMBOL_GPL(kick_process);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-03-15 16:10:19 +07:00
|
|
|
/*
|
2011-04-05 22:23:45 +07:00
|
|
|
* ->cpus_allowed is protected by both rq->lock and p->pi_lock
|
2016-03-10 18:54:08 +07:00
|
|
|
*
|
|
|
|
* A few notes on cpu_active vs cpu_online:
|
|
|
|
*
|
|
|
|
* - cpu_active must be a subset of cpu_online
|
|
|
|
*
|
|
|
|
* - on cpu-up we allow per-cpu kthreads on the online && !active cpu,
|
|
|
|
* see __set_cpus_allowed_ptr(). At this point the newly online
|
|
|
|
* cpu isn't yet part of the sched domains, and balancing will not
|
|
|
|
* see it.
|
|
|
|
*
|
|
|
|
* - on cpu-down we clear cpu_active() to mask the sched domains and
|
|
|
|
* keep the load balancer from placing new tasks on the to-be-removed
|
|
|
|
* cpu. Existing tasks will remain running there and will be taken
|
|
|
|
* off.
|
|
|
|
*
|
|
|
|
* This means that fallback selection must not select !active CPUs.
|
|
|
|
* And can assume that any active CPU must be online. Conversely
|
|
|
|
* select_task_rq() below may allow selection of !active CPUs in order
|
|
|
|
* to satisfy the above rules.
|
2010-03-15 16:10:19 +07:00
|
|
|
*/
|
2009-12-17 00:04:38 +07:00
|
|
|
static int select_fallback_rq(int cpu, struct task_struct *p)
|
|
|
|
{
|
sched: do not use cpu_to_node() to find an offlined cpu's node.
If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
will return -1. As a result, cpumask_of_node(nid) will return NULL. In
this case, find_next_bit() in for_each_cpu will get a NULL pointer and
cause a panic.
Here is a call trace:
Call Trace:
<IRQ>
select_fallback_rq+0x71/0x190
try_to_wake_up+0x2cb/0x2f0
wake_up_process+0x15/0x20
hrtimer_wakeup+0x22/0x30
__run_hrtimer+0x83/0x320
hrtimer_interrupt+0x106/0x280
smp_apic_timer_interrupt+0x69/0x99
apic_timer_interrupt+0x6f/0x80
There is a hrtimer process sleeping, whose cpu has already been
offlined. When it is woken up, it tries to find another cpu to run on, and
gets a nid of -1. As a result, cpumask_of_node(-1) returns NULL, and causes
a kernel panic.
This patch fixes this problem by checking whether the nid is -1. If nid is
not -1, a cpu on the same node will be picked. Otherwise, an online cpu on
another node will be picked.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 07:33:33 +07:00
|
|
|
int nid = cpu_to_node(cpu);
|
|
|
|
const struct cpumask *nodemask = NULL;
|
sched: Fix select_fallback_rq() vs cpu_active/cpu_online
Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
supposed to finally sort the cpu_active mess, instead uncovered more.
Since CPU_STARTING is run before setting the cpu online, there's a
(small) window where the cpu is active,!online.
If during this time there's a wakeup of a task that used to reside on
that cpu select_task_rq() will use select_fallback_rq() to compute an
alternative cpu to run on since we find !online.
select_fallback_rq() however will compute the new cpu against
cpu_active, this means that it can return the same cpu it started out
with, the !online one, since that cpu is in fact marked active.
This results in us trying to schedule a task on an offline cpu and
triggering a WARN in the IPI code.
The solution proposed by Chuansheng Liu of setting cpu_active in
set_cpu_online() is buggy, firstly not all archs actually use
set_cpu_online(), secondly, not all archs call set_cpu_online() with
IRQs disabled, this means we would introduce either the same race or
the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
wrong CPU") -- albeit much narrower.
[ By setting online first and active later we have a window of
online,!active, fresh and bound kthreads have task_cpu() of 0 and
since cpu0 isn't in tsk_cpus_allowed() we end up in
select_fallback_rq() which excludes !active, resulting in a reset
of ->cpus_allowed and the thread running all over the place. ]
The solution is to re-work select_fallback_rq() to require active
_and_ online. This makes the active,!online case work as expected,
OTOH archs running CPU_STARTING after setting online are now
vulnerable to the issue from fd8a7de17 -- these are alpha and
blackfin.
Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-20 21:57:01 +07:00
|
|
|
enum { cpuset, possible, fail } state = cpuset;
|
|
|
|
int dest_cpu;
|
2009-12-17 00:04:38 +07:00
|
|
|
|
2013-02-23 07:33:33 +07:00
|
|
|
/*
|
|
|
|
* If the node that the cpu is on has been offlined, cpu_to_node()
|
|
|
|
* will return -1. There is no cpu on the node, and we should
|
|
|
|
* select a cpu on another node.
|
|
|
|
*/
|
|
|
|
if (nid != -1) {
|
|
|
|
nodemask = cpumask_of_node(nid);
|
|
|
|
|
|
|
|
/* Look for allowed, online CPU in same node. */
|
|
|
|
for_each_cpu(dest_cpu, nodemask) {
|
|
|
|
if (!cpu_active(dest_cpu))
|
|
|
|
continue;
|
|
|
|
if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
|
|
|
|
return dest_cpu;
|
|
|
|
}
|
2012-03-20 21:57:01 +07:00
|
|
|
}
|
2009-12-17 00:04:38 +07:00
|
|
|
|
2012-03-20 21:57:01 +07:00
|
|
|
for (;;) {
|
|
|
|
/* Any allowed, online CPU? */
|
2012-03-30 21:10:28 +07:00
|
|
|
for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
|
2016-06-17 02:35:04 +07:00
|
|
|
if (!(p->flags & PF_KTHREAD) && !cpu_active(dest_cpu))
|
|
|
|
continue;
|
|
|
|
if (!cpu_online(dest_cpu))
|
2012-03-20 21:57:01 +07:00
|
|
|
continue;
|
|
|
|
goto out;
|
|
|
|
}
|
2009-12-17 00:04:38 +07:00
|
|
|
|
2015-10-11 01:53:15 +07:00
|
|
|
/* No more Mr. Nice Guy. */
|
2012-03-20 21:57:01 +07:00
|
|
|
switch (state) {
|
|
|
|
case cpuset:
|
2015-10-11 01:53:15 +07:00
|
|
|
if (IS_ENABLED(CONFIG_CPUSETS)) {
|
|
|
|
cpuset_cpus_allowed_fallback(p);
|
|
|
|
state = possible;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
/* fall-through */
|
2012-03-20 21:57:01 +07:00
|
|
|
case possible:
|
|
|
|
do_set_cpus_allowed(p, cpu_possible_mask);
|
|
|
|
state = fail;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case fail:
|
|
|
|
BUG();
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
if (state != cpuset) {
|
|
|
|
/*
|
|
|
|
* Don't tell them about moving exiting tasks or
|
|
|
|
* kernel threads (both mm NULL), since they never
|
|
|
|
* leave kernel.
|
|
|
|
*/
|
|
|
|
if (p->mm && printk_ratelimit()) {
|
2014-06-05 06:11:40 +07:00
|
|
|
printk_deferred("process %d (%s) no longer affine to cpu%d\n",
|
2012-03-20 21:57:01 +07:00
|
|
|
task_pid_nr(p), p->comm, cpu);
|
|
|
|
}
|
2009-12-17 00:04:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
return dest_cpu;
|
|
|
|
}
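
The escalation in select_fallback_rq() above (try the task's current mask, then the cpuset fallback, then every possible CPU, then give up) can be sketched in user space as follows. This is only an illustrative model, not kernel code: `mask_t`, `struct toy_task`, `struct toy_system` and `toy_select_fallback()` are hypothetical names, cpumasks are modelled as 64-bit words, the node-local first pass is omitted, and the sketch returns -1 where the kernel would BUG().

```c
#include <stdint.h>

/* Hypothetical stand-in for a kernel cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

struct toy_task {
	mask_t allowed;      /* models p->cpus_allowed */
	mask_t cpuset_mask;  /* models the mask a cpuset fallback would grant */
	int is_kthread;      /* models p->flags & PF_KTHREAD */
};

struct toy_system {
	mask_t online;       /* models cpu_online() */
	mask_t active;       /* models cpu_active() */
	mask_t possible;     /* models cpu_possible_mask */
};

static int mask_test(mask_t m, int cpu)
{
	return (int)((m >> cpu) & 1u);
}

/* Returns the chosen CPU, or -1 at the point where the kernel would BUG(). */
static int toy_select_fallback(const struct toy_system *sys, struct toy_task *p)
{
	enum { cpuset, possible, fail } state = cpuset;

	for (;;) {
		/* Any allowed CPU that is online (and active, for user tasks)? */
		for (int cpu = 0; cpu < 64; cpu++) {
			if (!mask_test(p->allowed, cpu))
				continue;
			if (!p->is_kthread && !mask_test(sys->active, cpu))
				continue;
			if (!mask_test(sys->online, cpu))
				continue;
			return cpu;
		}

		/* Nothing usable: widen the allowed mask one step and retry. */
		switch (state) {
		case cpuset:
			p->allowed = p->cpuset_mask;
			state = possible;
			break;
		case possible:
			p->allowed = sys->possible;
			state = fail;
			break;
		case fail:
			return -1;
		}
	}
}
```

Note that each escalation step overwrites the task's allowed mask in the sketch, mirroring how the original resets ->cpus_allowed before it logs that the process is no longer affine to its old CPU.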
|
|
|
|
|
2009-12-17 00:04:36 +07:00
|
|
|
/*
|
2011-04-05 22:23:45 +07:00
|
|
|
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
|
2009-12-17 00:04:36 +07:00
|
|
|
*/
|
2009-11-25 19:31:39 +07:00
|
|
|
static inline
|
2013-10-07 17:29:16 +07:00
|
|
|
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
|
2009-11-25 19:31:39 +07:00
|
|
|
{
|
2015-06-11 19:46:54 +07:00
|
|
|
lockdep_assert_held(&p->pi_lock);
|
|
|
|
|
2016-05-11 19:23:31 +07:00
|
|
|
if (tsk_nr_cpus_allowed(p) > 1)
|
2014-11-05 08:14:37 +07:00
|
|
|
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
|
2016-03-10 18:54:08 +07:00
|
|
|
else
|
|
|
|
cpu = cpumask_any(tsk_cpus_allowed(p));
|
2009-12-17 00:04:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* In order not to call set_task_cpu() on a blocking task we need
|
|
|
|
* to rely on ttwu() to place the task on a valid ->cpus_allowed
|
|
|
|
* cpu.
|
|
|
|
*
|
|
|
|
* Since this is common to all placement strategies, this lives here.
|
|
|
|
*
|
|
|
|
* [ this allows ->select_task_rq() to simply return task_cpu(p) and
|
|
|
|
* not worry about this generic constraint ]
|
|
|
|
*/
|
2011-06-16 17:23:22 +07:00
|
|
|
if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
|
2009-12-20 23:36:27 +07:00
|
|
|
!cpu_online(cpu)))
|
2009-12-17 00:04:38 +07:00
|
|
|
cpu = select_fallback_rq(task_cpu(p), p);
|
2009-12-17 00:04:36 +07:00
|
|
|
|
|
|
|
return cpu;
|
2009-11-25 19:31:39 +07:00
|
|
|
}
|
2010-04-15 12:29:59 +07:00
|
|
|
|
|
|
|
static void update_avg(u64 *avg, u64 sample)
|
|
|
|
{
|
|
|
|
s64 diff = sample - *avg;
|
|
|
|
*avg += diff >> 3;
|
|
|
|
}
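
update_avg() above is an exponentially weighted moving average in which each new sample contributes one eighth of its distance from the current average. A minimal user-space sketch of the same arithmetic, assuming fixed-width types from `<stdint.h>` in place of the kernel's u64/s64 (`toy_update_avg()` is a hypothetical name):

```c
#include <stdint.h>

/*
 * Move *avg one eighth of the way towards sample.
 * The difference is taken as a signed value so the average can also
 * decrease; like the original, this relies on >> of a negative value
 * being an arithmetic shift, which is implementation-defined in ISO C
 * but is what common compilers such as GCC and Clang provide.
 */
static void toy_update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += (uint64_t)(diff >> 3);
}
```

Starting from an average of 0, a sample of 800 moves the average to 100, and a second sample of 800 moves it to 187, since (800 - 100) >> 3 is 87.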
|
2015-05-15 22:43:34 +07:00
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
static inline int __set_cpus_allowed_ptr(struct task_struct *p,
|
|
|
|
const struct cpumask *new_mask, bool check)
|
|
|
|
{
|
|
|
|
return set_cpus_allowed_ptr(p, new_mask);
|
|
|
|
}
|
|
|
|
|
2015-06-11 19:46:50 +07:00
|
|
|
#endif /* CONFIG_SMP */
|
2009-11-25 19:31:39 +07:00
|
|
|
|
2011-04-05 22:23:43 +07:00
|
|
|
static void
|
2011-04-05 22:23:55 +07:00
|
|
|
ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
|
2009-12-03 13:08:03 +07:00
|
|
|
{
|
2011-04-05 22:23:43 +07:00
|
|
|
#ifdef CONFIG_SCHEDSTATS
|
2011-04-05 22:23:55 +07:00
|
|
|
struct rq *rq = this_rq();
|
|
|
|
|
2011-04-05 22:23:43 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
int this_cpu = smp_processor_id();
|
|
|
|
|
|
|
|
if (cpu == this_cpu) {
|
|
|
|
schedstat_inc(rq, ttwu_local);
|
|
|
|
schedstat_inc(p, se.statistics.nr_wakeups_local);
|
|
|
|
} else {
|
|
|
|
struct sched_domain *sd;
|
|
|
|
|
|
|
|
schedstat_inc(p, se.statistics.nr_wakeups_remote);
|
2011-04-18 16:24:34 +07:00
|
|
|
rcu_read_lock();
|
2011-04-05 22:23:43 +07:00
|
|
|
for_each_domain(this_cpu, sd) {
|
|
|
|
if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
|
|
|
|
schedstat_inc(sd, ttwu_wake_remote);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2011-04-18 16:24:34 +07:00
|
|
|
rcu_read_unlock();
|
2011-04-05 22:23:43 +07:00
|
|
|
}
|
2011-05-31 15:49:20 +07:00
|
|
|
|
|
|
|
if (wake_flags & WF_MIGRATED)
|
|
|
|
schedstat_inc(p, se.statistics.nr_wakeups_migrate);
|
|
|
|
|
2011-04-05 22:23:43 +07:00
|
|
|
#endif /* CONFIG_SMP */
|
|
|
|
|
|
|
|
schedstat_inc(rq, ttwu_count);
|
2009-12-03 13:08:03 +07:00
|
|
|
schedstat_inc(p, se.statistics.nr_wakeups);
|
2011-04-05 22:23:43 +07:00
|
|
|
|
|
|
|
if (wake_flags & WF_SYNC)
|
2009-12-03 13:08:03 +07:00
|
|
|
schedstat_inc(p, se.statistics.nr_wakeups_sync);
|
2011-04-05 22:23:43 +07:00
|
|
|
|
|
|
|
#endif /* CONFIG_SCHEDSTATS */
|
|
|
|
}
|
|
|
|
|
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
Mike Meyer reported the following bug:
> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected. Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
> a. thread is running.
> b. thread moves between cgroups, changes scheduling class or priority.
> c. thread sleeps OR
> d. thread involuntarily gives up cpu.
>
> a. implies:
>
> thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
> 1. dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
> delta = 0
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> 2. enqueue_task(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* thread is still on cpu at this point. */
> thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
> dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
>
> /* delta is execution time not run_delay. */
> delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> Since thread was running between enqueue_task(rq, thread) and
> dequeue_task(rq, thread), the delta above is really execution
> time and not run_delay.
>
> d. results in:
>
> __sched_info_switch(thread, next_thread)
>
> sched_info_depart(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* last_queued not updated due to being non-zero */
> return
>
> Since thread was running between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread), the execution time
> between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread) now will become
> associated with run_delay due to when last_queued was last updated.
>
This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.
By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.
It even shrinks the code (due to inlining {en,de}queue_task()):
$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
text data bss dec hex filename
64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o
64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig
Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-30 22:44:13 +07:00
|
|
|
static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
|
2011-04-05 22:23:43 +07:00
|
|
|
{
|
2009-12-03 13:08:03 +07:00
|
|
|
activate_task(rq, p, en_flags);
|
2014-08-20 16:47:32 +07:00
|
|
|
p->on_rq = TASK_ON_RQ_QUEUED;
|
2011-04-13 18:28:56 +07:00
|
|
|
|
|
|
|
/* if a worker is waking up, notify workqueue */
|
|
|
|
if (p->flags & PF_WQ_WORKER)
|
|
|
|
wq_worker_waking_up(p, cpu_of(rq));
|
2009-12-03 13:08:03 +07:00
|
|
|
}
|
|
|
|
|
2011-04-05 22:23:56 +07:00
|
|
|
/*
|
|
|
|
* Mark the task runnable and perform wakeup-preemption.
|
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
|
|
|
|
struct pin_cookie cookie)
|
2009-12-03 13:08:03 +07:00
|
|
|
{
|
|
|
|
check_preempt_curr(rq, p, wake_flags);
|
|
|
|
p->state = TASK_RUNNING;
|
2015-06-09 16:13:36 +07:00
|
|
|
trace_sched_wakeup(p);
|
|
|
|
|
2009-12-03 13:08:03 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2015-06-11 19:46:39 +07:00
|
|
|
if (p->sched_class->task_woken) {
|
|
|
|
/*
|
2015-06-11 19:46:54 +07:00
|
|
|
* Our task @p is fully woken up and running; so it's safe to
|
|
|
|
* drop the rq->lock, hereafter rq is only used for statistics.
|
2015-06-11 19:46:39 +07:00
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, cookie);
|
2009-12-03 13:08:03 +07:00
|
|
|
p->sched_class->task_woken(rq, p);
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_repin_lock(&rq->lock, cookie);
|
2015-06-11 19:46:39 +07:00
|
|
|
}
|
2009-12-03 13:08:03 +07:00
|
|
|
|
2010-12-07 05:10:31 +07:00
|
|
|
if (rq->idle_stamp) {
|
2013-04-12 06:51:02 +07:00
|
|
|
u64 delta = rq_clock(rq) - rq->idle_stamp;
|
2013-09-14 01:26:52 +07:00
|
|
|
u64 max = 2*rq->max_idle_balance_cost;
|
2009-12-03 13:08:03 +07:00
|
|
|
|
2013-09-14 01:26:51 +07:00
|
|
|
update_avg(&rq->avg_idle, delta);
|
|
|
|
|
|
|
|
if (rq->avg_idle > max)
|
2009-12-03 13:08:03 +07:00
|
|
|
rq->avg_idle = max;
|
2013-09-14 01:26:51 +07:00
|
|
|
|
2009-12-03 13:08:03 +07:00
|
|
|
rq->idle_stamp = 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
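
The idle accounting at the end of ttwu_do_wakeup() above can be modelled in user space as shown below. This is a sketch under stated assumptions: `struct toy_rq` and `toy_account_idle_end()` are hypothetical, the clock is a plain field rather than rq_clock(), and the averaging step is written inline with the same one-eighth weighting that update_avg() uses.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down model of the runqueue fields used here. */
struct toy_rq {
	uint64_t clock;                  /* models rq_clock(rq) */
	uint64_t idle_stamp;             /* nonzero: time at which the CPU went idle */
	uint64_t avg_idle;               /* running average of idle period length */
	uint64_t max_idle_balance_cost;
};

/* Fold the idle period that just ended into avg_idle, capped at 2 * cost. */
static void toy_account_idle_end(struct toy_rq *rq)
{
	uint64_t delta, max;
	int64_t diff;

	if (!rq->idle_stamp)
		return;                  /* the CPU was not idle, nothing to account */

	delta = rq->clock - rq->idle_stamp;
	max = 2 * rq->max_idle_balance_cost;

	/* Same weighting as update_avg(): move 1/8 of the way to delta. */
	diff = (int64_t)(delta - rq->avg_idle);
	rq->avg_idle += (uint64_t)(diff >> 3);

	if (rq->avg_idle > max)
		rq->avg_idle = max;

	rq->idle_stamp = 0;              /* the idle period is over */
}
```

The cap keeps a single very long idle period from inflating the average far beyond twice the recorded balance cost.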
|
|
|
|
|
2011-04-05 22:23:57 +07:00
|
|
|
static void
|
2015-08-02 00:25:08 +07:00
|
|
|
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
|
|
|
|
struct pin_cookie cookie)
|
2011-04-05 22:23:57 +07:00
|
|
|
{
|
2016-05-11 21:10:34 +07:00
|
|
|
int en_flags = ENQUEUE_WAKEUP;
|
|
|
|
|
2015-06-11 19:46:54 +07:00
|
|
|
lockdep_assert_held(&rq->lock);
|
|
|
|
|
2011-04-05 22:23:57 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
if (p->sched_contributes_to_load)
|
|
|
|
rq->nr_uninterruptible--;
|
2016-05-11 21:10:34 +07:00
|
|
|
|
|
|
|
if (wake_flags & WF_MIGRATED)
|
2016-05-10 23:24:37 +07:00
|
|
|
en_flags |= ENQUEUE_MIGRATED;
|
2011-04-05 22:23:57 +07:00
|
|
|
#endif
|
|
|
|
|
2016-05-11 21:10:34 +07:00
|
|
|
ttwu_activate(rq, p, en_flags);
|
2015-08-02 00:25:08 +07:00
|
|
|
ttwu_do_wakeup(rq, p, wake_flags, cookie);
|
2011-04-05 22:23:57 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called in case the task @p isn't fully descheduled from its runqueue,
|
|
|
|
* in this case we must do a remote wakeup. It's a 'light' wakeup though,
|
|
|
|
* since all we need to do is flip p->state to TASK_RUNNING, since
|
|
|
|
* the task is still ->on_rq.
|
|
|
|
*/
|
|
|
|
static int ttwu_remote(struct task_struct *p, int wake_flags)
|
|
|
|
{
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
2011-04-05 22:23:57 +07:00
|
|
|
struct rq *rq;
|
|
|
|
int ret = 0;
|
|
|
|
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = __task_rq_lock(p, &rf);
|
2014-08-20 16:47:32 +07:00
|
|
|
if (task_on_rq_queued(p)) {
|
2013-04-12 06:51:00 +07:00
|
|
|
/* check_preempt_curr() may use rq clock */
|
|
|
|
update_rq_clock(rq);
|
2015-08-02 00:25:08 +07:00
|
|
|
ttwu_do_wakeup(rq, p, wake_flags, rf.cookie);
|
2011-04-05 22:23:57 +07:00
|
|
|
ret = 1;
|
|
|
|
}
|
2015-08-01 02:28:18 +07:00
|
|
|
__task_rq_unlock(rq, &rf);
|
2011-04-05 22:23:57 +07:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-04-05 22:23:58 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2014-06-05 00:31:18 +07:00
|
|
|
void sched_ttwu_pending(void)
|
2011-04-05 22:23:58 +07:00
|
|
|
{
|
|
|
|
struct rq *rq = this_rq();
|
2011-09-12 18:06:17 +07:00
|
|
|
struct llist_node *llist = llist_del_all(&rq->wake_list);
|
2015-08-02 00:25:08 +07:00
|
|
|
struct pin_cookie cookie;
|
2011-09-12 18:06:17 +07:00
|
|
|
struct task_struct *p;
|
2014-06-05 00:31:18 +07:00
|
|
|
unsigned long flags;
|
2011-04-05 22:23:58 +07:00
|
|
|
|
2014-06-05 00:31:18 +07:00
|
|
|
if (!llist)
|
|
|
|
return;
|
|
|
|
|
|
|
|
raw_spin_lock_irqsave(&rq->lock, flags);
|
2015-08-02 00:25:08 +07:00
|
|
|
cookie = lockdep_pin_lock(&rq->lock);
|
2011-04-05 22:23:58 +07:00
|
|
|
|
2011-09-12 18:06:17 +07:00
|
|
|
while (llist) {
|
2016-05-23 16:19:07 +07:00
|
|
|
int wake_flags = 0;
|
|
|
|
|
2011-09-12 18:06:17 +07:00
|
|
|
p = llist_entry(llist, struct task_struct, wake_entry);
|
|
|
|
llist = llist_next(llist);
|
2016-05-23 16:19:07 +07:00
|
|
|
|
|
|
|
if (p->sched_remote_wakeup)
|
|
|
|
wake_flags = WF_MIGRATED;
|
|
|
|
|
|
|
|
ttwu_do_activate(rq, p, wake_flags, cookie);
|
2011-04-05 22:23:58 +07:00
|
|
|
}
|
|
|
|
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, cookie);
|
2014-06-05 00:31:18 +07:00
|
|
|
raw_spin_unlock_irqrestore(&rq->lock, flags);
|
2011-04-05 22:23:58 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
void scheduler_ipi(void)
|
|
|
|
{
|
2013-08-14 19:55:31 +07:00
|
|
|
/*
|
|
|
|
* Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
|
|
|
|
* TIF_NEED_RESCHED remotely (for the first time) will also send
|
|
|
|
* this IPI.
|
|
|
|
*/
|
2013-11-20 18:22:37 +07:00
|
|
|
preempt_fold_need_resched();
|
2013-08-14 19:55:31 +07:00
|
|
|
|
2014-03-19 03:12:53 +07:00
|
|
|
if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
|
2011-07-20 05:07:25 +07:00
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Not all reschedule IPI handlers call irq_enter/irq_exit, since
|
|
|
|
* traditionally all their work was done from the interrupt return
|
|
|
|
* path. Now that we actually do some work, we need to make sure
|
|
|
|
* we do call them.
|
|
|
|
*
|
|
|
|
 * Some archs already do call them; luckily irq_enter/exit nest
|
|
|
|
* properly.
|
|
|
|
*
|
|
|
|
* Arguably we should visit all archs and update all handlers,
|
|
|
|
* however a fair share of IPIs are still resched only so this would
|
|
|
|
* somewhat pessimize the simple resched case.
|
|
|
|
*/
|
|
|
|
irq_enter();
|
2011-09-12 18:06:17 +07:00
|
|
|
sched_ttwu_pending();
|
2011-10-04 05:09:00 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if someone kicked us for doing the nohz idle load balance.
|
|
|
|
*/
|
2013-06-05 15:13:11 +07:00
|
|
|
if (unlikely(got_nohz_idle_kick())) {
|
2011-10-04 05:09:01 +07:00
|
|
|
this_rq()->idle_balance = 1;
|
2011-10-04 05:09:00 +07:00
|
|
|
raise_softirq_irqoff(SCHED_SOFTIRQ);
|
2011-10-04 05:09:01 +07:00
|
|
|
}
|
2011-07-20 05:07:25 +07:00
|
|
|
irq_exit();
|
2011-04-05 22:23:58 +07:00
|
|
|
}
|
|
|
|
|
2016-05-23 16:19:07 +07:00
|
|
|
static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
|
2011-04-05 22:23:58 +07:00
|
|
|
{
|
2014-06-05 00:31:18 +07:00
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
|
2016-05-23 16:19:07 +07:00
|
|
|
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
|
|
|
|
|
2014-06-05 00:31:18 +07:00
|
|
|
if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
|
|
|
|
if (!set_nr_if_polling(rq->idle))
|
|
|
|
smp_send_reschedule(cpu);
|
|
|
|
else
|
|
|
|
trace_sched_wake_idle_without_ipi(cpu);
|
|
|
|
}
|
2011-04-05 22:23:58 +07:00
|
|
|
}
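/*
 * Illustrative sketch only, not kernel code. ttwu_queue_remote() above kicks
 * the target CPU only when llist_add() reports that the wake list was empty
 * before the push; later producers can rely on the kick already in flight.
 * The userspace model below shows that "first adder notifies" idea with C11
 * atomics. All sketch_* names and the struct layouts are hypothetical
 * stand-ins and do not reproduce the kernel's llist implementation.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task linked through its wake_entry. */
struct sketch_node {
	struct sketch_node *next;
	int id;
};

/* Hypothetical stand-in for rq->wake_list. */
struct sketch_head {
	_Atomic(struct sketch_node *) first;
};

/* Push @n onto @h; returns true if the list was empty before the push. */
static bool sketch_llist_add(struct sketch_node *n, struct sketch_head *h)
{
	struct sketch_node *old =
		atomic_load_explicit(&h->first, memory_order_relaxed);

	do {
		/* On CAS failure @old is refreshed, so relink and retry. */
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&h->first, &old, n,
							memory_order_release,
							memory_order_relaxed));

	return old == NULL;
}

/* Detach the whole list in one step, as the consumer side does. */
static struct sketch_node *sketch_llist_del_all(struct sketch_head *h)
{
	return atomic_exchange_explicit(&h->first, (struct sketch_node *)NULL,
					memory_order_acquire);
}
```

/*
 * In this model only the push that returns true would need to notify the
 * consumer; the consumer then drains everything with one exchange.
 */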
|
2011-05-26 19:21:33 +07:00
|
|
|
|
2014-09-04 14:17:53 +07:00
|
|
|
void wake_up_if_idle(int cpu)
|
|
|
|
{
|
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
unsigned long flags;
|
|
|
|
|
2014-11-29 23:13:51 +07:00
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
if (!is_idle_task(rcu_dereference(rq->curr)))
|
|
|
|
goto out;
|
2014-09-04 14:17:53 +07:00
|
|
|
|
|
|
|
if (set_nr_if_polling(rq->idle)) {
|
|
|
|
trace_sched_wake_idle_without_ipi(cpu);
|
|
|
|
} else {
|
|
|
|
raw_spin_lock_irqsave(&rq->lock, flags);
|
|
|
|
if (is_idle_task(rq->curr))
|
|
|
|
smp_send_reschedule(cpu);
|
|
|
|
/* Else CPU is not idle, do nothing here */
|
|
|
|
raw_spin_unlock_irqrestore(&rq->lock, flags);
|
|
|
|
}
|
2014-11-29 23:13:51 +07:00
|
|
|
|
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
2014-09-04 14:17:53 +07:00
|
|
|
}
|
|
|
|
|
2012-01-26 18:44:34 +07:00
|
|
|
bool cpus_share_cache(int this_cpu, int that_cpu)
|
2011-12-07 21:07:31 +07:00
|
|
|
{
|
|
|
|
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
|
|
|
|
}
|
2011-05-26 19:21:33 +07:00
|
|
|
#endif /* CONFIG_SMP */
|
2011-04-05 22:23:58 +07:00
|
|
|
|
2016-05-11 21:10:34 +07:00
|
|
|
static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
|
2011-04-05 22:23:57 +07:00
|
|
|
{
|
|
|
|
struct rq *rq = cpu_rq(cpu);
|
2015-08-02 00:25:08 +07:00
|
|
|
struct pin_cookie cookie;
|
2011-04-05 22:23:57 +07:00
|
|
|
|
2011-05-20 11:01:10 +07:00
|
|
|
#if defined(CONFIG_SMP)
|
2012-01-26 18:44:34 +07:00
|
|
|
if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
|
2011-05-31 17:26:55 +07:00
|
|
|
sched_clock_cpu(cpu); /* sync clocks x-cpu */
|
2016-05-23 16:19:07 +07:00
|
|
|
ttwu_queue_remote(p, cpu, wake_flags);
|
2011-04-05 22:23:58 +07:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2011-04-05 22:23:57 +07:00
|
|
|
raw_spin_lock(&rq->lock);
|
2015-08-02 00:25:08 +07:00
|
|
|
cookie = lockdep_pin_lock(&rq->lock);
|
2016-05-11 21:10:34 +07:00
|
|
|
ttwu_do_activate(rq, p, wake_flags, cookie);
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, cookie);
|
2011-04-05 22:23:57 +07:00
|
|
|
raw_spin_unlock(&rq->lock);
|
2009-12-03 13:08:03 +07:00
|
|
|
}
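/*
 * Illustrative sketch only, not kernel code. On SMP builds ttwu_queue() above
 * hands the wakeup to the target CPU's wake list only when the TTWU_QUEUE
 * feature is on and the waking CPU does not share a cache with the target;
 * otherwise it takes rq->lock and activates the task directly. The model
 * below reduces that decision to a pure function. The four-CPU topology, the
 * id values and all sketch_* names are assumptions made up for the example.
 */

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4

/* Hypothetical topology: CPUs 0-1 share one LLC, CPUs 2-3 share another. */
static const int sketch_llc_id[SKETCH_NR_CPUS] = { 0, 0, 2, 2 };

/* Same comparison as cpus_share_cache(), on the made-up table above. */
static bool sketch_cpus_share_cache(int this_cpu, int that_cpu)
{
	return sketch_llc_id[this_cpu] == sketch_llc_id[that_cpu];
}

/* True when the wakeup would be queued remotely instead of done locally. */
static bool sketch_queue_remotely(bool ttwu_queue_feat, int this_cpu,
				  int target_cpu)
{
	return ttwu_queue_feat &&
	       !sketch_cpus_share_cache(this_cpu, target_cpu);
}
```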
|
|
|
|
|
2015-11-18 01:01:11 +07:00
|
|
|
/*
|
|
|
|
* Notes on Program-Order guarantees on SMP systems.
|
|
|
|
*
|
|
|
|
* MIGRATION
|
|
|
|
*
|
|
|
|
* The basic program-order guarantee on SMP systems is that when a task [t]
|
|
|
|
* migrates, all its activity on its old cpu [c0] happens-before any subsequent
|
|
|
|
* execution on its new cpu [c1].
|
|
|
|
*
|
|
|
|
* For migration (of runnable tasks) this is provided by the following means:
|
|
|
|
*
|
|
|
|
* A) UNLOCK of the rq(c0)->lock scheduling out task t
|
|
|
|
* B) migration for t is required to synchronize *both* rq(c0)->lock and
|
|
|
|
* rq(c1)->lock (if not at the same time, then in that order).
|
|
|
|
 * C) LOCK of the rq(c1)->lock scheduling in task t
|
|
|
|
*
|
|
|
|
* Transitivity guarantees that B happens after A and C after B.
|
|
|
|
* Note: we only require RCpc transitivity.
|
|
|
|
* Note: the cpu doing B need not be c0 or c1
|
|
|
|
*
|
|
|
|
* Example:
|
|
|
|
*
|
|
|
|
* CPU0 CPU1 CPU2
|
|
|
|
*
|
|
|
|
* LOCK rq(0)->lock
|
|
|
|
* sched-out X
|
|
|
|
* sched-in Y
|
|
|
|
* UNLOCK rq(0)->lock
|
|
|
|
*
|
|
|
|
* LOCK rq(0)->lock // orders against CPU0
|
|
|
|
* dequeue X
|
|
|
|
* UNLOCK rq(0)->lock
|
|
|
|
*
|
|
|
|
* LOCK rq(1)->lock
|
|
|
|
* enqueue X
|
|
|
|
* UNLOCK rq(1)->lock
|
|
|
|
*
|
|
|
|
* LOCK rq(1)->lock // orders against CPU2
|
|
|
|
* sched-out Z
|
|
|
|
* sched-in X
|
|
|
|
* UNLOCK rq(1)->lock
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* BLOCKING -- aka. SLEEP + WAKEUP
|
|
|
|
*
|
|
|
|
* For blocking we (obviously) need to provide the same guarantee as for
|
|
|
|
* migration. However the means are completely different as there is no lock
|
|
|
|
* chain to provide order. Instead we do:
|
|
|
|
*
|
|
|
|
* 1) smp_store_release(X->on_cpu, 0)
|
2016-04-04 15:57:12 +07:00
|
|
|
* 2) smp_cond_load_acquire(!X->on_cpu)
|
2015-11-18 01:01:11 +07:00
|
|
|
*
|
|
|
|
* Example:
|
|
|
|
*
|
|
|
|
* CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
|
|
|
|
*
|
|
|
|
* LOCK rq(0)->lock LOCK X->pi_lock
|
|
|
|
* dequeue X
|
|
|
|
* sched-out X
|
|
|
|
* smp_store_release(X->on_cpu, 0);
|
|
|
|
*
|
2016-04-04 15:57:12 +07:00
|
|
|
* smp_cond_load_acquire(&X->on_cpu, !VAL);
|
2015-11-18 01:01:11 +07:00
|
|
|
* X->state = WAKING
|
|
|
|
* set_task_cpu(X,2)
|
|
|
|
*
|
|
|
|
* LOCK rq(2)->lock
|
|
|
|
* enqueue X
|
|
|
|
* X->state = RUNNING
|
|
|
|
* UNLOCK rq(2)->lock
|
|
|
|
*
|
|
|
|
* LOCK rq(2)->lock // orders against CPU1
|
|
|
|
* sched-out Z
|
|
|
|
* sched-in X
|
|
|
|
* UNLOCK rq(2)->lock
|
|
|
|
*
|
|
|
|
* UNLOCK X->pi_lock
|
|
|
|
* UNLOCK rq(0)->lock
|
|
|
|
*
|
|
|
|
*
|
|
|
|
 * However, for wakeups there is a second guarantee we must provide, namely we
|
|
|
|
 * must observe the state that led to our wakeup. That is, not only must our
|
|
|
|
* task observe its own prior state, it must also observe the stores prior to
|
|
|
|
* its wakeup.
|
|
|
|
*
|
|
|
|
* This means that any means of doing remote wakeups must order the CPU doing
|
|
|
|
* the wakeup against the CPU the task is going to end up running on. This,
|
|
|
|
* however, is already required for the regular Program-Order guarantee above,
|
2016-04-04 15:57:12 +07:00
|
|
|
 * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
|
2015-11-18 01:01:11 +07:00
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
2009-12-03 13:08:03 +07:00
|
|
|
/**
|
2005-04-17 05:20:36 +07:00
|
|
|
* try_to_wake_up - wake up a thread
|
2009-12-03 13:08:03 +07:00
|
|
|
* @p: the thread to be awakened
|
2005-04-17 05:20:36 +07:00
|
|
|
* @state: the mask of task states that can be woken
|
2009-12-03 13:08:03 +07:00
|
|
|
* @wake_flags: wake modifier flags (WF_*)
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Put it on the run-queue if it's not already there. The "current"
|
|
|
|
* thread is always on the run-queue (except when the actual
|
|
|
|
* re-schedule is in progress), and as such you're allowed to do
|
|
|
|
* the simpler "current->state = TASK_RUNNING" to mark yourself
|
|
|
|
* runnable without the overhead of this.
|
|
|
|
*
|
2013-07-13 01:45:47 +07:00
|
|
|
 * Return: %true if @p was woken up, %false if it was already running
|
2009-12-03 13:08:03 +07:00
|
|
|
* or @state didn't match @p's state.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2011-04-05 22:23:54 +07:00
|
|
|
static int
|
|
|
|
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
2011-04-05 22:23:57 +07:00
|
|
|
int cpu, success = 0;
|
2008-06-27 18:41:35 +07:00
|
|
|
|
2013-08-12 23:14:00 +07:00
|
|
|
/*
|
|
|
|
* If we are going to wake up a thread waiting for CONDITION we
|
|
|
|
* need to ensure that CONDITION=1 done by the caller can not be
|
|
|
|
* reordered with p->state check below. This pairs with mb() in
|
|
|
|
* set_current_state() the waiting thread does.
|
|
|
|
*/
|
|
|
|
smp_mb__before_spinlock();
|
2011-04-05 22:23:45 +07:00
|
|
|
raw_spin_lock_irqsave(&p->pi_lock, flags);
|
2009-09-15 19:43:03 +07:00
|
|
|
if (!(p->state & state))
|
2005-04-17 05:20:36 +07:00
|
|
|
goto out;
|
|
|
|
|
2015-06-09 16:13:36 +07:00
|
|
|
trace_sched_waking(p);
|
|
|
|
|
2011-04-05 22:23:57 +07:00
|
|
|
success = 1; /* we're going to change ->state */
|
2005-04-17 05:20:36 +07:00
|
|
|
cpu = task_cpu(p);
|
|
|
|
|
2011-04-05 22:23:57 +07:00
|
|
|
if (p->on_rq && ttwu_remote(p, wake_flags))
|
|
|
|
goto stat;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
2015-10-07 19:14:13 +07:00
|
|
|
/*
|
|
|
|
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
|
|
|
|
* possible to, falsely, observe p->on_cpu == 0.
|
|
|
|
*
|
|
|
|
* One must be running (->on_cpu == 1) in order to remove oneself
|
|
|
|
* from the runqueue.
|
|
|
|
*
|
|
|
|
* [S] ->on_cpu = 1; [L] ->on_rq
|
|
|
|
* UNLOCK rq->lock
|
|
|
|
* RMB
|
|
|
|
* LOCK rq->lock
|
|
|
|
* [S] ->on_rq = 0; [L] ->on_cpu
|
|
|
|
*
|
|
|
|
* Pairs with the full barrier implied in the UNLOCK+LOCK on rq->lock
|
|
|
|
* from the consecutive calls to schedule(); the first switching to our
|
|
|
|
* task, the second putting it to sleep.
|
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
|
2009-09-15 19:43:03 +07:00
|
|
|
/*
|
2011-04-05 22:23:57 +07:00
|
|
|
* If the owning (remote) cpu is still in the middle of schedule() with
|
|
|
|
 * this task as prev, wait until it's done referencing the task.
|
2015-10-06 19:36:17 +07:00
|
|
|
*
|
|
|
|
* Pairs with the smp_store_release() in finish_lock_switch().
|
|
|
|
*
|
|
|
|
* This ensures that tasks getting woken will be fully ordered against
|
|
|
|
* their previous state and preserve Program Order.
|
2010-02-15 20:45:54 +07:00
|
|
|
*/
|
2016-04-04 15:57:12 +07:00
|
|
|
smp_cond_load_acquire(&p->on_cpu, !VAL);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-04-05 22:23:49 +07:00
|
|
|
p->sched_contributes_to_load = !!task_contributes_to_load(p);
|
2009-09-15 19:43:03 +07:00
|
|
|
p->state = TASK_WAKING;
|
2008-01-26 03:08:09 +07:00
|
|
|
|
2013-10-07 17:29:16 +07:00
|
|
|
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
|
2011-05-31 15:49:20 +07:00
|
|
|
if (task_cpu(p) != cpu) {
|
|
|
|
wake_flags |= WF_MIGRATED;
|
2011-04-05 22:23:54 +07:00
|
|
|
set_task_cpu(p, cpu);
|
2011-05-31 15:49:20 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif /* CONFIG_SMP */
|
|
|
|
|
2016-05-11 21:10:34 +07:00
|
|
|
ttwu_queue(p, cpu, wake_flags);
|
2011-04-05 22:23:57 +07:00
|
|
|
stat:
|
2016-02-05 16:08:36 +07:00
|
|
|
if (schedstat_enabled())
|
|
|
|
ttwu_stat(p, cpu, wake_flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
out:
|
2011-04-05 22:23:45 +07:00
|
|
|
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
return success;
|
|
|
|
}
|
|
|
|
|
2010-06-09 02:40:37 +07:00
|
|
|
/**
|
|
|
|
* try_to_wake_up_local - try to wake up a local task with rq lock held
|
|
|
|
* @p: the thread to be awakened
|
|
|
|
*
|
2011-04-05 22:23:50 +07:00
|
|
|
* Put @p on the run-queue if it's not already there. The caller must
|
2010-06-09 02:40:37 +07:00
|
|
|
* ensure that this_rq() is locked, @p is bound to this_rq() and not
|
2011-04-05 22:23:50 +07:00
|
|
|
* the current task.
|
2010-06-09 02:40:37 +07:00
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie)
|
2010-06-09 02:40:37 +07:00
|
|
|
{
|
|
|
|
struct rq *rq = task_rq(p);
|
|
|
|
|
2013-03-19 02:22:34 +07:00
|
|
|
if (WARN_ON_ONCE(rq != this_rq()) ||
|
|
|
|
WARN_ON_ONCE(p == current))
|
|
|
|
return;
|
|
|
|
|
2010-06-09 02:40:37 +07:00
|
|
|
lockdep_assert_held(&rq->lock);
|
|
|
|
|
2011-04-05 22:23:50 +07:00
|
|
|
if (!raw_spin_trylock(&p->pi_lock)) {
|
2015-06-11 19:46:54 +07:00
|
|
|
/*
|
|
|
|
* This is OK, because current is on_cpu, which avoids it being
|
|
|
|
* picked for load-balance and preemption/IRQs are still
|
|
|
|
* disabled avoiding further scheduler activity on it and we've
|
|
|
|
* not yet picked a replacement task.
|
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, cookie);
|
2011-04-05 22:23:50 +07:00
|
|
|
raw_spin_unlock(&rq->lock);
|
|
|
|
raw_spin_lock(&p->pi_lock);
|
|
|
|
raw_spin_lock(&rq->lock);
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_repin_lock(&rq->lock, cookie);
|
2011-04-05 22:23:50 +07:00
|
|
|
}
|
|
|
|
|
2010-06-09 02:40:37 +07:00
|
|
|
if (!(p->state & TASK_NORMAL))
|
2011-04-05 22:23:50 +07:00
|
|
|
goto out;
|
2010-06-09 02:40:37 +07:00
|
|
|
|
2015-06-09 16:13:36 +07:00
|
|
|
trace_sched_waking(p);
|
|
|
|
|
2014-08-20 16:47:32 +07:00
|
|
|
if (!task_on_rq_queued(p))
|
2011-04-05 22:23:43 +07:00
|
|
|
ttwu_activate(rq, p, ENQUEUE_WAKEUP);
|
|
|
|
|
2015-08-02 00:25:08 +07:00
|
|
|
ttwu_do_wakeup(rq, p, 0, cookie);
|
2016-02-05 16:08:36 +07:00
|
|
|
if (schedstat_enabled())
|
|
|
|
ttwu_stat(p, smp_processor_id(), 0);
|
2011-04-05 22:23:50 +07:00
|
|
|
out:
|
|
|
|
raw_spin_unlock(&p->pi_lock);
|
2010-06-09 02:40:37 +07:00
|
|
|
}
|
|
|
|
|
2009-04-28 21:01:38 +07:00
|
|
|
/**
|
|
|
|
* wake_up_process - Wake up a specific process
|
|
|
|
* @p: The process to be woken up.
|
|
|
|
*
|
|
|
|
* Attempt to wake up the nominated process and move it to the set of runnable
|
2013-07-13 01:45:47 +07:00
|
|
|
* processes.
|
|
|
|
*
|
|
|
|
* Return: 1 if the process was woken up, 0 if it was already running.
|
2009-04-28 21:01:38 +07:00
|
|
|
*
|
|
|
|
* It may be assumed that this function implies a write memory barrier before
|
|
|
|
* changing the task state if and only if any tasks are woken up.
|
|
|
|
*/
|
2008-02-08 19:19:53 +07:00
|
|
|
int wake_up_process(struct task_struct *p)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2013-01-22 02:48:17 +07:00
|
|
|
return try_to_wake_up(p, TASK_NORMAL, 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(wake_up_process);
|
|
|
|
|
2008-02-08 19:19:53 +07:00
|
|
|
int wake_up_state(struct task_struct *p, unsigned int state)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return try_to_wake_up(p, state, 0);
|
|
|
|
}
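/*
 * Illustrative sketch only, not kernel code. Both wrappers above pass a state
 * mask to try_to_wake_up(), which gives up early when (p->state & state) is
 * zero. The model below shows why a TASK_NORMAL mask reaches interruptible
 * and uninterruptible sleepers but not a running or stopped task. The SK_*
 * values are assumed to mirror the task-state bits of this kernel era and
 * sketch_state_matches() is a hypothetical helper.
 */

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed task-state bits; prefixed SK_ to mark them as stand-ins. */
#define SK_TASK_RUNNING		0x0
#define SK_TASK_INTERRUPTIBLE	0x1
#define SK_TASK_UNINTERRUPTIBLE	0x2
#define SK_TASK_STOPPED		0x4
#define SK_TASK_NORMAL		(SK_TASK_INTERRUPTIBLE | SK_TASK_UNINTERRUPTIBLE)

/* True when a task in @task_state is eligible for a wakeup using @wake_mask. */
static bool sketch_state_matches(unsigned int task_state, unsigned int wake_mask)
{
	return (task_state & wake_mask) != 0;
}
```

/*
 * A running task has no state bits set, so no mask can match it; that is the
 * "already running" case reported as 0 by the wrappers.
 */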
|
|
|
|
|
2014-09-19 16:22:39 +07:00
|
|
|
/*
|
|
|
|
* This function clears the sched_dl_entity static params.
|
|
|
|
*/
|
|
|
|
void __dl_clear_params(struct task_struct *p)
|
|
|
|
{
|
|
|
|
struct sched_dl_entity *dl_se = &p->dl;
|
|
|
|
|
|
|
|
dl_se->dl_runtime = 0;
|
|
|
|
dl_se->dl_deadline = 0;
|
|
|
|
dl_se->dl_period = 0;
|
|
|
|
dl_se->flags = 0;
|
|
|
|
dl_se->dl_bw = 0;
|
2015-01-28 21:08:03 +07:00
|
|
|
|
|
|
|
dl_se->dl_throttled = 0;
|
|
|
|
dl_se->dl_yielded = 0;
|
2014-09-19 16:22:39 +07:00
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Perform scheduler related setup for a newly forked process p.
|
|
|
|
* p is forked by current.
|
2007-07-09 23:51:59 +07:00
|
|
|
*
|
|
|
|
* __sched_fork() is basic setup used by init_idle() too:
|
|
|
|
*/
|
2013-10-07 17:29:26 +07:00
|
|
|
static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
2011-04-05 22:23:44 +07:00
|
|
|
p->on_rq = 0;
|
|
|
|
|
|
|
|
p->se.on_rq = 0;
|
2007-07-09 23:51:59 +07:00
|
|
|
p->se.exec_start = 0;
|
|
|
|
p->se.sum_exec_runtime = 0;
|
2007-08-28 17:53:24 +07:00
|
|
|
p->se.prev_sum_exec_runtime = 0;
|
2008-12-14 18:34:15 +07:00
|
|
|
p->se.nr_migrations = 0;
|
2011-01-17 23:03:27 +07:00
|
|
|
p->se.vruntime = 0;
|
2011-04-05 22:23:44 +07:00
|
|
|
INIT_LIST_HEAD(&p->se.group_node);
|
2007-08-02 22:41:40 +07:00
|
|
|
|
2015-10-23 23:16:19 +07:00
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
|
|
|
p->se.cfs_rq = NULL;
|
|
|
|
#endif
|
|
|
|
|
2007-08-02 22:41:40 +07:00
|
|
|
#ifdef CONFIG_SCHEDSTATS
|
2016-02-05 16:08:36 +07:00
|
|
|
/* Even if schedstat is disabled, there should not be garbage */
|
2010-03-11 09:37:45 +07:00
|
|
|
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
|
2007-08-02 22:41:40 +07:00
|
|
|
#endif
|
2005-06-26 04:57:29 +07:00
|
|
|
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structure of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belong to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
|
|
|
RB_CLEAR_NODE(&p->dl.rb_node);
|
2015-01-28 21:08:03 +07:00
|
|
|
init_dl_task_timer(&p->dl);
|
2014-09-19 16:22:39 +07:00
|
|
|
__dl_clear_params(p);
|
2013-11-28 17:14:43 +07:00
|
|
|
|
2008-01-26 03:08:27 +07:00
|
|
|
INIT_LIST_HEAD(&p->rt.run_list);
|
2016-01-18 21:27:07 +07:00
|
|
|
p->rt.timeout = 0;
|
|
|
|
p->rt.time_slice = sched_rr_timeslice;
|
|
|
|
p->rt.on_rq = 0;
|
|
|
|
p->rt.on_list = 0;
|
2005-06-26 04:57:29 +07:00
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
#ifdef CONFIG_PREEMPT_NOTIFIERS
|
|
|
|
INIT_HLIST_HEAD(&p->preempt_notifiers);
|
|
|
|
#endif
|
2012-10-25 19:16:43 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
|
|
|
if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
|
2013-10-07 17:28:54 +07:00
|
|
|
p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
|
2012-10-25 19:16:43 +07:00
|
|
|
p->mm->numa_scan_seq = 0;
|
|
|
|
}
|
|
|
|
|
2013-10-07 17:29:26 +07:00
|
|
|
if (clone_flags & CLONE_VM)
|
|
|
|
p->numa_preferred_nid = current->numa_preferred_nid;
|
|
|
|
else
|
|
|
|
p->numa_preferred_nid = -1;
|
|
|
|
|
2012-10-25 19:16:43 +07:00
|
|
|
p->node_stamp = 0ULL;
|
|
|
|
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
|
2012-10-25 19:16:47 +07:00
|
|
|
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
|
2012-10-25 19:16:43 +07:00
|
|
|
p->numa_work.next = &p->numa_work;
|
2014-10-31 07:13:31 +07:00
|
|
|
p->numa_faults = NULL;
|
sched/numa: Normalize faults_cpu stats and weigh by CPU use
Tracing the code that decides the active nodes has made it abundantly clear
that the naive implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will access orders
of magnitudes more memory than the threads that do all the active work.
This resulted in the node with the garbage collector being marked the only
active node in the group.
This issue is avoided if we weigh the statistics by CPU use of each task in
the numa group, instead of by how many faults each thread has occurred.
To achieve this, we normalize the number of faults to the fraction of faults
that occurred on each node, and then multiply that fraction by the fraction
of CPU time the task has used since the last time task_numa_placement was
invoked.
This way the nodes in the active node mask will be the ones where the tasks
from the numa group are most actively running, and the influence of eg. the
garbage collector and other do-little threads is properly minimized.
On a 4 node system, using CPU use statistics calculated over a longer interval
results in about 1% fewer page migrations with two 32-warehouse specjbb runs
on a 4 node system, and about 5% fewer page migrations, as well as 1% better
throughput, with two 8-warehouse specjbb runs, as compared with the shorter
term statistics kept by the scheduler.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28 05:03:45 +07:00
|
|
|
p->last_task_numa_placement = 0;
|
|
|
|
p->last_sum_exec_runtime = 0;
|
2013-10-07 17:29:21 +07:00
|
|
|
|
|
|
|
p->numa_group = NULL;
|
2012-10-25 19:16:43 +07:00
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
2015-08-11 23:24:21 +07:00
|
|
|
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
|
|
|
|
|
2012-11-22 18:16:36 +07:00
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
2015-08-11 18:00:12 +07:00
|
|
|
|
2012-11-22 18:16:36 +07:00
|
|
|
void set_numabalancing_state(bool enabled)
|
|
|
|
{
|
|
|
|
if (enabled)
|
2015-08-11 23:24:21 +07:00
|
|
|
static_branch_enable(&sched_numa_balancing);
|
2012-11-22 18:16:36 +07:00
|
|
|
else
|
2015-08-11 23:24:21 +07:00
|
|
|
static_branch_disable(&sched_numa_balancing);
|
2012-11-22 18:16:36 +07:00
|
|
|
}
|
2014-01-24 06:53:13 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
|
|
|
int sysctl_numa_balancing(struct ctl_table *table, int write,
|
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
struct ctl_table t;
|
|
|
|
int err;
|
2015-08-11 23:24:21 +07:00
|
|
|
int state = static_branch_likely(&sched_numa_balancing);
|
2014-01-24 06:53:13 +07:00
|
|
|
|
|
|
|
if (write && !capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
t = *table;
|
|
|
|
t.data = &state;
|
|
|
|
err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
|
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
if (write)
|
|
|
|
set_numabalancing_state(state);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#endif
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2016-06-08 02:43:16 +07:00
|
|
|
#ifdef CONFIG_SCHEDSTATS
|
|
|
|
|
2016-02-05 16:08:36 +07:00
|
|
|
DEFINE_STATIC_KEY_FALSE(sched_schedstats);
|
2016-06-08 02:43:16 +07:00
|
|
|
static bool __initdata __sched_schedstats = false;
|
2016-02-05 16:08:36 +07:00
|
|
|
|
|
|
|
static void set_schedstats(bool enabled)
|
|
|
|
{
|
|
|
|
if (enabled)
|
|
|
|
static_branch_enable(&sched_schedstats);
|
|
|
|
else
|
|
|
|
static_branch_disable(&sched_schedstats);
|
|
|
|
}
|
|
|
|
|
|
|
|
void force_schedstat_enabled(void)
|
|
|
|
{
|
|
|
|
if (!schedstat_enabled()) {
|
|
|
|
pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
|
|
|
|
static_branch_enable(&sched_schedstats);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __init setup_schedstats(char *str)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
if (!str)
|
|
|
|
goto out;
|
|
|
|
|
2016-06-08 02:43:16 +07:00
|
|
|
/*
|
|
|
|
* This code is called before jump labels have been set up, so we can't
|
|
|
|
* change the static branch directly just yet. Instead set a temporary
|
|
|
|
* variable so init_schedstats() can do it later.
|
|
|
|
*/
|
2016-02-05 16:08:36 +07:00
|
|
|
if (!strcmp(str, "enable")) {
|
2016-06-08 02:43:16 +07:00
|
|
|
__sched_schedstats = true;
|
2016-02-05 16:08:36 +07:00
|
|
|
ret = 1;
|
|
|
|
} else if (!strcmp(str, "disable")) {
|
2016-06-08 02:43:16 +07:00
|
|
|
__sched_schedstats = false;
|
2016-02-05 16:08:36 +07:00
|
|
|
ret = 1;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
if (!ret)
|
|
|
|
pr_warn("Unable to parse schedstats=\n");
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
__setup("schedstats=", setup_schedstats);
|
|
|
|
|
2016-06-08 02:43:16 +07:00
|
|
|
static void __init init_schedstats(void)
|
|
|
|
{
|
|
|
|
set_schedstats(__sched_schedstats);
|
|
|
|
}
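/*
 * Illustrative sketch only, not kernel code. setup_schedstats() runs before
 * jump labels are usable, so it records the requested setting in a plain
 * variable and init_schedstats() applies it later. The userspace model below
 * shows that two-phase pattern. A plain bool stands in for the static key,
 * and all sketch_* names are hypothetical.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool sketch_boot_choice;	/* stand-in for __sched_schedstats */
static bool sketch_live_switch;	/* stand-in for the static key */

/* Phase 1: parse the option early; returns 1 if @str was understood. */
static int sketch_setup(const char *str)
{
	if (!str)
		return 0;

	if (!strcmp(str, "enable")) {
		sketch_boot_choice = true;
		return 1;
	}
	if (!strcmp(str, "disable")) {
		sketch_boot_choice = false;
		return 1;
	}
	return 0;	/* unrecognized value: leave the choice untouched */
}

/* Phase 2: once the real switch is usable, apply the recorded choice. */
static void sketch_init(void)
{
	sketch_live_switch = sketch_boot_choice;
}
```

/*
 * Until sketch_init() runs, the live switch keeps its default; parsing alone
 * never flips it.
 */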
|
|
|
|
|
2016-02-05 16:08:36 +07:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
|
|
|
int sysctl_schedstats(struct ctl_table *table, int write,
|
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
struct ctl_table t;
|
|
|
|
int err;
|
|
|
|
int state = static_branch_likely(&sched_schedstats);
|
|
|
|
|
|
|
|
if (write && !capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
t = *table;
|
|
|
|
t.data = &state;
|
|
|
|
err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
|
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
if (write)
|
|
|
|
set_schedstats(state);
|
|
|
|
return err;
|
|
|
|
}
|
2016-06-08 02:43:16 +07:00
|
|
|
#endif /* CONFIG_PROC_SYSCTL */
|
|
|
|
#else /* !CONFIG_SCHEDSTATS */
|
|
|
|
static inline void init_schedstats(void) {}
|
|
|
|
#endif /* CONFIG_SCHEDSTATS */
|
2007-07-09 23:51:59 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* fork()/clone()-time setup:
|
|
|
|
*/
|
2013-11-28 17:14:43 +07:00
|
|
|
int sched_fork(unsigned long clone_flags, struct task_struct *p)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
2011-04-05 22:23:51 +07:00
|
|
|
unsigned long flags;
|
2007-07-09 23:51:59 +07:00
|
|
|
int cpu = get_cpu();
|
|
|
|
|
2013-10-07 17:29:26 +07:00
|
|
|
__sched_fork(clone_flags, p);
|
2009-12-17 00:04:35 +07:00
|
|
|
/*
|
2016-06-16 18:29:28 +07:00
|
|
|
* We mark the process as NEW here. This guarantees that
|
2009-12-17 00:04:35 +07:00
|
|
|
* nobody will actually run it, and a signal or other external
|
|
|
|
* event cannot wake it up and insert it on the runqueue either.
|
|
|
|
*/
|
2016-06-16 18:29:28 +07:00
|
|
|
p->state = TASK_NEW;
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2011-07-27 22:14:55 +07:00
|
|
|
/*
|
|
|
|
* Make sure we do not leak PI boosting priority to the child.
|
|
|
|
*/
|
|
|
|
p->prio = current->normal_prio;
|
|
|
|
|
2009-06-17 15:46:01 +07:00
|
|
|
/*
|
|
|
|
* Revert to default priority/policy on fork if requested.
|
|
|
|
*/
|
|
|
|
if (unlikely(p->sched_reset_on_fork)) {
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
the SCHED_DEADLINE implementation.
The core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking whether a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant at which a task (or, better,
an instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime in every interval of (relative)
deadline length, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with
the computational model sketched above can also effectively use the new
policy.
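The model described above can be sketched in user-space C. This is only an illustrative sketch under stated assumptions, not the code in sched/dl.c: `struct dl_task`, `dl_activate()`, `edf_pick()` and `cbs_charge()` are hypothetical names, and the replenishment rule used here (postpone the deadline by one relative deadline and refill the budget) is a simplification of CBS.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified task descriptor (not the kernel's own structure). */
struct dl_task {
	uint64_t runtime;       /* maximum budget per instance */
	uint64_t rel_deadline;  /* relative deadline */
	uint64_t abs_deadline;  /* activation time + relative deadline */
	uint64_t budget;        /* runtime left in the current instance */
};

/* A new instance activates at 'now': absolute deadline = now + relative deadline. */
static void dl_activate(struct dl_task *t, uint64_t now)
{
	t->abs_deadline = now + t->rel_deadline;
	t->budget = t->runtime;
}

/* EDF: return the task with the smallest absolute deadline (NULL if n == 0). */
static struct dl_task *edf_pick(struct dl_task *tasks, size_t n)
{
	struct dl_task *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || tasks[i].abs_deadline < best->abs_deadline)
			best = &tasks[i];
	return best;
}

/*
 * Simplified CBS accounting: charge 'delta' units of execution.
 * When the budget is exhausted, postpone the deadline and refill the
 * budget, so an overrunning task cannot steal time from the others.
 * Returns 1 if the budget was exhausted, 0 otherwise.
 */
static int cbs_charge(struct dl_task *t, uint64_t delta)
{
	if (delta < t->budget) {
		t->budget -= delta;
		return 0;
	}
	t->abs_deadline += t->rel_deadline;
	t->budget = t->runtime;
	return 1;
}
```

In this sketch a task that overruns its runtime is not blocked; it simply receives a later deadline, so EDF naturally prefers the other tasks.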
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
|
|
|
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
|
2009-06-17 15:46:01 +07:00
|
|
|
p->policy = SCHED_NORMAL;
|
2009-06-17 15:48:02 +07:00
|
|
|
p->static_prio = NICE_TO_PRIO(0);
|
2011-07-27 22:14:55 +07:00
|
|
|
p->rt_priority = 0;
|
|
|
|
} else if (PRIO_TO_NICE(p->static_prio) < 0)
|
|
|
|
p->static_prio = NICE_TO_PRIO(0);
|
|
|
|
|
|
|
|
p->prio = p->normal_prio = __normal_prio(p);
|
|
|
|
set_load_weight(p);
|
2009-06-17 15:48:02 +07:00
|
|
|
|
2009-06-17 15:46:01 +07:00
|
|
|
/*
|
|
|
|
* We don't need the reset flag anymore after the fork. It has
|
|
|
|
* fulfilled its duty:
|
|
|
|
*/
|
|
|
|
p->sched_reset_on_fork = 0;
|
|
|
|
}
|
2009-06-15 22:17:47 +07:00
|
|
|
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
2013-11-28 17:14:43 +07:00
|
|
|
if (dl_prio(p->prio)) {
|
|
|
|
put_cpu();
|
|
|
|
return -EAGAIN;
|
|
|
|
} else if (rt_prio(p->prio)) {
|
|
|
|
p->sched_class = &rt_sched_class;
|
|
|
|
} else {
|
2007-10-15 22:00:11 +07:00
|
|
|
p->sched_class = &fair_sched_class;
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
2013-11-28 17:14:43 +07:00
|
|
|
}
|
2006-06-27 16:54:51 +07:00
|
|
|
|
2016-06-16 18:29:28 +07:00
|
|
|
init_entity_runnable_average(&p->se);
|
2009-11-27 23:32:46 +07:00
|
|
|
|
2010-06-22 16:44:53 +07:00
|
|
|
/*
|
|
|
|
* The child is not yet in the pid-hash so no cgroup attach races,
|
|
|
|
* and the cgroup is pinned to this child due to cgroup_fork()
|
|
|
|
* is run before sched_fork().
|
|
|
|
*
|
|
|
|
* Silence PROVE_RCU.
|
|
|
|
*/
|
2011-04-05 22:23:51 +07:00
|
|
|
raw_spin_lock_irqsave(&p->pi_lock, flags);
|
2016-06-16 23:51:48 +07:00
|
|
|
/*
|
|
|
|
* We're setting the cpu for the first time, we don't migrate,
|
|
|
|
* so use __set_task_cpu().
|
|
|
|
*/
|
|
|
|
__set_task_cpu(p, cpu);
|
|
|
|
if (p->sched_class->task_fork)
|
|
|
|
p->sched_class->task_fork(p);
|
2011-04-05 22:23:51 +07:00
|
|
|
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
|
2009-09-10 18:42:00 +07:00
|
|
|
|
2015-06-26 01:23:37 +07:00
|
|
|
#ifdef CONFIG_SCHED_INFO
|
2007-07-09 23:51:59 +07:00
|
|
|
if (likely(sched_info_on()))
|
2006-07-14 14:24:38 +07:00
|
|
|
memset(&p->sched_info, 0, sizeof(p->sched_info));
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
2011-04-05 22:23:40 +07:00
|
|
|
#if defined(CONFIG_SMP)
|
|
|
|
p->on_cpu = 0;
|
2005-06-26 04:57:23 +07:00
|
|
|
#endif
|
2013-08-14 19:55:46 +07:00
|
|
|
init_task_preempt_count(p);
|
2010-12-01 01:51:33 +07:00
|
|
|
#ifdef CONFIG_SMP
|
sched: create "pushable_tasks" list to limit pushing to one attempt
The RT scheduler employs a "push/pull" design to actively balance tasks
within the system (on a per disjoint cpuset basis). When a task is
awoken, it is immediately determined if there are any lower priority
cpus which should be preempted. This is opposed to the way normal
SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
operation to occur before spreading out load.
When a particular RQ has more than 1 active RT task, it is said to
be in an "overloaded" state. Once this occurs, the system enters
the active balancing mode, where it will try to push the task away,
or persuade a different cpu to pull it over. The system will stay
in this state until it falls back to <= 1 queued RT
task per RQ.
However, the current implementation suffers from a limitation in the
push logic. Once overloaded, all tasks (other than current) on the
RQ are analyzed on every push operation, even if they were previously
unpushable (due to affinity, etc). What's more, the operation stops
at the first task that is unpushable and will not look at items
lower in the queue. This causes two problems:
1) We can have the same tasks analyzed over and over again during each
push, which extends out the fast path in the scheduler for no
gain. Consider a RQ that has dozens of tasks that are bound to a
core. Each one of those tasks will be encountered and skipped
for each push operation while they are queued.
2) There may be lower-priority tasks under the unpushable task that
could have been successfully pushed, but will never be considered
until either the unpushable task is cleared, or a pull operation
succeeds. The net result is a potential latency source for mid
priority tasks.
This patch aims to rectify these two conditions by introducing a new
priority sorted list: "pushable_tasks". A task is added to the list
each time it is activated or preempted. It is removed from the
list any time it is deactivated, made current, or fails to push.
This works because a push only needs to be attempted once per task.
After an initial failure to push, the other cpus will eventually try to
pull the task when the conditions are proper. This also solves the
problem that we don't completely analyze all tasks due to encountering
an unpushable task. Now every task will have a push attempted (when
appropriate).
This reduces latency both by shortening the critical section of the
rq->lock for certain workloads, and by making sure the algorithm
considers all eligible tasks in the system.
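The "one push attempt per task" idea can be illustrated with a small user-space sketch. It assumes a plain singly linked list in place of the kernel's plist, and `struct ptask`, `pushable_add()` and `push_all()` are hypothetical names; the `pushable` flag stands in for affinity and similar checks, and the actual migration is reduced to a counter.

```c
#include <stddef.h>

/* Hypothetical stand-in for a queued RT task. */
struct ptask {
	int prio;            /* lower value = higher priority */
	int pushable;        /* 0 if affinity etc. prevents migration */
	struct ptask *next;
};

/* Insert a task, keeping the list sorted with the highest priority first. */
static void pushable_add(struct ptask **head, struct ptask *t)
{
	while (*head && (*head)->prio <= t->prio)
		head = &(*head)->next;
	t->next = *head;
	*head = t;
}

/*
 * Attempt one push per queued task. Every task leaves the list after
 * its attempt, whether or not the push succeeded, so an unpushable
 * task is neither re-examined nor able to hide the tasks queued
 * behind it. Returns the number of tasks that were pushed.
 */
static int push_all(struct ptask **head)
{
	int pushed = 0;

	while (*head) {
		struct ptask *t = *head;

		*head = t->next;
		t->next = NULL;
		if (t->pushable)
			pushed++;   /* stand-in for migrating t to another CPU */
	}
	return pushed;
}
```

With this shape, a task that failed to push stays off the list until it is activated or preempted again, which is the behaviour the commit message describes.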
[ rostedt: added a couple more BUG_ONs ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
2008-12-29 21:39:53 +07:00
|
|
|
plist_node_init(&p->pushable_tasks, MAX_PRIO);
|
sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when it is the case.
Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.
The very same approach used in sched_rt is utilised:
- -deadline tasks are kept into CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
* affinity/cpusets settings of all the -deadline tasks is
always respected.
Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.
To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.
In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:38 +07:00
|
|
|
RB_CLEAR_NODE(&p->pushable_dl_tasks);
|
2010-12-01 01:51:33 +07:00
|
|
|
#endif
|
sched: create "pushable_tasks" list to limit pushing to one attempt
2008-12-29 21:39:53 +07:00
|
|
|
|
2005-06-26 04:57:29 +07:00
|
|
|
put_cpu();
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
2013-11-28 17:14:43 +07:00
|
|
|
return 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
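The admission test above can be sketched in user-space C with the same 20-bit fixed-point scale that to_ratio() uses in this file. The names `bw_ratio()` and `dl_admit()` are hypothetical, the sketch ignores the "disabled" (infinite runtime) case, and it assumes that a sum exactly equal to the cap is still admissible; the kernel's actual check may differ in these details.

```c
#include <stdint.h>

#define BW_SHIFT 20  /* fixed-point scale: 1 << 20 represents 100% of one CPU */

/* runtime / period as a fixed-point fraction; a zero period yields 0. */
static uint64_t bw_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT) / period;
}

/*
 * Admission control for one root domain with 'cpus' CPUs:
 * admit the new task only if the bandwidth already allocated plus the
 * new task's bandwidth fits under M * (dl_runtime_us / dl_period_us).
 * Returns 1 if the task is admitted, 0 otherwise.
 */
static int dl_admit(uint64_t total_bw, uint64_t new_bw, int cpus,
		    uint64_t dl_runtime_us, uint64_t dl_period_us)
{
	uint64_t cap = (uint64_t)cpus * bw_ratio(dl_period_us, dl_runtime_us);

	return total_bw + new_bw <= cap;
}
```

For example, with a 950000/1000000 setting on a 2-CPU root domain, a task with a 10 ms runtime every 100 ms asks for roughly 10% of one CPU and is admitted while the domain is otherwise empty.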
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
|
|
|
unsigned long to_ratio(u64 period, u64 runtime)
|
|
|
|
{
|
|
|
|
if (runtime == RUNTIME_INF)
|
|
|
|
return 1ULL << 20;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Doing this here saves a lot of checks in all
|
|
|
|
* the calling paths, and returning zero seems
|
|
|
|
* safe for them anyway.
|
|
|
|
*/
|
|
|
|
if (period == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return div64_u64(runtime << 20, period);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
inline struct dl_bw *dl_bw_of(int i)
|
|
|
|
{
|
2015-06-19 05:50:02 +07:00
|
|
|
RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
|
|
|
|
"sched RCU must be held");
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
2013-11-07 20:43:45 +07:00
|
|
|
return &cpu_rq(i)->rd->dl_bw;
|
|
|
|
}
|
|
|
|
|
2013-12-19 17:54:45 +07:00
|
|
|
static inline int dl_bw_cpus(int i)
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
2013-11-07 20:43:45 +07:00
|
|
|
{
|
2013-12-19 17:54:45 +07:00
|
|
|
struct root_domain *rd = cpu_rq(i)->rd;
|
|
|
|
int cpus = 0;
|
|
|
|
|
2015-06-19 05:50:02 +07:00
|
|
|
RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
|
|
|
|
"sched RCU must be held");
|
2013-12-19 17:54:45 +07:00
|
|
|
for_each_cpu_and(i, rd->span, cpu_active_mask)
|
|
|
|
cpus++;
|
|
|
|
|
|
|
|
return cpus;
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
2013-11-07 20:43:45 +07:00
|
|
|
}
|
|
|
|
#else
|
|
|
|
inline struct dl_bw *dl_bw_of(int i)
|
|
|
|
{
|
|
|
|
return &cpu_rq(i)->dl.dl_bw;
|
|
|
|
}
|
|
|
|
|
2013-12-19 17:54:45 +07:00
|
|
|
static inline int dl_bw_cpus(int i)
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the
available CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls with similar names, equivalent meaning and
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded on each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and -deadline tasks
stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We must be sure that accepting a new task (or allowing changing the
|
|
|
|
* parameters of an existing one) is consistent with the bandwidth
|
|
|
|
* constraints. If yes, this function also accordingly updates the currently
|
|
|
|
* allocated bandwidth to reflect the new situation.
|
|
|
|
*
|
|
|
|
* This function is called while holding p's rq->lock.
|
2015-01-28 21:08:03 +07:00
|
|
|
*
|
|
|
|
* XXX we should delay bw change until the task's 0-lag point, see
|
|
|
|
* __setparam_dl().
|
2013-11-07 20:43:45 +07:00
|
|
|
*/
|
|
|
|
static int dl_overflow(struct task_struct *p, int policy,
|
|
|
|
const struct sched_attr *attr)
|
|
|
|
{
|
|
|
|
|
|
|
|
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
|
sched/deadline: Fix overflow to handle period==0 and deadline!=0
While debugging the crash with the bad nr_running accounting, I hit
another bug where, after running my sched deadline test, I was getting
failures to take a CPU offline. It was giving me a -EBUSY error.
Adding a bunch of trace_printk()s around, I found that the cpu
notifier that called sched_cpu_inactive() was returning a failure. The
overflow value was coming up negative?
Talking this over with Juri, the problem is that the total_bw update was
supposed to be made by dl_overflow() which, during my tests, seemed to
not be called. Adding more trace_printk()s, it wasn't that it wasn't
called, but it exited out right away with the check of new_bw being
equal to p->dl.dl_bw. The new_bw calculates the ratio between period and
runtime. The bug is that if you set a deadline, you do not need to set
a period if you plan on the period being equal to the deadline. That
is, if period is zero and deadline is not, then the system call should
set the period to be equal to the deadline. This is done elsewhere in
the code.
The fix is easy, check if period is set, and if it is not, then use the
deadline.
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-20 01:53:35 +07:00
|
|
|
u64 period = attr->sched_period ?: attr->sched_deadline;
|
2013-11-07 20:43:45 +07:00
|
|
|
u64 runtime = attr->sched_runtime;
|
|
|
|
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
|
2013-12-19 17:54:45 +07:00
|
|
|
int cpus, err = -1;
|
2013-11-07 20:43:45 +07:00
|
|
|
|
sched/deadline: Fix a bug in dl_overflow()
I got a negative (very large) dl_b->total_bw during my deadline tests.
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -222297900
Something unusual must have happened.
After some digging, I finally noticed that when changing a deadline
task to normal (cfs) and changing it back to deadline immediately,
after it died, we will get the wrong dl_bw->total_bw.
The root cause is in dl_overflow(), it has:
if (new_bw == p->dl.dl_bw)
return 0;
1) When a deadline task is changed to a !deadline task, it starts the
dl timer in switched_from_dl() and retains its previous deadline
parameters until the timer expires.
2) If we change it back to deadline with the same bandwidth parameter
before the timer expires, it still carries the old bandwidth although
it is not a deadline task, so dl_overflow() simply returns success
without updating the accounting, leaving dl_bw->total_bw wrong.
The solution is simple: if @p is not deadline, don't return early.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460636368-1993-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-14 19:19:28 +07:00
|
|
|
/* !deadline task may carry old deadline bandwidth */
|
|
|
|
if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
|
2013-11-07 20:43:45 +07:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Whether a task enters, leaves, or stays -deadline but changes
|
|
|
|
* its parameters, we may need to update accordingly the total
|
|
|
|
* allocated bandwidth of the container.
|
|
|
|
*/
|
|
|
|
raw_spin_lock(&dl_b->lock);
|
2013-12-19 17:54:45 +07:00
|
|
|
cpus = dl_bw_cpus(task_cpu(p));
|
2013-11-07 20:43:45 +07:00
|
|
|
if (dl_policy(policy) && !task_has_dl_policy(p) &&
|
|
|
|
!__dl_overflow(dl_b, cpus, 0, new_bw)) {
|
|
|
|
__dl_add(dl_b, new_bw);
|
|
|
|
err = 0;
|
|
|
|
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
|
|
|
|
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
|
|
|
|
__dl_clear(dl_b, p->dl.dl_bw);
|
|
|
|
__dl_add(dl_b, new_bw);
|
|
|
|
err = 0;
|
|
|
|
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
|
|
|
|
__dl_clear(dl_b, p->dl.dl_bw);
|
|
|
|
err = 0;
|
|
|
|
}
|
|
|
|
raw_spin_unlock(&dl_b->lock);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
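The admission test that dl_overflow() applies can be sketched in user space as follows. This is a minimal sketch under stated assumptions, not the kernel's code: `struct dl_bw_sketch`, `bw_ratio()`, `effective_period()` and `bw_would_overflow()` are hypothetical names, and the 20-bit fixed-point shift is an assumption, chosen because a runtime/period of 950000/1000000 then yields 996147, the `.dl_bw->bw` value quoted in the sched_debug output above. A negative cap is assumed to mean that admission control is disabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed fixed-point precision: a bandwidth of 1.0 is stored as 1 << 20. */
#define BW_SHIFT_SKETCH 20

/* Hypothetical stand-in for the per-root_domain bandwidth bookkeeping. */
struct dl_bw_sketch {
	int64_t bw;        /* per-CPU cap; negative means no admission control */
	uint64_t total_bw; /* sum of the bandwidths already admitted */
};

/* runtime / period as a fixed-point fraction; a zero period yields 0. */
static uint64_t bw_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT_SKETCH) / period;
}

/* The period falls back to the deadline when it is left at zero. */
static uint64_t effective_period(uint64_t sched_period, uint64_t sched_deadline)
{
	return sched_period ? sched_period : sched_deadline;
}

/*
 * True if replacing old_bw by new_bw would push the total above
 * cpus * cap, i.e. above M * (runtime / period) for M CPUs.
 */
static bool bw_would_overflow(const struct dl_bw_sketch *b, int cpus,
			      uint64_t old_bw, uint64_t new_bw)
{
	if (b->bw < 0)
		return false;
	return (uint64_t)b->bw * (uint64_t)cpus < b->total_bw - old_bw + new_bw;
}
```

With a 95% cap on two CPUs, three tasks at 50% each fit (150% is below 190%), a fourth does not, and changing an admitted task to the same bandwidth passes because its old share is subtracted before the new one is added.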
|
|
|
|
|
|
|
|
extern void init_dl_bw(struct dl_bw *dl_b);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* wake_up_new_task - wake up a newly created task for the first time.
|
|
|
|
*
|
|
|
|
* This function will do some initial scheduler statistics housekeeping
|
|
|
|
* that must be done for every newly created context, then puts the task
|
|
|
|
* on the runqueue and wakes it.
|
|
|
|
*/
|
2011-05-11 23:18:05 +07:00
|
|
|
void wake_up_new_task(struct task_struct *p)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
2007-07-09 23:51:59 +07:00
|
|
|
struct rq *rq;
|
2010-01-22 03:04:57 +07:00
|
|
|
|
2015-08-01 02:28:18 +07:00
|
|
|
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
|
2016-06-16 18:29:28 +07:00
|
|
|
p->state = TASK_RUNNING;
|
2010-01-22 03:04:57 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
/*
|
|
|
|
* Fork balancing, do it here and not earlier because:
|
|
|
|
* - cpus_allowed can change in the fork path
|
|
|
|
* - any previously selected cpu might disappear through hotplug
|
2016-06-16 23:51:48 +07:00
|
|
|
*
|
|
|
|
* Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
|
|
|
|
* as we're not fully set-up yet.
|
2010-01-22 03:04:57 +07:00
|
|
|
*/
|
2016-06-16 23:51:48 +07:00
|
|
|
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
|
2010-03-25 00:34:10 +07:00
|
|
|
#endif
|
sched/fair: Fix post_init_entity_util_avg() serialization
Chris Wilson reported a divide by 0 at:
post_init_entity_util_avg():
> 725 if (cfs_rq->avg.util_avg != 0) {
> 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
> -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1);
> 728
> 729 if (sa->util_avg > cap)
> 730 sa->util_avg = cap;
> 731 } else {
Which given the lack of serialization, and the code generated from
update_cfs_rq_load_avg() is entirely possible:
if (atomic_long_read(&cfs_rq->removed_load_avg)) {
s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
sa->load_avg = max_t(long, sa->load_avg - r, 0);
sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
removed_load = 1;
}
turns into:
ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
ffffffff8108706b: 48 85 c0 test %rax,%rax
ffffffff8108706e: 74 40 je ffffffff810870b0
ffffffff81087070: 4c 89 f8 mov %r15,%rax
ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
Which you'll note ends up with 'sa->load_avg - r' in memory at
ffffffff8108707a.
By calling post_init_entity_util_avg() under rq->lock we're sure to be
fully serialized against PELT updates and cannot observe intermediate
state like this.
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-09 20:07:50 +07:00
|
|
|
rq = __task_rq_lock(p, &rf);
|
2016-03-30 03:30:56 +07:00
|
|
|
post_init_entity_util_avg(&p->se);
|
2010-03-25 00:34:10 +07:00
|
|
|
|
2009-11-27 23:32:46 +07:00
|
|
|
activate_task(rq, p, 0);
|
2014-08-20 16:47:32 +07:00
|
|
|
p->on_rq = TASK_ON_RQ_QUEUED;
|
2015-06-09 16:13:36 +07:00
|
|
|
trace_sched_wakeup_new(p);
|
2009-09-15 01:02:34 +07:00
|
|
|
check_preempt_curr(rq, p, WF_FORK);
|
2008-01-26 03:08:22 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2015-10-23 16:50:08 +07:00
|
|
|
if (p->sched_class->task_woken) {
|
|
|
|
/*
|
|
|
|
* Nothing relies on rq->lock after this, so it's fine to
|
|
|
|
* drop it.
|
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, rf.cookie);
|
2009-12-17 00:04:40 +07:00
|
|
|
p->sched_class->task_woken(rq, p);
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_repin_lock(&rq->lock, rf.cookie);
|
2015-10-23 16:50:08 +07:00
|
|
|
}
|
2008-01-26 03:08:22 +07:00
|
|
|
#endif
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
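The divide by 0 described in the post_init_entity_util_avg() commit message can be illustrated with the arithmetic it quotes. The sketch below is an assumption-laden illustration: `initial_util_sketch()` is a hypothetical helper, plain `long` values stand in for the kernel's fields, and only the branch shown in the quoted snippet is covered. The guard exists purely to make the failure mode visible. The actual fix, per the commit message, is to call post_init_entity_util_avg() under rq->lock so that the intermediate value is never observed.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the quoted arithmetic:
 *   util = cfs_util * weight / (cfs_load + 1), clamped to cap.
 * Returns false instead of dividing when the divisor would be zero,
 * which happens if an unserialized reader sees a transient cfs_load of -1.
 */
static bool initial_util_sketch(long cfs_util, long weight, long cfs_load,
				long cap, long *util)
{
	long divisor = cfs_load + 1;

	if (divisor == 0)
		return false;

	*util = cfs_util * weight / divisor;
	if (*util > cap)
		*util = cap;
	return true;
}
```

A reader that takes the lock first only ever sees the clamped, non-negative load, so the divisor is at least 1.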
|
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
#ifdef CONFIG_PREEMPT_NOTIFIERS
|
|
|
|
|
sched/preempt: Add static_key() to preempt_notifiers
Avoid touching the curr->preempt_notifier cacheline when not needed.
Provides a small improvement on pipe-bench:
taskset 01 perf stat --repeat 10 -- perf bench sched pipe
before:
Performance counter stats for 'perf bench sched pipe' (10 runs):
12385.016204 task-clock (msec) # 1.001 CPUs utilized ( +- 0.34% )
2,000,023 context-switches # 0.161 M/sec ( +- 0.00% )
0 cpu-migrations # 0.000 K/sec
175 page-faults # 0.014 K/sec ( +- 0.26% )
41,376,162,250 cycles # 3.341 GHz ( +- 0.11% )
17,389,139,321 stalled-cycles-frontend # 42.03% frontend cycles idle ( +- 0.25% )
<not supported> stalled-cycles-backend
68,788,588,003 instructions # 1.66 insns per cycle
# 0.25 stalled cycles per insn ( +- 0.02% )
13,449,387,620 branches # 1085.940 M/sec ( +- 0.02% )
20,880,690 branch-misses # 0.16% of all branches ( +- 0.98% )
12.372646094 seconds time elapsed ( +- 0.34% )
after:
Performance counter stats for 'perf bench sched pipe' (10 runs):
12180.936528 task-clock (msec) # 1.001 CPUs utilized ( +- 0.33% )
2,000,077 context-switches # 0.164 M/sec ( +- 0.00% )
0 cpu-migrations # 0.000 K/sec
174 page-faults # 0.014 K/sec ( +- 0.27% )
40,691,545,577 cycles # 3.341 GHz ( +- 0.06% )
16,446,333,371 stalled-cycles-frontend # 40.42% frontend cycles idle ( +- 0.18% )
<not supported> stalled-cycles-backend
68,570,100,387 instructions # 1.69 insns per cycle
# 0.24 stalled cycles per insn ( +- 0.01% )
13,389,740,014 branches # 1099.237 M/sec ( +- 0.01% )
20,175,440 branch-misses # 0.15% of all branches ( +- 0.52% )
12.169253010 seconds time elapsed ( +- 0.33% )
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-08 21:00:30 +07:00
|
|
|
static struct static_key preempt_notifier_key = STATIC_KEY_INIT_FALSE;
|
|
|
|
|
2015-07-03 23:53:58 +07:00
|
|
|
void preempt_notifier_inc(void)
|
|
|
|
{
|
|
|
|
static_key_slow_inc(&preempt_notifier_key);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(preempt_notifier_inc);
|
|
|
|
|
|
|
|
void preempt_notifier_dec(void)
|
|
|
|
{
|
|
|
|
static_key_slow_dec(&preempt_notifier_key);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(preempt_notifier_dec);
|
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
/**
|
2009-03-17 02:58:09 +07:00
|
|
|
* preempt_notifier_register - tell me when current is being preempted & rescheduled
|
2007-07-31 14:37:50 +07:00
|
|
|
* @notifier: notifier struct to register
|
2007-07-26 18:40:43 +07:00
|
|
|
*/
|
|
|
|
void preempt_notifier_register(struct preempt_notifier *notifier)
|
|
|
|
{
|
2015-07-03 23:53:58 +07:00
|
|
|
if (!static_key_false(&preempt_notifier_key))
|
|
|
|
WARN(1, "registering preempt_notifier while notifiers disabled\n");
|
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
hlist_add_head(&notifier->link, &current->preempt_notifiers);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(preempt_notifier_register);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* preempt_notifier_unregister - no longer interested in preemption notifications
|
2007-07-31 14:37:50 +07:00
|
|
|
* @notifier: notifier struct to unregister
|
2007-07-26 18:40:43 +07:00
|
|
|
*
|
2015-05-17 23:53:10 +07:00
|
|
|
* This is *not* safe to call from within a preemption notifier.
|
2007-07-26 18:40:43 +07:00
|
|
|
*/
|
|
|
|
void preempt_notifier_unregister(struct preempt_notifier *notifier)
|
|
|
|
{
|
|
|
|
hlist_del(&notifier->link);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
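The interplay between the inc/dec helpers and registration can be sketched in user space. This is a rough model only: every name ending in `_sketch` is hypothetical, a plain counter stands in for the static key, a single global list stands in for the per-task notifier list, and the early return in `fire_sched_in_sketch()` is an assumption about how the key gates the slow path, based on the commit's stated goal of not touching the notifier list when it is not needed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
struct notifier_sketch {
	struct notifier_sketch *next;
	void (*sched_in)(struct notifier_sketch *self);
};

static int users_sketch;                    /* plays the role of the static key */
static struct notifier_sketch *head_sketch; /* per-task list in the real code */
static int calls_sketch;                    /* bumped by the example callback */

static void notifier_inc_sketch(void) { users_sketch++; }
static void notifier_dec_sketch(void) { users_sketch--; }

/* Example callback that only counts how often it ran. */
static void count_sched_in(struct notifier_sketch *self)
{
	(void)self;
	calls_sketch++;
}

/* Adds n to the list; returns false where the kernel code would WARN. */
static bool notifier_register_sketch(struct notifier_sketch *n)
{
	bool enabled = users_sketch > 0;

	n->next = head_sketch;
	head_sketch = n;
	return enabled;
}

/* Skips the list walk while the key is off; returns how many callbacks ran. */
static int fire_sched_in_sketch(void)
{
	int fired = 0;

	if (users_sketch == 0)
		return 0;
	for (struct notifier_sketch *n = head_sketch; n; n = n->next) {
		n->sched_in(n);
		fired++;
	}
	return fired;
}
```

The intended calling order is inc, then register, then dec once the user is done. Registering before the inc is the case the WARN in preempt_notifier_register() is meant to catch.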
|
|
|
|
|
2015-06-08 21:00:30 +07:00
|
|
|
static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
|
2007-07-26 18:40:43 +07:00
|
|
|
{
|
|
|
|
struct preempt_notifier *notifier;
|
|
|
|
|
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28 08:06:00 +07:00
|
|
|
hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
|
2007-07-26 18:40:43 +07:00
|
|
|
notifier->ops->sched_in(notifier, raw_smp_processor_id());
|
|
|
|
}
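The three-argument iterator form used above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel's implementation: every `demo_*` name is hypothetical, and `__typeof__` is assumed to be available (a GCC/Clang extension). It only shows why the entry pointer alone is enough as a cursor, so no separate node variable is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's hlist types (hypothetical names). */
struct demo_hnode { struct demo_hnode *next; };
struct demo_hhead { struct demo_hnode *first; };

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a node pointer to its enclosing entry, or NULL at the end of the list. */
#define demo_entry_safe(ptr, type, member) \
	((ptr) ? demo_container_of(ptr, type, member) : NULL)

/* Three-argument iterator: the entry pointer itself is the only cursor. */
#define demo_for_each_entry(pos, head, member)                                 \
	for (pos = demo_entry_safe((head)->first, __typeof__(*(pos)), member); \
	     pos;                                                              \
	     pos = demo_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

struct demo_notifier {
	int id;
	struct demo_hnode link;
};

/* Push a node onto the front of the list. */
static void demo_add_head(struct demo_hnode *n, struct demo_hhead *h)
{
	assert(n && h);
	n->next = h->first;
	h->first = n;
}

/* Walk the list and sum the ids of all entries. */
static int demo_sum_ids(struct demo_hhead *head)
{
	struct demo_notifier *n;
	int sum = 0;

	demo_for_each_entry(n, head, link)
		sum += n->id;
	return sum;
}
```

The loop body in `demo_sum_ids()` has the same shape as the notifier walk above: declare one entry pointer, name the head and the link member, and use the entry directly.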
|
|
|
|
|
sched/preempt: Add static_key() to preempt_notifiers
Avoid touching the curr->preempt_notifier cacheline when not needed.
Provides a small improvement on pipe-bench:
taskset 01 perf stat --repeat 10 -- perf bench sched pipe
before:
Performance counter stats for 'perf bench sched pipe' (10 runs):
12385.016204 task-clock (msec) # 1.001 CPUs utilized ( +- 0.34% )
2,000,023 context-switches # 0.161 M/sec ( +- 0.00% )
0 cpu-migrations # 0.000 K/sec
175 page-faults # 0.014 K/sec ( +- 0.26% )
41,376,162,250 cycles # 3.341 GHz ( +- 0.11% )
17,389,139,321 stalled-cycles-frontend # 42.03% frontend cycles idle ( +- 0.25% )
<not supported> stalled-cycles-backend
68,788,588,003 instructions # 1.66 insns per cycle
# 0.25 stalled cycles per insn ( +- 0.02% )
13,449,387,620 branches # 1085.940 M/sec ( +- 0.02% )
20,880,690 branch-misses # 0.16% of all branches ( +- 0.98% )
12.372646094 seconds time elapsed ( +- 0.34% )
after:
Performance counter stats for 'perf bench sched pipe' (10 runs):
12180.936528 task-clock (msec) # 1.001 CPUs utilized ( +- 0.33% )
2,000,077 context-switches # 0.164 M/sec ( +- 0.00% )
0 cpu-migrations # 0.000 K/sec
174 page-faults # 0.014 K/sec ( +- 0.27% )
40,691,545,577 cycles # 3.341 GHz ( +- 0.06% )
16,446,333,371 stalled-cycles-frontend # 40.42% frontend cycles idle ( +- 0.18% )
<not supported> stalled-cycles-backend
68,570,100,387 instructions # 1.69 insns per cycle
# 0.24 stalled cycles per insn ( +- 0.01% )
13,389,740,014 branches # 1099.237 M/sec ( +- 0.01% )
20,175,440 branch-misses # 0.15% of all branches ( +- 0.52% )
12.169253010 seconds time elapsed ( +- 0.33% )
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-08 21:00:30 +07:00
|
|
|
static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
|
|
|
|
{
|
|
|
|
if (static_key_false(&preempt_notifier_key))
|
|
|
|
__fire_sched_in_preempt_notifiers(curr);
|
|
|
|
}
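The wrapper above keeps the common case cheap by checking a key before touching any per-task notifier data. The sketch below approximates that idea in userspace with a plain counter; it is an illustration under assumptions, since a real static key patches the branch in the instruction stream rather than loading a variable. All `demo_*` names are hypothetical, and `__builtin_expect` is assumed to be available (GCC/Clang).

```c
#include <assert.h>

static int demo_key_count;   /* > 0 while at least one notifier user exists */
static int demo_slow_calls;  /* how often the slow path actually ran */

/* Hypothetical enable/disable helpers; users call these around registration. */
static void demo_key_inc(void)
{
	demo_key_count++;
}

static void demo_key_dec(void)
{
	assert(demo_key_count > 0);
	demo_key_count--;
}

/* Out-of-line slow path: stands in for walking the notifier list. */
static void demo_fire_slow(void)
{
	demo_slow_calls++;
}

/* Inline fast path: skip the slow path entirely while the key is off. */
static inline void demo_fire(void)
{
	if (__builtin_expect(demo_key_count > 0, 0))
		demo_fire_slow();
}
```

While no user has enabled the key, `demo_fire()` never calls the slow path, which mirrors the goal of not touching the notifier cacheline when there is nothing to notify.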
|
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
static void
|
2015-06-08 21:00:30 +07:00
|
|
|
__fire_sched_out_preempt_notifiers(struct task_struct *curr,
|
|
|
|
struct task_struct *next)
|
2007-07-26 18:40:43 +07:00
|
|
|
{
|
|
|
|
struct preempt_notifier *notifier;
|
|
|
|
|
2013-02-28 08:06:00 +07:00
|
|
|
hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
|
2007-07-26 18:40:43 +07:00
|
|
|
notifier->ops->sched_out(notifier, next);
|
|
|
|
}
|
|
|
|
|
2015-06-08 21:00:30 +07:00
|
|
|
static __always_inline void
|
|
|
|
fire_sched_out_preempt_notifiers(struct task_struct *curr,
|
|
|
|
struct task_struct *next)
|
|
|
|
{
|
|
|
|
if (static_key_false(&preempt_notifier_key))
|
|
|
|
__fire_sched_out_preempt_notifiers(curr, next);
|
|
|
|
}
|
|
|
|
|
2008-05-30 19:23:45 +07:00
|
|
|
#else /* !CONFIG_PREEMPT_NOTIFIERS */
|
2007-07-26 18:40:43 +07:00
|
|
|
|
2015-06-08 21:00:30 +07:00
|
|
|
static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
|
2007-07-26 18:40:43 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2015-06-08 21:00:30 +07:00
|
|
|
static inline void
|
2007-07-26 18:40:43 +07:00
|
|
|
fire_sched_out_preempt_notifiers(struct task_struct *curr,
|
|
|
|
struct task_struct *next)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_PREEMPT_NOTIFIERS */
|
2007-07-26 18:40:43 +07:00
|
|
|
|
2005-06-26 04:57:23 +07:00
|
|
|
/**
|
|
|
|
* prepare_task_switch - prepare to switch tasks
|
|
|
|
* @rq: the runqueue preparing to switch
|
2007-07-31 14:37:50 +07:00
|
|
|
* @prev: the current task that is being switched out
|
2005-06-26 04:57:23 +07:00
|
|
|
* @next: the task we are going to switch to.
|
|
|
|
*
|
|
|
|
* This is called with the rq lock held and interrupts off. It must
|
|
|
|
* be paired with a subsequent finish_task_switch after the context
|
|
|
|
* switch.
|
|
|
|
*
|
|
|
|
* prepare_task_switch sets up locking and calls architecture specific
|
|
|
|
* hooks.
|
|
|
|
*/
|
2007-07-26 18:40:43 +07:00
|
|
|
static inline void
|
|
|
|
prepare_task_switch(struct rq *rq, struct task_struct *prev,
|
|
|
|
struct task_struct *next)
|
2005-06-26 04:57:23 +07:00
|
|
|
{
|
2013-09-22 21:20:54 +07:00
|
|
|
sched_info_switch(rq, prev, next);
|
2011-02-02 19:19:09 +07:00
|
|
|
perf_event_task_sched_out(prev, next);
|
2007-07-26 18:40:43 +07:00
|
|
|
fire_sched_out_preempt_notifiers(prev, next);
|
2005-06-26 04:57:23 +07:00
|
|
|
prepare_lock_switch(rq, next);
|
|
|
|
prepare_arch_switch(next);
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/**
|
|
|
|
* finish_task_switch - clean up after a task-switch
|
|
|
|
* @prev: the thread we just switched away from.
|
|
|
|
*
|
2005-06-26 04:57:23 +07:00
|
|
|
* finish_task_switch must be called after the context switch, paired
|
|
|
|
* with a prepare_task_switch call before the context switch.
|
|
|
|
* finish_task_switch will reconcile locking set up by prepare_task_switch,
|
|
|
|
* and do any other architecture-specific cleanup actions.
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Note that we may have delayed dropping an mm in context_switch(). If
|
2007-12-05 21:46:09 +07:00
|
|
|
* so, we finish that here outside of the runqueue lock. (Doing it
|
2005-04-17 05:20:36 +07:00
|
|
|
* with the lock held can cause deadlocks; see schedule() for
|
|
|
|
* details.)
|
2014-10-10 02:32:32 +07:00
|
|
|
*
|
|
|
|
* The context switch has flipped the stack from under us and restored the
|
|
|
|
* local variables which were saved when this task called schedule() in the
|
|
|
|
* past. prev == current is still correct but we need to recalculate this_rq
|
|
|
|
* because prev may have moved to another CPU.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2014-10-10 02:32:32 +07:00
|
|
|
static struct rq *finish_task_switch(struct task_struct *prev)
|
2005-04-17 05:20:36 +07:00
|
|
|
__releases(rq->lock)
|
|
|
|
{
|
2014-10-10 02:32:32 +07:00
|
|
|
struct rq *rq = this_rq();
|
2005-04-17 05:20:36 +07:00
|
|
|
struct mm_struct *mm = rq->prev_mm;
|
2006-09-29 16:01:10 +07:00
|
|
|
long prev_state;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-09-28 22:52:18 +07:00
|
|
|
/*
|
|
|
|
* The previous task will have left us with a preempt_count of 2
|
|
|
|
* because it left us after:
|
|
|
|
*
|
|
|
|
* schedule()
|
|
|
|
* preempt_disable(); // 1
|
|
|
|
* __schedule()
|
|
|
|
* raw_spin_lock_irq(&rq->lock) // 2
|
|
|
|
*
|
|
|
|
* Also, see FORK_PREEMPT_COUNT.
|
|
|
|
*/
|
2015-09-29 17:18:46 +07:00
|
|
|
if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
|
|
|
|
"corrupted preempt_count: %s/%d/0x%x\n",
|
|
|
|
current->comm, current->pid, preempt_count()))
|
|
|
|
preempt_count_set(FORK_PREEMPT_COUNT);
|
2015-09-28 22:52:18 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
rq->prev_mm = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A task struct has one reference for the use as "current".
|
2006-09-29 16:01:11 +07:00
|
|
|
* If a task dies, then it sets TASK_DEAD in tsk->state and calls
|
2006-09-29 16:01:10 +07:00
|
|
|
* schedule one last time. The schedule call will never return, and
|
|
|
|
* the scheduled task must drop that reference.
|
2015-09-29 19:45:09 +07:00
|
|
|
*
|
|
|
|
* We must observe prev->state before clearing prev->on_cpu (in
|
|
|
|
* finish_lock_switch), otherwise a concurrent wakeup can get prev
|
|
|
|
* running on another CPU and we could race with its RUNNING -> DEAD
|
|
|
|
* transition, resulting in a double drop.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2006-09-29 16:01:10 +07:00
|
|
|
prev_state = prev->state;
|
2012-09-08 20:23:11 +07:00
|
|
|
vtime_task_switch(prev);
|
perf events: Fix slow and broken cgroup context switch code
The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:
$ ./pong
Both processes pinned to CPU1, running for 10s
10684.51 ctxsw/s
Now start a cgroup perf stat:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
$ ./pong
Both processes pinned to CPU1, running for 10s
6674.61 ctxsw/s
That's a 37% penalty.
Note that pong is not even in the monitored cgroup.
The results shown by perf stat are bogus:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
Performance counter stats for 'sleep 100':
CPU1 <not counted> cycles test
CPU1 16,984,189,138 cycles # 0.000 GHz
The second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.
The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.
With this patch the same test now yields:
$ ./pong
Both processes pinned to CPU1, running for 10s
10775.30 ctxsw/s
Start perf stat with cgroup:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Run pong outside the cgroup:
$ ./pong
Both processes pinned to CPU1, running for 10s
10687.80 ctxsw/s
The penalty is now less than 2%.
And the results for perf stat are correct:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 <not counted> cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
Now perf stat reports the correct counts for
the non cgroup event.
If we run pong inside the cgroup, then we also get the
correct counts:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 22,297,726,205 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
10.001457237 seconds time elapsed
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-25 20:58:03 +07:00
|
|
|
perf_event_task_sched_in(prev, current);
|
2005-06-26 04:57:23 +07:00
|
|
|
finish_lock_switch(rq, prev);
|
2011-11-28 04:43:10 +07:00
|
|
|
finish_arch_post_lock_switch();
|
2008-01-26 03:08:05 +07:00
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
fire_sched_in_preempt_notifiers(current);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (mm)
|
|
|
|
mmdrop(mm);
|
2006-09-29 16:01:11 +07:00
|
|
|
if (unlikely(prev_state == TASK_DEAD)) {
|
2013-11-07 20:43:35 +07:00
|
|
|
if (prev->sched_class->task_dead)
|
|
|
|
prev->sched_class->task_dead(prev);
|
|
|
|
|
2006-03-26 16:38:20 +07:00
|
|
|
/*
|
|
|
|
* Remove function-return probe instances associated with this
|
|
|
|
* task and put them back on the free list.
|
2007-07-09 23:52:00 +07:00
|
|
|
*/
|
2006-03-26 16:38:20 +07:00
|
|
|
kprobe_flush_task(prev);
|
2005-04-17 05:20:36 +07:00
|
|
|
put_task_struct(prev);
|
2006-03-26 16:38:20 +07:00
|
|
|
}
|
2013-04-20 22:11:50 +07:00
|
|
|
|
2015-06-11 23:07:12 +07:00
|
|
|
tick_nohz_task_switch();
|
2014-10-10 02:32:32 +07:00
|
|
|
return rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
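The preempt_count bookkeeping described in the comment inside finish_task_switch() can be followed with a toy counter. This is a userspace model under simplifying assumptions, not kernel code: one nesting level is worth 1, the fork-time value is taken to be exactly two levels, and all `demo_*` names are hypothetical.

```c
#include <assert.h>

#define DEMO_DISABLE_OFFSET 1                   /* one nesting level */
#define DEMO_FORK_COUNT (2 * DEMO_DISABLE_OFFSET)

static int demo_preempt_count;                  /* toy nesting counter */

/* Model of what the outgoing task did before it switched away. */
static void demo_schedule_entry(void)
{
	demo_preempt_count += DEMO_DISABLE_OFFSET;  /* preempt_disable()  -> 1 */
	demo_preempt_count += DEMO_DISABLE_OFFSET;  /* taking the rq lock -> 2 */
}

/*
 * Model of the check on the incoming side. Returns 1 if the count was
 * unexpected and had to be reset, 0 otherwise. Dropping the lock then
 * lowers the count by one level.
 */
static int demo_finish_switch(void)
{
	int repaired = 0;

	if (demo_preempt_count != 2 * DEMO_DISABLE_OFFSET) {
		demo_preempt_count = DEMO_FORK_COUNT;
		repaired = 1;
	}
	demo_preempt_count -= DEMO_DISABLE_OFFSET;  /* rq lock released -> 1 */
	return repaired;
}

/* The final enable brings the count back to zero. */
static void demo_preempt_enable(void)
{
	assert(demo_preempt_count > 0);
	demo_preempt_count -= DEMO_DISABLE_OFFSET;
}
```

In this model a normal switch enters with a count of 2, leaves the finish step at 1, and reaches 0 after the final enable; any other entry value is treated as corruption and reset first.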
|
|
|
|
|
2009-07-29 22:08:47 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
|
|
|
|
/* rq->lock is NOT held, but preemption is disabled */
|
2015-06-11 19:46:37 +07:00
|
|
|
static void __balance_callback(struct rq *rq)
|
2009-07-29 22:08:47 +07:00
|
|
|
{
|
2015-06-11 19:46:37 +07:00
|
|
|
struct callback_head *head, *next;
|
|
|
|
void (*func)(struct rq *rq);
|
|
|
|
unsigned long flags;
|
2009-07-29 22:08:47 +07:00
|
|
|
|
2015-06-11 19:46:37 +07:00
|
|
|
raw_spin_lock_irqsave(&rq->lock, flags);
|
|
|
|
head = rq->balance_callback;
|
|
|
|
rq->balance_callback = NULL;
|
|
|
|
while (head) {
|
|
|
|
func = (void (*)(struct rq *))head->func;
|
|
|
|
next = head->next;
|
|
|
|
head->next = NULL;
|
|
|
|
head = next;
|
2009-07-29 22:08:47 +07:00
|
|
|
|
2015-06-11 19:46:37 +07:00
|
|
|
func(rq);
|
2009-07-29 22:08:47 +07:00
|
|
|
}
|
2015-06-11 19:46:37 +07:00
|
|
|
raw_spin_unlock_irqrestore(&rq->lock, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void balance_callback(struct rq *rq)
|
|
|
|
{
|
|
|
|
if (unlikely(rq->balance_callback))
|
|
|
|
__balance_callback(rq);
|
2009-07-29 22:08:47 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
2009-07-29 11:21:22 +07:00
|
|
|
|
2015-06-11 19:46:37 +07:00
|
|
|
static inline void balance_callback(struct rq *rq)
|
2009-07-29 22:08:47 +07:00
|
|
|
{
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2009-07-29 22:08:47 +07:00
|
|
|
#endif
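The detach-then-run pattern in __balance_callback() can be sketched in isolation. The version below is a single-threaded userspace illustration with hypothetical `demo_*` names; it omits the rq->lock handling entirely and uses a typed function pointer instead of the cast seen above.

```c
#include <assert.h>
#include <stddef.h>

struct demo_cb_rq;

struct demo_cb {
	struct demo_cb *next;
	void (*func)(struct demo_cb_rq *rq);
};

struct demo_cb_rq {
	struct demo_cb *balance_callback;  /* pending callbacks, or NULL */
	int work_done;                     /* counts callback invocations */
};

/* Push a callback onto the pending list. */
static void demo_queue_cb(struct demo_cb_rq *rq, struct demo_cb *cb,
			  void (*func)(struct demo_cb_rq *rq))
{
	assert(func != NULL);
	cb->func = func;
	cb->next = rq->balance_callback;
	rq->balance_callback = cb;
}

/* Detach the whole list first, then run each callback exactly once. */
static void demo_run_callbacks(struct demo_cb_rq *rq)
{
	struct demo_cb *head = rq->balance_callback;
	struct demo_cb *next;

	rq->balance_callback = NULL;
	while (head) {
		void (*func)(struct demo_cb_rq *rq) = head->func;

		next = head->next;
		head->next = NULL;   /* node is reusable before func runs */
		head = next;

		func(rq);
	}
}

/* Example callback: just record that it ran. */
static void demo_count(struct demo_cb_rq *rq)
{
	rq->work_done++;
}
```

Clearing the list head and each node's `next` before invoking the callback means a callback that queues more work adds it to a fresh list instead of the one being walked.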
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/**
|
|
|
|
* schedule_tail - first thing a freshly forked thread must call.
|
|
|
|
* @prev: the thread we just switched away from.
|
|
|
|
*/
|
2014-05-02 05:44:38 +07:00
|
|
|
asmlinkage __visible void schedule_tail(struct task_struct *prev)
|
2005-04-17 05:20:36 +07:00
|
|
|
__releases(rq->lock)
|
|
|
|
{
|
2014-10-09 02:36:44 +07:00
|
|
|
struct rq *rq;
|
2009-07-29 11:21:22 +07:00
|
|
|
|
2015-09-28 22:52:18 +07:00
|
|
|
/*
|
|
|
|
* New tasks start with FORK_PREEMPT_COUNT, see there and
|
|
|
|
* finish_task_switch() for details.
|
|
|
|
*
|
|
|
|
* finish_task_switch() will drop rq->lock and lower preempt_count
|
|
|
|
* and the preempt_enable() will end up enabling preemption (on
|
|
|
|
* PREEMPT_COUNT kernels).
|
|
|
|
*/
|
|
|
|
|
2014-10-10 02:32:32 +07:00
|
|
|
rq = finish_task_switch(prev);
|
2015-06-11 19:46:37 +07:00
|
|
|
balance_callback(rq);
|
2014-10-09 02:36:44 +07:00
|
|
|
preempt_enable();
|
2006-07-03 14:25:42 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
if (current->set_child_tid)
|
2007-10-19 13:40:14 +07:00
|
|
|
put_user(task_pid_vnr(current), current->set_child_tid);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2014-10-10 02:32:32 +07:00
|
|
|
* context_switch - switch to the new MM and the new thread's register state.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2016-02-29 11:22:39 +07:00
|
|
|
static __always_inline struct rq *
|
2006-07-03 14:25:42 +07:00
|
|
|
context_switch(struct rq *rq, struct task_struct *prev,
|
2015-08-02 00:25:08 +07:00
|
|
|
struct task_struct *next, struct pin_cookie cookie)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2007-07-09 23:51:59 +07:00
|
|
|
struct mm_struct *mm, *oldmm;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
prepare_task_switch(rq, prev, next);
|
2011-02-02 19:19:09 +07:00
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
mm = next->mm;
|
|
|
|
oldmm = prev->active_mm;
|
2007-02-13 19:26:21 +07:00
|
|
|
/*
|
|
|
|
* For paravirt, this is coupled with an exit in switch_to to
|
|
|
|
* combine the page table reload and the switch backend into
|
|
|
|
* one hypercall.
|
|
|
|
*/
|
2009-02-19 02:18:57 +07:00
|
|
|
arch_start_context_switch(prev);
|
2007-02-13 19:26:21 +07:00
|
|
|
|
2010-09-16 19:42:25 +07:00
|
|
|
if (!mm) {
|
2005-04-17 05:20:36 +07:00
|
|
|
next->active_mm = oldmm;
|
|
|
|
atomic_inc(&oldmm->mm_count);
|
|
|
|
enter_lazy_tlb(oldmm, next);
|
|
|
|
} else
|
2016-04-26 23:39:06 +07:00
|
|
|
switch_mm_irqs_off(oldmm, mm, next);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-09-16 19:42:25 +07:00
|
|
|
if (!prev->mm) {
|
2005-04-17 05:20:36 +07:00
|
|
|
prev->active_mm = NULL;
|
|
|
|
rq->prev_mm = oldmm;
|
|
|
|
}
|
2006-07-14 14:24:27 +07:00
|
|
|
/*
|
|
|
|
* The runqueue lock will be released by the next
|
|
|
|
* task (which is an invalid locking op but in the case
|
|
|
|
* of the scheduler it's an obvious special-case), so we
|
|
|
|
* do an early lockdep release here:
|
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, cookie);
|
2006-07-03 14:24:54 +07:00
|
|
|
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/* Here we just switch the register state and the stack. */
|
|
|
|
switch_to(prev, next, prev);
|
2007-07-09 23:51:59 +07:00
|
|
|
barrier();
|
2014-10-10 02:32:32 +07:00
|
|
|
|
|
|
|
return finish_task_switch(prev);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
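The mm handling in context_switch() pairs with the deferred mmdrop() in finish_task_switch(). The sketch below models only the reference counting, as a userspace toy under assumptions: all `demo_*` names are hypothetical, no address space is actually switched, and the counter is a plain int rather than an atomic.

```c
#include <assert.h>
#include <stddef.h>

struct demo_mm { int mm_count; };

struct demo_task {
	struct demo_mm *mm;         /* NULL for a kernel thread */
	struct demo_mm *active_mm;  /* the mm currently in use, possibly borrowed */
};

struct demo_mm_rq { struct demo_mm *prev_mm; };

/* Switch-time bookkeeping, following the structure of the code above. */
static void demo_switch_mm(struct demo_mm_rq *rq, struct demo_task *prev,
			   struct demo_task *next)
{
	struct demo_mm *mm = next->mm;
	struct demo_mm *oldmm = prev->active_mm;

	if (!mm) {
		/* Kernel thread: borrow the previous mm and pin it. */
		assert(oldmm != NULL);
		next->active_mm = oldmm;
		oldmm->mm_count++;
	}
	/* else: a real kernel would switch address spaces here. */

	if (!prev->mm) {
		/* prev was borrowing: hand the reference over for a later drop. */
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
}

/* After the switch: drop the deferred reference, if any. */
static void demo_finish_switch_mm(struct demo_mm_rq *rq)
{
	struct demo_mm *mm = rq->prev_mm;

	rq->prev_mm = NULL;
	if (mm)
		mm->mm_count--;
}
```

Switching from a user task to a kernel thread raises the count by one; switching back stashes the borrowed mm in `prev_mm`, and the finish step drops it, so the count returns to its starting value.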
|
|
|
|
|
|
|
|
/*
|
2013-02-20 16:14:38 +07:00
|
|
|
* nr_running and nr_context_switches:
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* externally visible scheduler statistics: current number of runnable
|
2013-02-20 16:14:38 +07:00
|
|
|
* threads, total number of context switches performed since bootup.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
unsigned long nr_running(void)
|
|
|
|
{
|
|
|
|
unsigned long i, sum = 0;
|
|
|
|
|
|
|
|
for_each_online_cpu(i)
|
|
|
|
sum += cpu_rq(i)->nr_running;
|
|
|
|
|
|
|
|
return sum;
|
2009-04-14 11:55:30 +07:00
|
|
|
}
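The summation in nr_running() can be pictured with a fixed array of per-CPU counters. This is a minimal userspace sketch with hypothetical `demo_*` names and an assumed CPU count of 4; like the function above, it takes no lock, so on a live system the result would be a snapshot that may already be stale when it is returned.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4

struct demo_cpu_rq {
	unsigned long nr_running;  /* runnable tasks on this CPU */
	bool online;               /* only online CPUs are counted */
};

static struct demo_cpu_rq demo_rqs[DEMO_NR_CPUS];

/* Test helper: set one CPU's state. */
static void demo_set_cpu(int cpu, unsigned long nr, bool online)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	demo_rqs[cpu].nr_running = nr;
	demo_rqs[cpu].online = online;
}

/* Sum the per-CPU counters over online CPUs, without any locking. */
static unsigned long demo_nr_running(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < DEMO_NR_CPUS; i++)
		if (demo_rqs[i].online)
			sum += demo_rqs[i].nr_running;

	return sum;
}
```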
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-08-01 00:29:48 +07:00
|
|
|
/*
|
|
|
|
* Check if only the current task is running on the cpu.
|
2015-09-18 16:27:45 +07:00
|
|
|
*
|
|
|
|
* Caution: this function does not check that the caller has disabled
|
|
|
|
* preemption, thus the result might have a time-of-check-to-time-of-use
|
|
|
|
* race. The caller is responsible for using it correctly, for example:
|
|
|
|
*
|
|
|
|
* - from a non-preemptable section (of course)
|
|
|
|
*
|
|
|
|
* - from a thread that is bound to a single CPU
|
|
|
|
*
|
|
|
|
* - in a loop with very short iterations (e.g. a polling loop)
|
2014-08-01 00:29:48 +07:00
|
|
|
*/
|
|
|
|
bool single_task_running(void)
|
|
|
|
{
|
2015-09-18 16:27:45 +07:00
|
|
|
return raw_rq()->nr_running == 1;
|
2014-08-01 00:29:48 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(single_task_running);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long long nr_context_switches(void)
|
2007-05-08 14:32:51 +07:00
|
|
|
{
|
2006-06-27 16:54:31 +07:00
|
|
|
int i;
|
|
|
|
unsigned long long sum = 0;
|
2007-05-08 14:32:51 +07:00
|
|
|
|
2006-03-28 16:56:37 +07:00
|
|
|
for_each_possible_cpu(i)
|
2005-04-17 05:20:36 +07:00
|
|
|
sum += cpu_rq(i)->nr_switches;
|
2007-05-08 14:32:51 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return sum;
|
|
|
|
}
|
2009-02-05 02:59:44 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long nr_iowait(void)
|
|
|
|
{
|
|
|
|
unsigned long i, sum = 0;
|
2009-02-05 02:59:44 +07:00
|
|
|
|
2006-03-28 16:56:37 +07:00
|
|
|
for_each_possible_cpu(i)
|
2005-04-17 05:20:36 +07:00
|
|
|
sum += atomic_read(&cpu_rq(i)->nr_iowait);
|
2007-05-08 14:32:51 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return sum;
|
|
|
|
}
|
2009-02-05 02:59:44 +07:00
|
|
|
|
2010-07-01 14:07:17 +07:00
|
|
|
unsigned long nr_iowait_cpu(int cpu)
|
2009-09-22 07:04:08 +07:00
|
|
|
{
|
2010-07-01 14:07:17 +07:00
|
|
|
struct rq *this = cpu_rq(cpu);
|
2009-09-22 07:04:08 +07:00
|
|
|
return atomic_read(&this->nr_iowait);
|
|
|
|
}
|
2007-05-08 14:32:51 +07:00
|
|
|
|
2014-08-06 20:19:21 +07:00
|
|
|
void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
|
|
|
|
{
|
2015-04-14 18:19:42 +07:00
|
|
|
struct rq *rq = this_rq();
|
|
|
|
*nr_waiters = atomic_read(&rq->nr_iowait);
|
|
|
|
*load = rq->load.weight;
|
2014-08-06 20:19:21 +07:00
|
|
|
}
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
#ifdef CONFIG_SMP
|
sched: don't rebalance if attached on NULL domain
Impact: fix function graph trace hang / drop pointless softirq on UP
While debugging a function graph trace hang on an old PII, I saw
that it spent most of its time in the timer interrupt, and
the domain rebalancing softirq was the main culprit.
The timer interrupt calls trigger_load_balance(), which
decides whether it is worth scheduling a rebalancing softirq.
In case of builtin UP kernel, no problem arises because there is
no domain question.
In case of builtin SMP kernel running on an SMP box, still no
problem, the softirq will be raised each time we reach the
next_balance time.
In case of builtin SMP kernel running on a UP box (most distros
provide default SMP kernels, whatever box you have), the CPU
is attached to the NULL sched domain, and unexpected
behaviour follows:
trigger_load_balance() -> raises the rebalancing softirq; later,
in the softirq: run_rebalance_domains() -> rebalance_domains(), where
the for_each_domain(cpu, sd) loop body is never entered because of
the NULL domain we are attached to. This means rq->next_balance is
never updated. So on the next timer tick, we will enter
trigger_load_balance(), which will always raise the
rebalancing softirq again:
if (time_after_eq(jiffies, rq->next_balance))
raise_softirq(SCHED_SOFTIRQ);
So for each tick, we process this pointless softirq.
This patch fixes it by checking if we are attached to the null
domain before raising the softirq, another possible fix would be
to set the maximal possible JIFFIES value to rq->next_balance if
we are attached to the NULL domain.
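The check described above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel's code: `balance_rq`, `tick_after_eq()`, and `should_raise_balance_softirq()` are hypothetical names, and a plain pointer stands in for the attached sched domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe "a is at or after b" for a free-running tick counter. */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Hypothetical stand-in for the runqueue fields involved. */
struct balance_rq {
	unsigned long next_balance;	/* tick at which to rebalance next */
	const void *sd;			/* attached sched domain, NULL if none */
};

/* Decide whether the timer tick should raise the rebalancing softirq. */
static bool should_raise_balance_softirq(const struct balance_rq *rq,
					 unsigned long now)
{
	assert(rq != NULL);
	/* No domain: nothing would ever advance next_balance, so skip. */
	if (rq->sd == NULL)
		return false;
	return tick_after_eq(now, rq->next_balance);
}
```

Without the NULL-domain test, a runqueue whose next_balance is never advanced would satisfy the time comparison on every tick, which is the pointless softirq the changelog describes.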
v2: build fix on UP
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af242d.1c07d00a.32d5.ffffc019@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 07:27:02 +07:00
|
|
|
|
2007-05-08 14:32:51 +07:00
|
|
|
/*
|
2009-12-17 00:04:37 +07:00
|
|
|
* sched_exec - execve() is a valuable balancing opportunity, because at
|
|
|
|
* this point the task has the smallest effective memory and cache footprint.
|
2007-05-08 14:32:51 +07:00
|
|
|
*/
|
2009-12-17 00:04:37 +07:00
|
|
|
void sched_exec(void)
|
2007-05-08 14:32:51 +07:00
|
|
|
{
|
2009-12-17 00:04:37 +07:00
|
|
|
struct task_struct *p = current;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long flags;
|
2010-03-25 00:34:10 +07:00
|
|
|
int dest_cpu;
|
2007-05-08 14:32:51 +07:00
|
|
|
|
2011-04-05 22:23:53 +07:00
|
|
|
raw_spin_lock_irqsave(&p->pi_lock, flags);
|
2013-10-07 17:29:16 +07:00
|
|
|
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
|
2010-03-25 00:34:10 +07:00
|
|
|
if (dest_cpu == smp_processor_id())
|
|
|
|
goto unlock;
|
2009-12-17 00:04:37 +07:00
|
|
|
|
2011-04-05 22:23:53 +07:00
|
|
|
if (likely(cpu_active(dest_cpu))) {
|
2010-05-06 23:49:21 +07:00
|
|
|
struct migration_arg arg = { p, dest_cpu };
|
2007-05-08 14:32:51 +07:00
|
|
|
|
2011-04-05 22:23:53 +07:00
|
|
|
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
|
|
|
|
stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
|
2005-04-17 05:20:36 +07:00
|
|
|
return;
|
|
|
|
}
|
2010-03-25 00:34:10 +07:00
|
|
|
unlock:
|
2011-04-05 22:23:53 +07:00
|
|
|
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
|
|
|
|
|
|
|
DEFINE_PER_CPU(struct kernel_stat, kstat);
|
2011-11-28 23:45:17 +07:00
|
|
|
DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
EXPORT_PER_CPU_SYMBOL(kstat);
|
2011-11-28 23:45:17 +07:00
|
|
|
EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
sched/cputime: Mitigate performance regression in times()/clock_gettime()
Commit:
6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
allow a task to wake early. It addressed the problem by calling the scheduling
class's update_curr() when the cputimer starts.
Said change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.
This patch mitigates the performance loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.
Here is the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percent for 5-20
threads and more than 10% for 2 or >20 threads.
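The gain formula stated above can be written as a small helper. This is a sketch for illustration only; `gain_percent()` is a hypothetical name. Values recomputed from the rounded times in the tables may differ slightly from the printed percentages, presumably because those were derived from unrounded means.

```c
#include <assert.h>

/*
 * Percentage gain as defined in the changelog: (before - after) / before.
 * A positive result means the patched kernel is faster.
 */
static double gain_percent(double before_secs, double after_secs)
{
	assert(before_secs > 0.0);
	return (before_secs - after_secs) / before_secs * 100.0;
}
```

For the 5-thread pound_clock_gettime row (3.33 s before, 3.25 s after) this gives roughly 2.40, and for the 12-thread row (3.32 s before, 3.37 s after) it gives a negative value, that is, a slowdown.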
pound_clock_gettime:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.48 3.06 ( 11.83%)
5 3.33 3.25 ( 2.40%)
8 3.37 3.26 ( 3.30%)
12 3.32 3.37 ( -1.60%)
21 4.01 3.90 ( 2.74%)
30 3.63 3.36 ( 7.41%)
48 3.71 3.11 ( 16.27%)
79 3.75 3.16 ( 15.74%)
110 3.81 3.25 ( 14.80%)
128 3.88 3.31 ( 14.76%)
pound_times:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.65 3.25 ( 11.03%)
5 3.45 3.17 ( 7.92%)
8 3.52 3.22 ( 8.69%)
12 3.29 3.36 ( -2.04%)
21 4.07 3.92 ( 3.78%)
30 3.87 3.40 ( 12.17%)
48 3.79 3.16 ( 16.61%)
79 3.88 3.28 ( 15.42%)
110 3.90 3.38 ( 13.35%)
128 4.00 3.38 ( 15.45%)
pound_times and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettime(). The results above can be
reproduced by cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above will run "poundtime" measuring the kernel currently running on
the machine. Once a new kernel is installed and the machine rebooted,
running again
$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)
will produce results to compare with. A comparison table will be output
with:
$ cd mmtests/work/log
$ ../../compare-kernels.sh
the table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clarity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be
struct sched_entity *curr = cfs_rq->curr;
and
delta_exec = now - curr->exec_start;
in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr is called in the
execution path of interest.
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):
threads cycles before cycles after gain
2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88%
5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85%
8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74%
12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74%
21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89%
30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92%
48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10%
79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33%
110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21%
128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99%
A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):
threads L1 misses/total*100 L1 misses/total*100 gain
before after
2 7.43 +-4.90% 7.36 +-4.70% 0.94%
5 13.09 +-4.74% 13.52 +-3.73% -3.28%
8 13.79 +-5.61% 12.90 +-3.27% 6.45%
12 11.57 +-2.44% 8.71 +-1.40% 24.72%
21 12.39 +-3.92% 9.97 +-1.84% 19.53%
30 13.91 +-2.53% 11.73 +-2.28% 15.67%
48 13.71 +-1.59% 12.32 +-1.97% 10.14%
79 14.44 +-0.66% 13.40 +-1.06% 7.20%
110 15.86 +-0.50% 14.46 +-0.59% 8.83%
128 16.51 +-0.32% 15.06 +-0.78% 8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.23 3.68 ( -64.56%) 3.48 (-55.48%)
5 2.83 3.78 ( -33.42%) 3.33 (-17.43%)
8 2.84 4.31 ( -52.12%) 3.37 (-18.76%)
12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%)
21 3.14 4.63 ( -47.36%) 4.01 (-27.71%)
30 3.28 5.75 ( -75.37%) 3.63 (-10.80%)
48 3.02 6.05 (-100.56%) 3.71 (-22.99%)
79 2.88 6.30 (-118.90%) 3.75 (-30.26%)
110 2.95 6.46 (-119.00%) 3.81 (-29.24%)
128 3.05 6.42 (-110.08%) 3.88 (-27.04%)
pound_times:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.27 3.73 ( -64.71%) 3.65 (-61.14%)
5 2.78 3.77 ( -35.56%) 3.45 (-23.98%)
8 2.79 4.41 ( -57.71%) 3.52 (-26.05%)
12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%)
21 3.10 4.61 ( -48.74%) 4.07 (-31.34%)
30 3.33 5.75 ( -72.53%) 3.87 (-16.01%)
48 2.96 6.06 (-105.04%) 3.79 (-28.10%)
79 2.88 6.24 (-116.83%) 3.88 (-34.81%)
110 2.98 6.37 (-114.08%) 3.90 (-31.12%)
128 3.10 6.35 (-104.61%) 4.00 (-28.87%)
The source code of the two benchmarks follows. To compile the two:
NR_THREADS=42
for FILE in pound_times pound_clock_gettime; do
gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
done
==== BEGIN pound_times.c ====
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

/* Each thread calls times() repeatedly and checks utime never decreases. */
void *pound (void *threadid)
{
	struct tms end;
	int oldutime = 0;
	int utime;
	int i;
	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		times(&end);
		utime = ((int)end.tms_utime - (int)start.tms_utime);
		if (oldutime > utime) {
			printf("utime decreased, was %d, now %d!\n", oldutime, utime);
		}
		oldutime = utime;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long i;
	times(&start);
	for (i = 0; i < NUM_THREADS; i++) {
		pthread_create (&th[i], NULL, pound, (void *)i);
	}
	/* Let the worker threads run to completion. */
	pthread_exit(NULL);
	return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <sys/types.h>

/* Each thread calls clock_gettime() on the process CPU clock repeatedly. */
void *pound (void *threadid)
{
	struct timespec ts;
	int rc, i;
	unsigned long prev = 0, this = 0;
	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		if (rc < 0)
			perror("clock_gettime");
		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
		/* The timewarp check is deliberately disabled by the "0 &&". */
		if (0 && this < prev)
			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
		prev = this;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long rc, i;
	pid_t pgid;
	for (i = 0; i < NUM_THREADS; i++) {
		rc = pthread_create(&th[i], NULL, pound, (void *)i);
		if (rc < 0)
			perror("pthread_create");
	}
	/* Let the worker threads run to completion. */
	pthread_exit(NULL);
	return 0;
}
==== END pound_clock_gettime.c ====
Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-05 15:21:56 +07:00
|
|
|
/*
|
|
|
|
* The function fair_sched_class.update_curr accesses the struct curr
|
|
|
|
* and its field curr->exec_start; when called from task_sched_runtime(),
|
|
|
|
* we observe a high rate of cache misses in practice.
|
|
|
|
* Prefetching this data results in improved performance.
|
|
|
|
*/
|
|
|
|
static inline void prefetch_curr_exec_start(struct task_struct *p)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
|
|
|
struct sched_entity *curr = (&p->se)->cfs_rq->curr;
|
|
|
|
#else
|
|
|
|
struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
|
|
|
|
#endif
|
|
|
|
prefetch(curr);
|
|
|
|
prefetch(&curr->exec_start);
|
|
|
|
}
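The prefetch-before-use idea behind prefetch_curr_exec_start() can be shown in a userspace sketch. Everything here is an assumption made for illustration: `fake_entity`, `fake_cfs_rq`, `fake_prefetch_exec_start()`, and `fake_delta_exec()` are hypothetical names, and the GCC/Clang builtin `__builtin_prefetch()` stands in for the kernel's prefetch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sched_entity; only the field we read. */
struct fake_entity {
	uint64_t exec_start;
};

/* Hypothetical stand-in for a cfs_rq pointing at the current entity. */
struct fake_cfs_rq {
	struct fake_entity *curr;
};

/* Hint the CPU to start loading curr and curr->exec_start early. */
static inline void fake_prefetch_exec_start(const struct fake_cfs_rq *cfs)
{
	const struct fake_entity *curr = cfs->curr;

	__builtin_prefetch(curr);
	__builtin_prefetch(&curr->exec_start);
}

/* Compute the not-yet-accounted runtime, like delta_exec in update_curr(). */
static uint64_t fake_delta_exec(const struct fake_cfs_rq *cfs, uint64_t now)
{
	assert(cfs != NULL && cfs->curr != NULL);
	fake_prefetch_exec_start(cfs);
	/* Other work would normally overlap with the prefetch here. */
	return now - cfs->curr->exec_start;
}
```

A prefetch is only a hint: it never changes the computed value, so any benefit shows up solely as fewer stalls when the later load hits the cache.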
|
|
|
|
|
2009-03-31 14:56:03 +07:00
|
|
|
/*
|
|
|
|
* Return accounted runtime for the task.
|
|
|
|
* In case the task is currently running, return the runtime plus current's
|
|
|
|
* pending runtime that has not been accounted yet.
|
|
|
|
*/
|
|
|
|
unsigned long long task_sched_runtime(struct task_struct *p)
|
|
|
|
{
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
2009-03-31 14:56:03 +07:00
|
|
|
struct rq *rq;
|
sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case at the cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
in Y being smaller than X.
A reproducer/tester can be found further below; it can be compiled and run by:
gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
while ./tst-cpuclock2 ; do : ; done
This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".
The issue happens because on start in thread_group_cputimer() we initialize
the cputimer's sum_exec_runtime with thread runtime that is not yet accounted,
and then add that runtime to the running cputimer again on the scheduler
tick, making its sum_exec_runtime bigger than the actual thread runtime.
KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .
This patch takes a different approach to cure the problem. It calls
update_curr() when the cputimer starts, which ensures we have updated
stats for running threads, so that on the next scheduler tick we account
only the runtime that elapsed since the cputimer started. That also ensures
a consistent state between the CPU times of individual threads and the CPU
time of the process consisting of those threads.
Full reproducer (tst-cpuclock2.c):
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>
/* Parameters for the Linux kernel ABI for CPU clocks. */
#define CPUCLOCK_SCHED 2
#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
static pthread_barrier_t barrier;
/* Help advance the clock. */
static void *chew_cpu(void *arg)
{
pthread_barrier_wait(&barrier);
while (1) ;
return NULL;
}
/* Don't use the glibc wrapper. */
static int do_nanosleep(int flags, const struct timespec *req)
{
clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}
static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
return after_i - before_i;
}
int main(void)
{
int result = 0;
pthread_t th;
pthread_barrier_init(&barrier, NULL, 2);
if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
perror("pthread_create");
return 1;
}
pthread_barrier_wait(&barrier);
/* The test. */
struct timespec before, after, sleeptimeabs;
int64_t sleepdiff, diffabs;
const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
/* The relative nanosleep. Not sure why this is needed, but its presence
seems to make it easier to reproduce the problem. */
if (do_nanosleep(0, &sleeptime) != 0) {
perror("clock_nanosleep");
return 1;
}
/* Get the current time. */
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
perror("clock_gettime[2]");
return 1;
}
/* Compute the absolute sleep time based on the current time. */
uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
sleeptimeabs.tv_nsec = nsec % 1000000000;
/* Sleep for the computed time. */
if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
perror("absolute clock_nanosleep");
return 1;
}
/* Get the time after the sleep. */
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
perror("clock_gettime[3]");
return 1;
}
/* The time after sleep should always be equal to or after the absolute sleep
time passed to clock_nanosleep. */
sleepdiff = tsdiff(&sleeptimeabs, &after);
if (sleepdiff < 0) {
printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
result = 1;
printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec);
printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
}
/* The difference between the timestamps taken before and after the
clock_nanosleep call should be equal to or more than the duration of the
sleep. */
diffabs = tsdiff(&before, &after);
if (diffabs < sleeptime.tv_nsec) {
printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
result = 1;
}
pthread_cancel(th);
return result;
}
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-12 22:58:44 +07:00
|
|
|
u64 ns;
|
2009-03-31 14:56:03 +07:00
|
|
|
|
2013-11-12 00:21:56 +07:00
|
|
|
#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
|
|
|
|
/*
|
|
|
|
* 64-bit doesn't need locks to atomically read a 64bit value.
|
|
|
|
* So we have an optimization opportunity when the task's delta_exec is 0.
|
|
|
|
* Reading ->on_cpu is racy, but this is ok.
|
|
|
|
*
|
|
|
|
* If we race with it leaving cpu, we'll take a lock. So we're correct.
|
|
|
|
* If we race with it entering cpu, unaccounted time is 0. This is
|
|
|
|
* indistinguishable from the read occurring a few cycles earlier.
|
2014-06-24 12:49:40 +07:00
|
|
|
* If we see ->on_cpu without ->on_rq, the task is leaving, and has
|
|
|
|
* been accounted, so we're correct here as well.
|
2013-11-12 00:21:56 +07:00
|
|
|
*/
|
2014-08-20 16:47:32 +07:00
|
|
|
if (!p->on_cpu || !task_on_rq_queued(p))
|
2013-11-12 00:21:56 +07:00
|
|
|
return p->se.sum_exec_runtime;
|
|
|
|
#endif
|
|
|
|
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = task_rq_lock(p, &rf);
|
2014-11-12 22:58:44 +07:00
|
|
|
/*
|
|
|
|
* Must be ->curr _and_ ->on_rq. If dequeued, we would
|
|
|
|
* project cycles that may never be accounted to this
|
|
|
|
* thread, breaking clock_gettime().
|
|
|
|
*/
|
|
|
|
if (task_current(rq, p) && task_on_rq_queued(p)) {
|
2016-08-05 15:21:56 +07:00
|
|
|
prefetch_curr_exec_start(p);
|
sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case at the cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
in Y being smaller than X.
The reproducer/tester can be found further below; it can be compiled and run with:
gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
while ./tst-cpuclock2 ; do : ; done
This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".
The issue happens because, on start, thread_group_cputimer() initializes
the cputimer's sum_exec_runtime with thread runtime that is not yet
accounted, and the scheduler tick then adds that runtime to the running
cputimer again, making its sum_exec_runtime bigger than the actual thread
runtime.
KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .
This patch takes a different approach to cure the problem. It calls
update_curr() when the cputimer starts, which ensures that the stats of
running threads are up to date, so that on the next scheduler tick we
account only the runtime that elapsed since the cputimer started. That also
ensures a consistent state between the CPU times of individual threads and
the CPU time of the process consisting of those threads.
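The double accounting described above can be illustrated with a small user-space model. This is only a sketch: every name (toy_thread, toy_cputimer, toy_update_curr() and so on) is made up for illustration and none of it is kernel code. The assumption is that a thread's pending runtime is folded into its sum_exec_runtime only when an update_curr()-like step runs.

```c
#include <stdint.h>

/* Toy model (hypothetical names): runtime since exec_start is "pending"
 * until toy_update_curr() folds it into sum_exec_runtime. */
struct toy_thread {
	uint64_t sum_exec_runtime;	/* ns already accounted */
	uint64_t exec_start;		/* clock value at last accounting */
};

struct toy_cputimer {
	uint64_t sum_exec_runtime;	/* process-wide running total */
};

/* Fold the pending runtime into the thread and return the delta. */
static uint64_t toy_update_curr(struct toy_thread *t, uint64_t now)
{
	uint64_t delta = now - t->exec_start;

	t->sum_exec_runtime += delta;
	t->exec_start = now;
	return delta;
}

/* Buggy start: the snapshot includes the pending runtime, but exec_start
 * is left untouched, so the next tick charges the same time again. */
static void toy_timer_start_buggy(struct toy_cputimer *ct,
				  const struct toy_thread *t, uint64_t now)
{
	ct->sum_exec_runtime = t->sum_exec_runtime + (now - t->exec_start);
}

/* Fixed start: account the pending runtime first, then snapshot. */
static void toy_timer_start_fixed(struct toy_cputimer *ct,
				  struct toy_thread *t, uint64_t now)
{
	toy_update_curr(t, now);
	ct->sum_exec_runtime = t->sum_exec_runtime;
}

/* Scheduler tick while the timer runs: the delta is charged to both. */
static void toy_tick(struct toy_cputimer *ct, struct toy_thread *t,
		     uint64_t now)
{
	ct->sum_exec_runtime += toy_update_curr(t, now);
}
```

With a thread that has been running since time 0, starting the timer at 70 and ticking at 100 leaves the buggy timer at 170 while the thread itself only ran for 100; the fixed variant ends at 100 for both.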
Full reproducer (tst-cpuclock2.c):
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>
/* Parameters for the Linux kernel ABI for CPU clocks. */
#define CPUCLOCK_SCHED 2
#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
static pthread_barrier_t barrier;
/* Help advance the clock. */
static void *chew_cpu(void *arg)
{
pthread_barrier_wait(&barrier);
while (1) ;
return NULL;
}
/* Don't use the glibc wrapper. */
static int do_nanosleep(int flags, const struct timespec *req)
{
clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}
static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
return after_i - before_i;
}
int main(void)
{
int result = 0;
pthread_t th;
pthread_barrier_init(&barrier, NULL, 2);
if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
perror("pthread_create");
return 1;
}
pthread_barrier_wait(&barrier);
/* The test. */
struct timespec before, after, sleeptimeabs;
int64_t sleepdiff, diffabs;
const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
/* The relative nanosleep. Not sure why this is needed, but its presence
seems to make it easier to reproduce the problem. */
if (do_nanosleep(0, &sleeptime) != 0) {
perror("clock_nanosleep");
return 1;
}
/* Get the current time. */
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
perror("clock_gettime[2]");
return 1;
}
/* Compute the absolute sleep time based on the current time. */
uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
sleeptimeabs.tv_nsec = nsec % 1000000000;
/* Sleep for the computed time. */
if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
perror("absolute clock_nanosleep");
return 1;
}
/* Get the time after the sleep. */
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
perror("clock_gettime[3]");
return 1;
}
/* The time after sleep should always be equal to or after the absolute sleep
time passed to clock_nanosleep. */
sleepdiff = tsdiff(&sleeptimeabs, &after);
if (sleepdiff < 0) {
printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
result = 1;
printf("Before %lld.%09ld\n", (long long)before.tv_sec, (long)before.tv_nsec);
printf("After %lld.%09ld\n", (long long)after.tv_sec, (long)after.tv_nsec);
printf("Sleep %lld.%09ld\n", (long long)sleeptimeabs.tv_sec, (long)sleeptimeabs.tv_nsec);
}
/* The difference between the timestamps taken before and after the
clock_nanosleep call should be equal to or more than the duration of the
sleep. */
diffabs = tsdiff(&before, &after);
if (diffabs < sleeptime.tv_nsec) {
printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
result = 1;
}
pthread_cancel(th);
return result;
}
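The deadline arithmetic used by the reproducer (add a relative interval to a timestamp, carry overflowing nanoseconds into seconds, then compare two timestamps) can be factored into two small helpers. The following is a sketch, not part of the original test: the names ts_add() and ts_diff_ns() are hypothetical, and both inputs are assumed to be normalized (0 <= tv_nsec < 1000000000).

```c
#include <stdint.h>
#include <time.h>

#define TOY_NSEC_PER_SEC 1000000000LL

/* Add two normalized timespecs, carrying nanoseconds into seconds. */
static struct timespec ts_add(struct timespec a, struct timespec b)
{
	struct timespec r;
	int64_t nsec = (int64_t)a.tv_nsec + b.tv_nsec;

	r.tv_sec = a.tv_sec + b.tv_sec + nsec / TOY_NSEC_PER_SEC;
	r.tv_nsec = nsec % TOY_NSEC_PER_SEC;
	return r;
}

/* Signed difference "after - before" in nanoseconds. */
static int64_t ts_diff_ns(struct timespec before, struct timespec after)
{
	return ((int64_t)after.tv_sec - before.tv_sec) * TOY_NSEC_PER_SEC
		+ (after.tv_nsec - before.tv_nsec);
}
```

For example, adding 100 ms to 1.950000000 s yields 2.050000000 s, and the difference between the two is 100000000 ns.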
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-12 22:58:44 +07:00
|
|
|
update_rq_clock(rq);
|
|
|
|
p->sched_class->update_curr(rq);
|
|
|
|
}
|
|
|
|
ns = p->se.sum_exec_runtime;
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2009-03-31 14:56:03 +07:00
|
|
|
|
|
|
|
return ns;
|
|
|
|
}
|
2006-07-03 14:25:40 +07:00
|
|
|
|
2006-12-10 17:20:22 +07:00
|
|
|
/*
|
|
|
|
* This function gets called by the timer code, with HZ frequency.
|
|
|
|
* We call it with interrupts disabled.
|
|
|
|
*/
|
|
|
|
void scheduler_tick(void)
|
|
|
|
{
|
|
|
|
int cpu = smp_processor_id();
|
|
|
|
struct rq *rq = cpu_rq(cpu);
|
2007-07-09 23:51:59 +07:00
|
|
|
struct task_struct *curr = rq->curr;
|
2008-05-03 23:29:28 +07:00
|
|
|
|
|
|
|
sched_clock_tick();
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_lock(&rq->lock);
|
2008-05-03 23:29:28 +07:00
|
|
|
update_rq_clock(rq);
|
2008-01-26 03:08:29 +07:00
|
|
|
curr->sched_class->task_tick(rq, curr, 0);
|
2016-04-13 20:56:50 +07:00
|
|
|
cpu_load_update_active(rq);
|
2015-04-14 18:19:42 +07:00
|
|
|
calc_global_load_tick(rq);
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_unlock(&rq->lock);
|
2006-12-10 17:20:22 +07:00
|
|
|
|
2010-09-17 16:28:50 +07:00
|
|
|
perf_event_task_tick();
|
2009-05-23 23:28:55 +07:00
|
|
|
|
2006-12-10 17:20:23 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2011-10-04 05:09:01 +07:00
|
|
|
rq->idle_balance = idle_cpu(cpu);
|
2014-01-06 18:34:38 +07:00
|
|
|
trigger_load_balance(rq);
|
2006-12-10 17:20:23 +07:00
|
|
|
#endif
|
2013-05-03 08:39:05 +07:00
|
|
|
rq_last_tick_reset(rq);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2013-05-03 08:39:05 +07:00
|
|
|
#ifdef CONFIG_NO_HZ_FULL
|
|
|
|
/**
|
|
|
|
* scheduler_tick_max_deferment
|
|
|
|
*
|
|
|
|
* Keep at least one tick per second when a single
|
|
|
|
* active task is running because the scheduler doesn't
|
|
|
|
* yet completely support full dynticks environment.
|
|
|
|
*
|
|
|
|
* This makes sure that uptime, CFS vruntime, load
|
|
|
|
* balancing, etc... continue to move forward, even
|
|
|
|
* with a very low granularity.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: Maximum deferment in nanoseconds.
|
2013-05-03 08:39:05 +07:00
|
|
|
*/
|
|
|
|
u64 scheduler_tick_max_deferment(void)
|
|
|
|
{
|
|
|
|
struct rq *rq = this_rq();
|
2015-04-29 03:00:20 +07:00
|
|
|
unsigned long next, now = READ_ONCE(jiffies);
|
2013-05-03 08:39:05 +07:00
|
|
|
|
|
|
|
next = rq->last_sched_tick + HZ;
|
|
|
|
|
|
|
|
if (time_before_eq(next, now))
|
|
|
|
return 0;
|
|
|
|
|
2014-01-15 20:51:38 +07:00
|
|
|
return jiffies_to_nsecs(next - now);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2013-05-03 08:39:05 +07:00
|
|
|
#endif
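The deferment computation in scheduler_tick_max_deferment() relies on a wraparound-safe comparison of tick counters. A user-space sketch of the same idea follows. TOY_HZ, toy_time_before_eq() and toy_max_deferment() are illustrative names, the tick rate of 250 is an assumption, and the conversion to nanoseconds is a simplification of what jiffies_to_nsecs() does.

```c
#include <stdint.h>

#define TOY_HZ			250UL		/* assumed tick rate */
#define TOY_NSEC_PER_SEC	1000000000ULL

/* Wraparound-safe "a <= b" for free-running tick counters, in the spirit
 * of the kernel's time_before_eq(): compare the signed difference. */
static int toy_time_before_eq(unsigned long a, unsigned long b)
{
	return (long)(b - a) >= 0;
}

/* Nanoseconds until one second after the last tick, or 0 if that point
 * has already passed. */
static uint64_t toy_max_deferment(unsigned long last_tick, unsigned long now)
{
	unsigned long next = last_tick + TOY_HZ;

	if (toy_time_before_eq(next, now))
		return 0;

	return (uint64_t)(next - now) * (TOY_NSEC_PER_SEC / TOY_HZ);
}
```

Because only differences are compared, the result stays correct when the counter wraps between last_tick and next.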
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-01-23 07:01:40 +07:00
|
|
|
#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
|
|
|
|
defined(CONFIG_PREEMPT_TRACER))
|
sched/core: Add preempt checks in preempt_schedule() code
While testing the tracer preemptoff, I hit this strange trace:
<...>-259 0...1 0us : schedule <-worker_thread
<...>-259 0d..1 0us : rcu_note_context_switch <-__schedule
<...>-259 0d..1 0us : rcu_sched_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : _raw_spin_lock <-__schedule
<...>-259 0d..1 0us : preempt_count_add <-_raw_spin_lock
<...>-259 0d..2 0us : do_raw_spin_lock <-_raw_spin_lock
<...>-259 0d..2 1us : deactivate_task <-__schedule
<...>-259 0d..2 1us : update_rq_clock.part.84 <-deactivate_task
<...>-259 0d..2 1us : dequeue_task_fair <-deactivate_task
<...>-259 0d..2 1us : dequeue_entity <-dequeue_task_fair
<...>-259 0d..2 1us : update_curr <-dequeue_entity
<...>-259 0d..2 1us : update_min_vruntime <-update_curr
<...>-259 0d..2 1us : cpuacct_charge <-update_curr
<...>-259 0d..2 1us : __rcu_read_lock <-cpuacct_charge
<...>-259 0d..2 1us : __rcu_read_unlock <-cpuacct_charge
<...>-259 0d..2 1us : clear_buddies <-dequeue_entity
<...>-259 0d..2 1us : account_entity_dequeue <-dequeue_entity
<...>-259 0d..2 2us : update_min_vruntime <-dequeue_entity
<...>-259 0d..2 2us : update_cfs_shares <-dequeue_entity
<...>-259 0d..2 2us : hrtick_update <-dequeue_task_fair
<...>-259 0d..2 2us : wq_worker_sleeping <-__schedule
<...>-259 0d..2 2us : kthread_data <-wq_worker_sleeping
<...>-259 0d..2 2us : pick_next_task_fair <-__schedule
<...>-259 0d..2 2us : check_cfs_rq_runtime <-pick_next_task_fair
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : set_next_entity <-pick_next_task_fair
<...>-259 0d..2 3us : put_prev_entity <-pick_next_task_fair
<...>-259 0d..2 3us : check_cfs_rq_runtime <-put_prev_entity
<...>-259 0d..2 3us : set_next_entity <-pick_next_task_fair
gnome-sh-1031 0d..2 3us : finish_task_switch <-__schedule
gnome-sh-1031 0d..2 3us : _raw_spin_unlock_irq <-finish_task_switch
gnome-sh-1031 0d..2 3us : do_raw_spin_unlock <-_raw_spin_unlock_irq
gnome-sh-1031 0...2 3us!: preempt_count_sub <-_raw_spin_unlock_irq
gnome-sh-1031 0...1 582us : do_raw_spin_lock <-_raw_spin_lock
gnome-sh-1031 0...1 583us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 583us : do_raw_spin_unlock <-_raw_spin_unlock
gnome-sh-1031 0...1 583us : preempt_count_sub <-_raw_spin_unlock
gnome-sh-1031 0...1 584us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 584us+: trace_preempt_on <-drm_gem_object_lookup
gnome-sh-1031 0...1 603us : <stack trace>
=> preempt_count_sub
=> _raw_spin_unlock
=> drm_gem_object_lookup
=> i915_gem_madvise_ioctl
=> drm_ioctl
=> do_vfs_ioctl
=> SyS_ioctl
=> entry_SYSCALL_64_fastpath
As I'm tracing preemption disabled, it seemed incorrect that the trace
would go across a schedule and report not being in the scheduler.
Looking into this I discovered the problem.
schedule() calls preempt_disable(), but preempt_schedule() calls
preempt_enable_notrace(). What happened above was that the gnome-shell
task was preempted on another CPU and migrated over to the idle CPU. The
tracer started with idle calling schedule(), which called
preempt_disable(), but then gnome-shell finished, and it enabled
preemption with preempt_enable_notrace(), which does not stop the trace,
even though preemption was enabled.
The purpose of the preempt_disable_notrace() in preempt_schedule()
is to prevent function tracing from going into an infinite loop,
because function tracing can trace the preempt_enable/disable() calls.
The problem with function tracing is:
  NEED_RESCHED set
  preempt_schedule()
    preempt_disable()
      preempt_count_inc()
        function trace (before incrementing preempt count)
          preempt_disable_notrace()
          preempt_enable_notrace()
            sees NEED_RESCHED set
              preempt_schedule() (repeat)
Now by breaking out the preempt off/on tracing into their own code:
preempt_disable_check() and preempt_enable_check(), we can add these to
the preempt_schedule() code. As preemption would then be disabled, even
if they were to be traced by the function tracer, the disabled
preemption would prevent the recursion.
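The recursion described above, and why raising the preempt count before any traceable call breaks it, can be reproduced in a user-space toy model. Everything below is hypothetical scaffolding (toy_count, toy_function_trace(), TOY_DEPTH_CAP and so on), not the kernel's implementation; the depth cap exists only so that the "infinite" case terminates in the model.

```c
#define TOY_DEPTH_CAP 8		/* stop the runaway case in the model */

static int toy_count;		/* models the preempt count */
static int toy_need_resched;	/* models NEED_RESCHED */
static int toy_depth, toy_max_depth;
static int toy_trace_first;	/* 1: trace hook runs before the count is raised */

static void toy_preempt_schedule(void);

/* Models preempt_enable_notrace(): reschedule when the count drops to 0. */
static void toy_enable_notrace(void)
{
	if (--toy_count == 0 && toy_need_resched)
		toy_preempt_schedule();
}

/* Models the function tracer hook: it briefly disables preemption. */
static void toy_function_trace(void)
{
	toy_count++;
	toy_enable_notrace();
}

static void toy_preempt_schedule(void)
{
	if (++toy_depth > toy_max_depth)
		toy_max_depth = toy_depth;
	if (toy_depth >= TOY_DEPTH_CAP) {
		toy_depth--;
		return;
	}
	if (toy_trace_first) {
		toy_function_trace();	/* count is still 0: recursion */
		toy_count++;
	} else {
		toy_count++;		/* disable first ... */
		toy_function_trace();	/* ... so the hook never sees 0 */
	}
	toy_need_resched = 0;		/* the "schedule" itself */
	toy_count--;
	toy_depth--;
}

/* Run one scenario and report the deepest nesting reached. */
static int toy_run(int trace_first)
{
	toy_count = 0;
	toy_need_resched = 1;
	toy_depth = 0;
	toy_max_depth = 0;
	toy_trace_first = trace_first;
	toy_preempt_schedule();
	return toy_max_depth;
}
```

In this model toy_run(1) nests until it hits the cap, while toy_run(0) never goes deeper than one level, which matches the argument that tracing is safe once preemption is already disabled.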
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 22:23:39 +07:00
|
|
|
/*
|
|
|
|
* If the value passed in is equal to the current preempt count
|
|
|
|
* then we just disabled preemption. Start timing the latency.
|
|
|
|
*/
|
|
|
|
static inline void preempt_latency_start(int val)
|
|
|
|
{
|
|
|
|
if (preempt_count() == val) {
|
|
|
|
unsigned long ip = get_lock_parent_ip();
|
|
|
|
#ifdef CONFIG_DEBUG_PREEMPT
|
|
|
|
current->preempt_disable_ip = ip;
|
|
|
|
#endif
|
|
|
|
trace_preempt_off(CALLER_ADDR0, ip);
|
|
|
|
}
|
|
|
|
}
|
2009-01-23 07:01:40 +07:00
|
|
|
|
2014-04-17 15:18:42 +07:00
|
|
|
void preempt_count_add(int val)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-05-13 02:20:42 +07:00
|
|
|
#ifdef CONFIG_DEBUG_PREEMPT
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Underflow?
|
|
|
|
*/
|
2006-07-03 14:24:33 +07:00
|
|
|
if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
|
|
|
|
return;
|
2008-05-13 02:20:42 +07:00
|
|
|
#endif
|
2013-09-10 17:15:23 +07:00
|
|
|
__preempt_count_add(val);
|
2008-05-13 02:20:42 +07:00
|
|
|
#ifdef CONFIG_DEBUG_PREEMPT
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Spinlock count overflowing soon?
|
|
|
|
*/
|
2006-12-10 17:20:38 +07:00
|
|
|
DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
|
|
|
|
PREEMPT_MASK - 10);
|
2008-05-13 02:20:42 +07:00
|
|
|
#endif
|
2016-03-21 22:23:39 +07:00
|
|
|
preempt_latency_start(val);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2013-09-10 17:15:23 +07:00
|
|
|
EXPORT_SYMBOL(preempt_count_add);
|
2014-04-17 15:18:42 +07:00
|
|
|
NOKPROBE_SYMBOL(preempt_count_add);
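The "Spinlock count overflowing soon?" check in preempt_count_add() looks only at the low bits of the count selected by PREEMPT_MASK and warns once they come within 10 of the maximum. A stand-alone sketch of that predicate, assuming an 8-bit preempt field (TOY_PREEMPT_MASK and toy_near_overflow() are illustrative names):

```c
#define TOY_PREEMPT_MASK 0x000000ffU	/* assumed: low 8 bits hold the preempt depth */

/* Non-zero when the preempt-depth field is within 10 of its maximum. */
static int toy_near_overflow(unsigned int count)
{
	return (count & TOY_PREEMPT_MASK) >= TOY_PREEMPT_MASK - 10;
}
```

Bits above the mask do not influence the result, so a count of 0x103 is treated like a depth of 3.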
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-03-21 22:23:39 +07:00
|
|
|
/*
|
|
|
|
* If the value passed in is equal to the current preempt count
|
|
|
|
* then we just enabled preemption. Stop timing the latency.
|
|
|
|
*/
|
|
|
|
static inline void preempt_latency_stop(int val)
|
|
|
|
{
|
|
|
|
if (preempt_count() == val)
|
|
|
|
trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
|
|
|
|
}
|
|
|
|
|
2014-04-17 15:18:42 +07:00
|
|
|
void preempt_count_sub(int val)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-05-13 02:20:42 +07:00
|
|
|
#ifdef CONFIG_DEBUG_PREEMPT
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Underflow?
|
|
|
|
*/
|
2009-01-12 19:00:50 +07:00
|
|
|
if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
|
2006-07-03 14:24:33 +07:00
|
|
|
return;
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Is the spinlock portion underflowing?
|
|
|
|
*/
|
2006-07-03 14:24:33 +07:00
|
|
|
if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
|
|
|
|
!(preempt_count() & PREEMPT_MASK)))
|
|
|
|
return;
|
2008-05-13 02:20:42 +07:00
|
|
|
#endif
|
2006-07-03 14:24:33 +07:00
|
|
|
|
2016-03-21 22:23:39 +07:00
|
|
|
preempt_latency_stop(val);
|
2013-09-10 17:15:23 +07:00
|
|
|
__preempt_count_sub(val);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2013-09-10 17:15:23 +07:00
|
|
|
EXPORT_SYMBOL(preempt_count_sub);
|
2014-04-17 15:18:42 +07:00
|
|
|
NOKPROBE_SYMBOL(preempt_count_sub);
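The pairing of preempt_latency_start() and preempt_latency_stop() depends on ordering: the start check runs after the count has been raised and the stop check runs before it is lowered, so in both cases "count == val" identifies the outermost disable/enable. A user-space sketch of that logic, with hypothetical names and plain flags standing in for the trace_preempt_off()/trace_preempt_on() calls:

```c
static int toy_pc;		/* models the preempt count */
static int toy_timing;		/* 1 while a preempt-off section is being timed */
static int toy_sections;	/* number of completed preempt-off sections */

static void toy_count_add(int val)
{
	toy_pc += val;
	if (toy_pc == val)	/* count was 0: preemption just got disabled */
		toy_timing = 1;
}

static void toy_count_sub(int val)
{
	if (toy_pc == val) {	/* count is about to reach 0 */
		toy_timing = 0;
		toy_sections++;
	}
	toy_pc -= val;
}
```

Nested add/sub pairs leave the timing state alone; only the outermost pair starts and stops it.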
|
2005-04-17 05:20:36 +07:00
|
|
|
|
sched/core: Add preempt checks in preempt_schedule() code
While testing the tracer preemptoff, I hit this strange trace:
<...>-259 0...1 0us : schedule <-worker_thread
<...>-259 0d..1 0us : rcu_note_context_switch <-__schedule
<...>-259 0d..1 0us : rcu_sched_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : _raw_spin_lock <-__schedule
<...>-259 0d..1 0us : preempt_count_add <-_raw_spin_lock
<...>-259 0d..2 0us : do_raw_spin_lock <-_raw_spin_lock
<...>-259 0d..2 1us : deactivate_task <-__schedule
<...>-259 0d..2 1us : update_rq_clock.part.84 <-deactivate_task
<...>-259 0d..2 1us : dequeue_task_fair <-deactivate_task
<...>-259 0d..2 1us : dequeue_entity <-dequeue_task_fair
<...>-259 0d..2 1us : update_curr <-dequeue_entity
<...>-259 0d..2 1us : update_min_vruntime <-update_curr
<...>-259 0d..2 1us : cpuacct_charge <-update_curr
<...>-259 0d..2 1us : __rcu_read_lock <-cpuacct_charge
<...>-259 0d..2 1us : __rcu_read_unlock <-cpuacct_charge
<...>-259 0d..2 1us : clear_buddies <-dequeue_entity
<...>-259 0d..2 1us : account_entity_dequeue <-dequeue_entity
<...>-259 0d..2 2us : update_min_vruntime <-dequeue_entity
<...>-259 0d..2 2us : update_cfs_shares <-dequeue_entity
<...>-259 0d..2 2us : hrtick_update <-dequeue_task_fair
<...>-259 0d..2 2us : wq_worker_sleeping <-__schedule
<...>-259 0d..2 2us : kthread_data <-wq_worker_sleeping
<...>-259 0d..2 2us : pick_next_task_fair <-__schedule
<...>-259 0d..2 2us : check_cfs_rq_runtime <-pick_next_task_fair
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : set_next_entity <-pick_next_task_fair
<...>-259 0d..2 3us : put_prev_entity <-pick_next_task_fair
<...>-259 0d..2 3us : check_cfs_rq_runtime <-put_prev_entity
<...>-259 0d..2 3us : set_next_entity <-pick_next_task_fair
gnome-sh-1031 0d..2 3us : finish_task_switch <-__schedule
gnome-sh-1031 0d..2 3us : _raw_spin_unlock_irq <-finish_task_switch
gnome-sh-1031 0d..2 3us : do_raw_spin_unlock <-_raw_spin_unlock_irq
gnome-sh-1031 0...2 3us!: preempt_count_sub <-_raw_spin_unlock_irq
gnome-sh-1031 0...1 582us : do_raw_spin_lock <-_raw_spin_lock
gnome-sh-1031 0...1 583us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 583us : do_raw_spin_unlock <-_raw_spin_unlock
gnome-sh-1031 0...1 583us : preempt_count_sub <-_raw_spin_unlock
gnome-sh-1031 0...1 584us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 584us+: trace_preempt_on <-drm_gem_object_lookup
gnome-sh-1031 0...1 603us : <stack trace>
=> preempt_count_sub
=> _raw_spin_unlock
=> drm_gem_object_lookup
=> i915_gem_madvise_ioctl
=> drm_ioctl
=> do_vfs_ioctl
=> SyS_ioctl
=> entry_SYSCALL_64_fastpath
As I'm tracing preemption disabled, it seemed incorrect that the trace
would go across a schedule and report not being in the scheduler.
Looking into this I discovered the problem.
schedule() calls preempt_disable(), but preempt_schedule() calls
preempt_enable_notrace(). What happened above was that the gnome-shell
task was preempted on another CPU and migrated over to the idle CPU. The
tracer started with idle calling schedule(), which called
preempt_disable(), but then gnome-shell finished, and it enabled
preemption with preempt_enable_notrace(), which does not stop the trace,
even though preemption was enabled.
The purpose of the preempt_disable_notrace() in preempt_schedule()
is to prevent function tracing from going into an infinite loop,
because the function tracer can trace the preempt_enable/disable() calls
and uses the _notrace() variants itself. The problem with function tracing is:
  NEED_RESCHED set
  preempt_schedule()
    preempt_disable()
      preempt_count_inc()
        function trace (before incrementing preempt count)
          preempt_disable_notrace()
          preempt_enable_notrace()
            sees NEED_RESCHED set
              preempt_schedule() (repeat)
Now by breaking out the preempt off/on tracing into their own code:
preempt_disable_check() and preempt_enable_check(), we can add these to
the preempt_schedule() code. As preemption would then be disabled, even
if they were to be traced by the function tracer, the disabled
preemption would prevent the recursion.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
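The recursion described in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct cpu_model`, `model_function_trace()`, `model_preempt_schedule()`, `run_model()` and `DEPTH_LIMIT` are hypothetical names invented for illustration and are not kernel symbols. The model compares tracing before the preempt count is raised (the problematic order) with raising the count untraced first (the order the patch moves to).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one CPU; none of this is kernel code. */
struct cpu_model {
    int  preempt_count;  /* > 0 means preemption is disabled */
    bool need_resched;   /* stands in for NEED_RESCHED */
    int  depth;          /* current nesting of model_preempt_schedule() */
    int  max_depth;      /* deepest nesting observed */
    int  schedules;      /* how many times the model scheduler ran */
};

enum { DEPTH_LIMIT = 8 };  /* cut-off so the recursive variant terminates */

static void model_preempt_schedule(struct cpu_model *c, bool notrace_first);

static void model_schedule(struct cpu_model *c)
{
    c->need_resched = false;
    c->schedules++;
}

/*
 * Models the function tracer hook: it brackets its work with the
 * notrace disable/enable pair, and the enable side reschedules when
 * the count drops to zero while need_resched is set.
 */
static void model_function_trace(struct cpu_model *c, bool notrace_first)
{
    c->preempt_count++;  /* models preempt_disable_notrace() */
    c->preempt_count--;  /* models preempt_enable_notrace() */
    if (c->preempt_count == 0 && c->need_resched)
        model_preempt_schedule(c, notrace_first);
}

static void model_preempt_schedule(struct cpu_model *c, bool notrace_first)
{
    c->depth++;
    if (c->depth > c->max_depth)
        c->max_depth = c->depth;
    if (c->depth >= DEPTH_LIMIT) {  /* stop the runaway recursion */
        c->depth--;
        return;
    }

    if (notrace_first) {
        c->preempt_count++;                      /* untraced disable first */
        model_function_trace(c, notrace_first);  /* traced hook is now safe */
    } else {
        model_function_trace(c, notrace_first);  /* traced before the increment */
        c->preempt_count++;
    }

    model_schedule(c);
    c->preempt_count--;
    c->depth--;
}

/* Runs one preemption request through the model and returns the final state. */
static struct cpu_model run_model(bool notrace_first)
{
    struct cpu_model c = { .need_resched = true };

    model_preempt_schedule(&c, notrace_first);
    return c;
}
```

Calling `run_model(false)` nests until the artificial `DEPTH_LIMIT` is hit, while `run_model(true)` enters the model scheduler exactly once, because the raised count keeps the tracer's enable path from re-entering.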
2016-03-21 22:23:39 +07:00
|
|
|
#else
|
|
|
|
static inline void preempt_latency_start(int val) { }
|
|
|
|
static inline void preempt_latency_stop(int val) { }
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
2007-07-09 23:51:59 +07:00
|
|
|
* Print scheduling while atomic bug:
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2007-07-09 23:51:59 +07:00
|
|
|
static noinline void __schedule_bug(struct task_struct *prev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-12-23 04:39:30 +07:00
|
|
|
if (oops_in_progress)
|
|
|
|
return;
|
|
|
|
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
|
|
|
|
prev->comm, prev->pid, preempt_count());
|
2007-10-24 23:23:50 +07:00
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
debug_show_held_locks(prev);
|
2008-05-23 23:05:58 +07:00
|
|
|
print_modules();
|
2007-07-09 23:51:59 +07:00
|
|
|
if (irqs_disabled())
|
|
|
|
print_irqtrace_events(prev);
|
2014-02-08 02:58:39 +07:00
|
|
|
#ifdef CONFIG_DEBUG_PREEMPT
|
|
|
|
if (in_atomic_preempt_off()) {
|
|
|
|
pr_err("Preemption disabled at:");
|
|
|
|
print_ip_sym(current->preempt_disable_ip);
|
|
|
|
pr_cont("\n");
|
|
|
|
}
|
|
|
|
#endif
|
2016-06-04 03:10:18 +07:00
|
|
|
if (panic_on_warn)
|
|
|
|
panic("scheduling while atomic\n");
|
|
|
|
|
2012-03-29 07:10:47 +07:00
|
|
|
dump_stack();
|
2013-01-21 13:47:39 +07:00
|
|
|
add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
|
2007-07-09 23:51:59 +07:00
|
|
|
}
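The "scheduling while atomic" message above prints preempt_count() as a raw hex value. As a reading aid, the sketch below splits such a value into fields. The masks are an assumption about the preempt_count layout for kernels of this era (8 preemption bits, 8 softirq bits, 4 hardirq bits, 1 NMI bit); they should be checked against include/linux/preempt.h in the tree being debugged. `MODEL_*`, `struct preempt_fields` and `decode_preempt_count()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout; verify against include/linux/preempt.h. */
#define MODEL_PREEMPT_MASK  0x000000ffu
#define MODEL_SOFTIRQ_MASK  0x0000ff00u
#define MODEL_HARDIRQ_MASK  0x000f0000u
#define MODEL_NMI_MASK      0x00100000u

struct preempt_fields {
    unsigned preempt_depth;  /* nested preempt_disable() calls */
    unsigned softirq_count;  /* raw softirq field */
    unsigned hardirq_count;  /* nested hard interrupts */
    unsigned nmi;            /* 1 when in NMI context */
};

/* Split a raw preempt_count value, as printed by the BUG line, into fields. */
static struct preempt_fields decode_preempt_count(uint32_t pc)
{
    struct preempt_fields f;

    f.preempt_depth = pc & MODEL_PREEMPT_MASK;
    f.softirq_count = (pc & MODEL_SOFTIRQ_MASK) >> 8;
    f.hardirq_count = (pc & MODEL_HARDIRQ_MASK) >> 16;
    f.nmi           = (pc & MODEL_NMI_MASK) >> 20;
    return f;
}
```

Under these assumed masks, a printed value of 0x00000002 would decode to a preemption depth of 2 with no interrupt context, which points at an unbalanced preempt_disable() or a held spinlock rather than at interrupt code.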
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
|
|
|
* Various schedule()-time debugging checks and statistics:
|
|
|
|
*/
|
|
|
|
static inline void schedule_debug(struct task_struct *prev)
|
|
|
|
{
|
2014-09-12 20:16:19 +07:00
|
|
|
#ifdef CONFIG_SCHED_STACK_END_CHECK
|
2016-06-01 16:55:07 +07:00
|
|
|
if (task_stack_end_corrupted(prev))
|
|
|
|
panic("corrupted stack end detected inside scheduler\n");
|
2014-09-12 20:16:19 +07:00
|
|
|
#endif
|
2015-09-28 23:02:03 +07:00
|
|
|
|
2015-09-28 22:57:39 +07:00
|
|
|
if (unlikely(in_atomic_preempt_off())) {
|
2007-07-09 23:51:59 +07:00
|
|
|
__schedule_bug(prev);
|
2015-09-28 22:57:39 +07:00
|
|
|
preempt_count_set(PREEMPT_DISABLED);
|
|
|
|
}
|
2011-05-24 22:31:09 +07:00
|
|
|
rcu_sleep_check();
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
profile_hit(SCHED_PROFILING, __builtin_return_address(0));
|
|
|
|
|
2007-10-15 22:00:12 +07:00
|
|
|
schedstat_inc(this_rq(), sched_count);
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick up the highest-prio task:
|
|
|
|
*/
|
|
|
|
static inline struct task_struct *
|
2015-08-02 00:25:08 +07:00
|
|
|
pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
2014-02-14 18:25:08 +07:00
|
|
|
const struct sched_class *class = &fair_sched_class;
|
2007-07-09 23:51:59 +07:00
|
|
|
struct task_struct *p;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
2007-07-09 23:51:59 +07:00
|
|
|
* Optimization: we know that if all tasks are in
|
|
|
|
* the fair class we can call that function directly:
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2014-02-14 18:25:08 +07:00
|
|
|
if (likely(prev->sched_class == class &&
|
2014-01-24 02:32:21 +07:00
|
|
|
rq->nr_running == rq->cfs.h_nr_running)) {
|
2015-08-02 00:25:08 +07:00
|
|
|
p = fair_sched_class.pick_next_task(rq, prev, cookie);
|
2014-04-24 17:00:47 +07:00
|
|
|
if (unlikely(p == RETRY_TASK))
|
|
|
|
goto again;
|
|
|
|
|
|
|
|
/* assumes fair_sched_class->next == idle_sched_class */
|
|
|
|
if (unlikely(!p))
|
2015-08-02 00:25:08 +07:00
|
|
|
p = idle_sched_class.pick_next_task(rq, prev, cookie);
|
2014-04-24 17:00:47 +07:00
|
|
|
|
|
|
|
return p;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2014-02-14 18:25:08 +07:00
|
|
|
again:
|
2010-09-22 18:53:15 +07:00
|
|
|
for_each_class(class) {
|
2015-08-02 00:25:08 +07:00
|
|
|
p = class->pick_next_task(rq, prev, cookie);
|
2014-02-14 18:25:08 +07:00
|
|
|
if (p) {
|
|
|
|
if (unlikely(p == RETRY_TASK))
|
|
|
|
goto again;
|
2007-07-09 23:51:59 +07:00
|
|
|
return p;
|
2014-02-14 18:25:08 +07:00
|
|
|
}
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
2010-09-22 18:53:15 +07:00
|
|
|
|
|
|
|
BUG(); /* the idle class will always have a runnable task */
|
2007-07-09 23:51:59 +07:00
|
|
|
}
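The shape of pick_next_task() above (a fast path when every runnable task is in the fair class, otherwise a walk over the classes in priority order) can be sketched in user space as follows. This is a simplified model, not the kernel implementation: the RETRY_TASK restart is omitted, task identities are plain integers, and `struct model_rq`, `struct model_class`, `model_pick_next()` and `pick_for()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical run-queue model: only per-class runnable counts. */
struct model_rq {
    int nr_running;       /* all runnable tasks */
    int fair_nr_running;  /* runnable tasks in the fair class */
    int rt_nr_running;    /* runnable tasks in the rt class */
};

struct model_class {
    const char *name;
    /* Returns a task id, or -1 when the class has nothing runnable. */
    int (*pick)(const struct model_rq *rq);
};

static int pick_rt(const struct model_rq *rq)   { return rq->rt_nr_running > 0 ? 100 : -1; }
static int pick_fair(const struct model_rq *rq) { return rq->fair_nr_running > 0 ? 200 : -1; }
static int pick_idle(const struct model_rq *rq) { (void)rq; return 0; }

/* Classes in descending priority; idle is last and always succeeds. */
static const struct model_class model_classes[] = {
    { "rt",   pick_rt },
    { "fair", pick_fair },
    { "idle", pick_idle },
};

static int model_pick_next(const struct model_rq *rq, int prev_is_fair)
{
    /* Fast path: everything runnable is fair, so skip the class walk. */
    if (prev_is_fair && rq->nr_running == rq->fair_nr_running) {
        int id = pick_fair(rq);

        return id >= 0 ? id : pick_idle(rq);
    }

    /* Slow path: the first class with a runnable task wins. */
    for (size_t i = 0; i < sizeof(model_classes) / sizeof(model_classes[0]); i++) {
        int id = model_classes[i].pick(rq);

        if (id >= 0)
            return id;
    }
    return -1;  /* not reached: the idle class always returns a task */
}

/* Convenience wrapper that builds a run queue from per-class counts. */
static int pick_for(int fair, int rt, int prev_is_fair)
{
    struct model_rq rq = {
        .nr_running = fair + rt,
        .fair_nr_running = fair,
        .rt_nr_running = rt,
    };

    return model_pick_next(&rq, prev_is_fair);
}
```

In this model, `pick_for(2, 0, 1)` takes the fast path and returns the fair task, while `pick_for(1, 1, 1)` falls through to the class walk and returns the rt task first.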
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
2011-06-23 00:47:00 +07:00
|
|
|
* __schedule() is the main scheduler function.
|
2012-08-04 15:49:47 +07:00
|
|
|
*
|
|
|
|
* The main means of driving the scheduler and thus entering this function are:
|
|
|
|
*
|
|
|
|
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
|
|
|
|
*
|
|
|
|
* 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
|
|
|
|
* paths. For example, see arch/x86/entry/entry_64.S.
|
|
|
|
*
|
|
|
|
* To drive preemption between tasks, the scheduler sets the flag in timer
|
|
|
|
* interrupt handler scheduler_tick().
|
|
|
|
*
|
|
|
|
* 3. Wakeups don't really cause entry into schedule(). They add a
|
|
|
|
* task to the run-queue and that's it.
|
|
|
|
*
|
|
|
|
* Now, if the new task added to the run-queue preempts the current
|
|
|
|
* task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
|
|
|
|
* called on the nearest possible occasion:
|
|
|
|
*
|
|
|
|
* - If the kernel is preemptible (CONFIG_PREEMPT=y):
|
|
|
|
*
|
|
|
|
* - in syscall or exception context, at the next outermost
|
|
|
|
* preempt_enable(). (this might be as soon as the wake_up()'s
|
|
|
|
* spin_unlock()!)
|
|
|
|
*
|
|
|
|
* - in IRQ context, return from interrupt-handler to
|
|
|
|
* preemptible context
|
|
|
|
*
|
|
|
|
* - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
|
|
|
|
* then at the next:
|
|
|
|
*
|
|
|
|
* - cond_resched() call
|
|
|
|
* - explicit schedule() call
|
|
|
|
* - return from syscall or exception to user-space
|
|
|
|
* - return from interrupt-handler to user-space
|
2015-01-28 07:24:09 +07:00
|
|
|
*
|
2015-05-12 21:41:49 +07:00
|
|
|
* WARNING: must be called with preemption disabled!
|
2007-07-09 23:51:59 +07:00
|
|
|
*/
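Point 3 of the comment above (a wakeup only sets the flag, and on a preemptible kernel the switch happens at the next outermost preempt_enable()) can be sketched with a toy model. Everything here is an illustrative assumption: `struct model_cpu`, the `model_*` helpers and `run_nested_wakeup()` are hypothetical, and "lower number means higher priority" is just a convention chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one CPU; not kernel code. */
struct model_cpu {
    int  preempt_count;  /* nesting depth of preempt_disable() */
    bool need_resched;   /* stands in for TIF_NEED_RESCHED */
    int  curr_prio;      /* lower value = higher priority (assumption) */
    int  switches;       /* context switches performed */
};

static void model_preempt_disable(struct model_cpu *c)
{
    c->preempt_count++;
}

/* Only the outermost enable (count reaching zero) acts on the flag. */
static void model_preempt_enable(struct model_cpu *c)
{
    if (--c->preempt_count == 0 && c->need_resched) {
        c->need_resched = false;
        c->switches++;
    }
}

/* A wakeup never switches by itself; it only requests a reschedule. */
static void model_wake_up(struct model_cpu *c, int woken_prio)
{
    if (woken_prio < c->curr_prio)
        c->need_resched = true;
}

/*
 * Disable preemption twice, wake a task of the given priority, then
 * re-enable preemption `enables` times; returns the switch count.
 */
static int run_nested_wakeup(int woken_prio, int enables)
{
    struct model_cpu c = { .curr_prio = 5 };

    model_preempt_disable(&c);
    model_preempt_disable(&c);
    model_wake_up(&c, woken_prio);
    for (int i = 0; i < enables; i++)
        model_preempt_enable(&c);
    return c.switches;
}
```

With a higher-priority wakeup, the inner enable leaves the switch count at zero and only the second (outermost) enable performs the switch; a lower-priority wakeup never sets the flag, so no switch occurs at all.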
|
sched/core: More notrace annotations
preempt_schedule_common() is marked notrace, but it does not use
_notrace() preempt_count functions and __schedule() is also not marked
notrace, which means that it's perfectly possible to end up in the
tracer from preempt_schedule_common().
Steve says:
| Yep, there's some history to this. This was originally the issue that
| caused function tracing to go into infinite recursion. But now we have
| preempt_schedule_notrace(), which is used by the function tracer, and
| that function must not be traced till preemption is disabled.
|
| Now if function tracing is running and we take an interrupt when
| NEED_RESCHED is set, it calls
|
| preempt_schedule_common() (not traced)
|
| But then that calls preempt_disable() (traced)
|
| function tracer calls preempt_disable_notrace() followed by
| preempt_enable_notrace() which will see NEED_RESCHED set, and it will
| call preempt_schedule_notrace(), which stops the recursion, but
| still calls __schedule() here, and that means when we return, we call
| the __schedule() from preempt_schedule_common().
|
| That said, I prefer this patch. Preemption is disabled before calling
| __schedule(), and we get rid of a one round recursion with the
| scheduler.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-28 23:52:36 +07:00
|
|
|
static void __sched notrace __schedule(bool preempt)
|
2007-07-09 23:51:59 +07:00
|
|
|
{
|
|
|
|
struct task_struct *prev, *next;
|
2008-02-16 00:56:36 +07:00
|
|
|
unsigned long *switch_count;
|
2015-08-02 00:25:08 +07:00
|
|
|
struct pin_cookie cookie;
|
2007-07-09 23:51:59 +07:00
|
|
|
struct rq *rq;
|
2008-07-18 23:01:23 +07:00
|
|
|
int cpu;
|
2007-07-09 23:51:59 +07:00
|
|
|
|
|
|
|
cpu = smp_processor_id();
|
|
|
|
rq = cpu_rq(cpu);
|
|
|
|
prev = rq->curr;
|
|
|
|
|
2015-09-28 23:02:03 +07:00
|
|
|
/*
|
|
|
|
* do_exit() calls schedule() with preemption disabled as an exception;
|
|
|
|
* however we must fix that up, otherwise the next task will see an
|
|
|
|
* inconsistent (higher) preempt count.
|
|
|
|
*
|
|
|
|
* It also avoids the below schedule_debug() test from complaining
|
|
|
|
* about this.
|
|
|
|
*/
|
|
|
|
if (unlikely(prev->state == TASK_DEAD))
|
|
|
|
preempt_enable_no_resched_notrace();
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
schedule_debug(prev);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-07-18 23:01:23 +07:00
|
|
|
if (sched_feat(HRTICK))
|
2008-05-13 02:20:55 +07:00
|
|
|
hrtick_clear(rq);
|
2008-01-26 03:08:29 +07:00
|
|
|
|
2015-10-07 23:10:48 +07:00
|
|
|
local_irq_disable();
|
|
|
|
rcu_note_context_switch();
|
|
|
|
|
2013-08-12 23:14:00 +07:00
|
|
|
/*
|
|
|
|
* Make sure that signal_pending_state()->signal_pending() below
|
|
|
|
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
|
|
|
|
* done by the caller to avoid the race with signal_wake_up().
|
|
|
|
*/
|
|
|
|
smp_mb__before_spinlock();
|
2015-10-07 23:10:48 +07:00
|
|
|
raw_spin_lock(&rq->lock);
|
2015-08-02 00:25:08 +07:00
|
|
|
cookie = lockdep_pin_lock(&rq->lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-01-05 17:18:11 +07:00
|
|
|
rq->clock_skip_update <<= 1; /* promote REQ to ACT */
|
|
|
|
|
2010-05-19 19:57:11 +07:00
|
|
|
switch_count = &prev->nivcsw;
|
2015-09-28 23:05:34 +07:00
|
|
|
if (!preempt && prev->state) {
|
2010-06-09 02:40:37 +07:00
|
|
|
if (unlikely(signal_pending_state(prev->state, prev))) {
|
2005-04-17 05:20:36 +07:00
|
|
|
prev->state = TASK_RUNNING;
|
2010-06-09 02:40:37 +07:00
|
|
|
} else {
|
2011-04-05 22:23:50 +07:00
|
|
|
deactivate_task(rq, prev, DEQUEUE_SLEEP);
|
|
|
|
prev->on_rq = 0;
|
|
|
|
|
2010-06-09 02:40:37 +07:00
|
|
|
/*
|
2011-04-05 22:23:50 +07:00
|
|
|
* If a worker went to sleep, notify and ask workqueue
|
|
|
|
* whether it wants to wake up a task to maintain
|
|
|
|
* concurrency.
|
2010-06-09 02:40:37 +07:00
|
|
|
*/
|
|
|
|
if (prev->flags & PF_WQ_WORKER) {
|
|
|
|
struct task_struct *to_wakeup;
|
|
|
|
|
2016-03-02 18:53:31 +07:00
|
|
|
to_wakeup = wq_worker_sleeping(prev);
|
2010-06-09 02:40:37 +07:00
|
|
|
if (to_wakeup)
|
2015-08-02 00:25:08 +07:00
|
|
|
try_to_wake_up_local(to_wakeup, cookie);
|
2010-06-09 02:40:37 +07:00
|
|
|
}
|
|
|
|
}
|
2007-07-09 23:51:59 +07:00
|
|
|
switch_count = &prev->nvcsw;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2015-01-05 17:18:11 +07:00
|
|
|
if (task_on_rq_queued(prev))
|
2012-02-11 12:05:00 +07:00
|
|
|
update_rq_clock(rq);
|
|
|
|
|
2015-08-02 00:25:08 +07:00
|
|
|
next = pick_next_task(rq, prev, cookie);
|
2010-12-08 17:05:42 +07:00
|
|
|
clear_tsk_need_resched(prev);
|
2013-08-14 19:55:31 +07:00
|
|
|
clear_preempt_need_resched();
|
2015-01-05 17:18:11 +07:00
|
|
|
rq->clock_skip_update = 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
if (likely(prev != next)) {
|
|
|
|
rq->nr_switches++;
|
|
|
|
rq->curr = next;
|
|
|
|
++*switch_count;
|
|
|
|
|
2015-09-28 23:06:56 +07:00
|
|
|
trace_sched_switch(preempt, prev, next);
|
2015-08-02 00:25:08 +07:00
|
|
|
rq = context_switch(rq, prev, next, cookie); /* unlocks the rq */
|
2015-06-11 19:46:54 +07:00
|
|
|
} else {
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, cookie);
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_unlock_irq(&rq->lock);
|
2015-06-11 19:46:54 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-06-11 19:46:37 +07:00
|
|
|
balance_callback(rq);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2016-02-29 11:22:38 +07:00
|
|
|
STACK_FRAME_NON_STANDARD(__schedule); /* switch_to() */
|
2011-06-23 00:47:00 +07:00
|
|
|
|
2011-06-23 00:47:01 +07:00
|
|
|
static inline void sched_submit_work(struct task_struct *tsk)
|
|
|
|
{
|
2011-07-18 01:46:52 +07:00
|
|
|
if (!tsk->state || tsk_is_pi_blocked(tsk))
|
2011-06-23 00:47:01 +07:00
|
|
|
return;
|
|
|
|
/*
|
|
|
|
* If we are going to sleep and we have plugged IO queued,
|
|
|
|
* make sure to submit it to avoid deadlocks.
|
|
|
|
*/
|
|
|
|
if (blk_needs_flush_plug(tsk))
|
|
|
|
blk_schedule_flush_plug(tsk);
|
|
|
|
}
|
|
|
|
|
2014-05-02 05:44:38 +07:00
|
|
|
asmlinkage __visible void __sched schedule(void)
|
2011-06-23 00:47:00 +07:00
|
|
|
{
|
2011-06-23 00:47:01 +07:00
|
|
|
struct task_struct *tsk = current;
|
|
|
|
|
|
|
|
sched_submit_work(tsk);
|
2015-01-28 07:24:09 +07:00
|
|
|
do {
|
2015-05-12 21:41:49 +07:00
|
|
|
preempt_disable();
|
2015-09-28 23:05:34 +07:00
|
|
|
__schedule(false);
|
2015-05-12 21:41:49 +07:00
|
|
|
sched_preempt_enable_no_resched();
|
2015-01-28 07:24:09 +07:00
|
|
|
} while (need_resched());
|
2011-06-23 00:47:00 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
EXPORT_SYMBOL(schedule);
|
|
|
|
|
2012-11-28 01:33:25 +07:00
|
|
|
#ifdef CONFIG_CONTEXT_TRACKING
|
2014-05-02 05:44:38 +07:00
|
|
|
asmlinkage __visible void __sched schedule_user(void)
|
2012-07-12 01:26:37 +07:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If we come here after a random call to set_need_resched(),
|
|
|
|
* or we have been woken up remotely but the IPI has not yet arrived,
|
|
|
|
* we haven't yet exited the RCU idle mode. Do it here manually until
|
|
|
|
* we find a better solution.
|
2014-12-04 06:37:08 +07:00
|
|
|
*
|
|
|
|
* NB: There are buggy callers of this function. Ideally we
|
2015-03-05 00:06:33 +07:00
|
|
|
* should warn if prev_state != CONTEXT_USER, but that will trigger
|
2014-12-04 06:37:08 +07:00
|
|
|
* too frequently to make sense yet.
|
2012-07-12 01:26:37 +07:00
|
|
|
*/
|
2014-12-04 06:37:08 +07:00
|
|
|
enum ctx_state prev_state = exception_enter();
|
2012-07-12 01:26:37 +07:00
|
|
|
schedule();
|
2014-12-04 06:37:08 +07:00
|
|
|
exception_exit(prev_state);
|
2012-07-12 01:26:37 +07:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2011-03-21 18:09:35 +07:00
|
|
|
/**
|
|
|
|
* schedule_preempt_disabled - called with preemption disabled
|
|
|
|
*
|
|
|
|
* Returns with preemption disabled. Note: preempt_count must be 1
|
|
|
|
*/
|
|
|
|
void __sched schedule_preempt_disabled(void)
|
|
|
|
{
|
2011-03-21 19:32:17 +07:00
|
|
|
sched_preempt_enable_no_resched();
|
2011-03-21 18:09:35 +07:00
|
|
|
schedule();
|
|
|
|
preempt_disable();
|
|
|
|
}
|
|
|
|
|
2015-02-17 01:20:07 +07:00
|
|
|
static void __sched notrace preempt_schedule_common(void)
|
2015-01-23 00:08:04 +07:00
|
|
|
{
|
|
|
|
do {
|
2016-03-21 22:23:39 +07:00
|
|
|
/*
|
|
|
|
* Because the function tracer can trace preempt_count_sub()
|
|
|
|
* and it also uses preempt_enable/disable_notrace(), if
|
|
|
|
* NEED_RESCHED is set, the preempt_enable_notrace() called
|
|
|
|
* by the function tracer will call this function again and
|
|
|
|
* cause infinite recursion.
|
|
|
|
*
|
|
|
|
* Preemption must be disabled here before the function
|
|
|
|
* tracer can trace. Break up preempt_disable() into two
|
|
|
|
* calls. One to disable preemption without fear of being
|
|
|
|
* traced. The other to still record the preemption latency,
|
|
|
|
* which can also be traced by the function tracer.
|
|
|
|
*/
|
2015-09-28 23:52:36 +07:00
|
|
|
preempt_disable_notrace();
|
2016-03-21 22:23:39 +07:00
|
|
|
preempt_latency_start(1);
|
2015-09-28 23:05:34 +07:00
|
|
|
__schedule(true);
|
sched/core: Add preempt checks in preempt_schedule() code
While testing the tracer preemptoff, I hit this strange trace:
<...>-259 0...1 0us : schedule <-worker_thread
<...>-259 0d..1 0us : rcu_note_context_switch <-__schedule
<...>-259 0d..1 0us : rcu_sched_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : _raw_spin_lock <-__schedule
<...>-259 0d..1 0us : preempt_count_add <-_raw_spin_lock
<...>-259 0d..2 0us : do_raw_spin_lock <-_raw_spin_lock
<...>-259 0d..2 1us : deactivate_task <-__schedule
<...>-259 0d..2 1us : update_rq_clock.part.84 <-deactivate_task
<...>-259 0d..2 1us : dequeue_task_fair <-deactivate_task
<...>-259 0d..2 1us : dequeue_entity <-dequeue_task_fair
<...>-259 0d..2 1us : update_curr <-dequeue_entity
<...>-259 0d..2 1us : update_min_vruntime <-update_curr
<...>-259 0d..2 1us : cpuacct_charge <-update_curr
<...>-259 0d..2 1us : __rcu_read_lock <-cpuacct_charge
<...>-259 0d..2 1us : __rcu_read_unlock <-cpuacct_charge
<...>-259 0d..2 1us : clear_buddies <-dequeue_entity
<...>-259 0d..2 1us : account_entity_dequeue <-dequeue_entity
<...>-259 0d..2 2us : update_min_vruntime <-dequeue_entity
<...>-259 0d..2 2us : update_cfs_shares <-dequeue_entity
<...>-259 0d..2 2us : hrtick_update <-dequeue_task_fair
<...>-259 0d..2 2us : wq_worker_sleeping <-__schedule
<...>-259 0d..2 2us : kthread_data <-wq_worker_sleeping
<...>-259 0d..2 2us : pick_next_task_fair <-__schedule
<...>-259 0d..2 2us : check_cfs_rq_runtime <-pick_next_task_fair
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : set_next_entity <-pick_next_task_fair
<...>-259 0d..2 3us : put_prev_entity <-pick_next_task_fair
<...>-259 0d..2 3us : check_cfs_rq_runtime <-put_prev_entity
<...>-259 0d..2 3us : set_next_entity <-pick_next_task_fair
gnome-sh-1031 0d..2 3us : finish_task_switch <-__schedule
gnome-sh-1031 0d..2 3us : _raw_spin_unlock_irq <-finish_task_switch
gnome-sh-1031 0d..2 3us : do_raw_spin_unlock <-_raw_spin_unlock_irq
gnome-sh-1031 0...2 3us!: preempt_count_sub <-_raw_spin_unlock_irq
gnome-sh-1031 0...1 582us : do_raw_spin_lock <-_raw_spin_lock
gnome-sh-1031 0...1 583us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 583us : do_raw_spin_unlock <-_raw_spin_unlock
gnome-sh-1031 0...1 583us : preempt_count_sub <-_raw_spin_unlock
gnome-sh-1031 0...1 584us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 584us+: trace_preempt_on <-drm_gem_object_lookup
gnome-sh-1031 0...1 603us : <stack trace>
=> preempt_count_sub
=> _raw_spin_unlock
=> drm_gem_object_lookup
=> i915_gem_madvise_ioctl
=> drm_ioctl
=> do_vfs_ioctl
=> SyS_ioctl
=> entry_SYSCALL_64_fastpath
As I'm tracing preemption disabled, it seemed incorrect that the trace
would go across a schedule and report not being in the scheduler.
Looking into this I discovered the problem.
schedule() calls preempt_disable() but the preempt_schedule() calls
preempt_enable_notrace(). What happened above was that the gnome-shell
task was preempted on another CPU, migrated over to the idle cpu. The
tracer started with idle calling schedule(), which called
preempt_disable(), but then gnome-shell finished, and it enabled
preemption with preempt_enable_notrace(), which does not stop the trace,
even though preemption was enabled.
The purpose of the preempt_disable_notrace() in the preempt_schedule()
is to prevent function tracing from going into an infinite loop,
because function tracing can itself trace the preempt_enable/disable()
calls. The problem with function tracing is:
  NEED_RESCHED set
   preempt_schedule()
    preempt_disable()
     preempt_count_inc()
      function trace (before incrementing preempt count)
       preempt_disable_notrace()
       preempt_enable_notrace()
        sees NEED_RESCHED set
         preempt_schedule() (repeat)
Now by breaking out the preempt off/on tracing into their own code:
preempt_disable_check() and preempt_enable_check(), we can add these to
the preempt_schedule() code. As preemption would then be disabled, even
if they were to be traced by the function tracer, the disabled
preemption would prevent the recursion.
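The recursion described above can be reproduced in a small userspace model. This is only a sketch under stated assumptions: every name in it (struct preempt_sim, sim_function_trace(), sim_run(), SIM_MAX_DEPTH) is hypothetical and none of it is kernel code. The depth cut-off exists only so that the unsafe ordering terminates. The model compares the two orderings: tracing before the preempt count is raised, and raising the count untraced before anything traceable runs.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_DEPTH 8 /* artificial cut-off so the unsafe variant terminates */

struct preempt_sim {
	int preempt_count;          /* 0 means preemptible */
	bool need_resched;          /* models NEED_RESCHED */
	bool disable_notrace_first; /* true: the fixed ordering */
	int depth;                  /* current nesting of sim_preempt_schedule() */
	int max_depth;              /* deepest nesting observed */
};

static void sim_preempt_schedule(struct preempt_sim *s);

/* Models preempt_enable_notrace(): drop the count, reschedule if allowed. */
static void sim_preempt_enable_notrace(struct preempt_sim *s)
{
	s->preempt_count--;
	if (s->preempt_count == 0 && s->need_resched)
		sim_preempt_schedule(s);
}

/* Models the function tracer's own disable/enable pair around an event. */
static void sim_function_trace(struct preempt_sim *s)
{
	s->preempt_count++; /* stands in for preempt_disable_notrace() */
	sim_preempt_enable_notrace(s);
}

static void sim_preempt_schedule(struct preempt_sim *s)
{
	if (s->preempt_count != 0 || s->depth >= SIM_MAX_DEPTH)
		return;
	s->depth++;
	if (s->depth > s->max_depth)
		s->max_depth = s->depth;

	if (s->disable_notrace_first) {
		s->preempt_count++;    /* untraced disable comes first */
		sim_function_trace(s); /* a traced hook is now harmless */
	} else {
		sim_function_trace(s); /* tracer fires before the increment */
		s->preempt_count++;
	}

	s->need_resched = false; /* stands in for __schedule() */
	s->preempt_count--;      /* enable without a resched check */
	s->depth--;
}

/* Returns the deepest nesting of sim_preempt_schedule() that was reached. */
int sim_run(bool disable_notrace_first)
{
	struct preempt_sim s = { 0, true, disable_notrace_first, 0, 0 };

	sim_preempt_schedule(&s);
	assert(s.preempt_count == 0);
	return s.max_depth;
}
```

Calling sim_run(false) nests until the cut-off is hit, while sim_run(true) never nests beyond the first call, which is the property the commit relies on.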
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 22:23:39 +07:00
		preempt_latency_stop(1);
sched/core: More notrace annotations
preempt_schedule_common() is marked notrace, but it does not use
_notrace() preempt_count functions and __schedule() is also not marked
notrace, which means that it's perfectly possible to end up in the
tracer from preempt_schedule_common().
Steve says:
| Yep, there's some history to this. This was originally the issue that
| caused function tracing to go into infinite recursion. But now we have
| preempt_schedule_notrace(), which is used by the function tracer, and
| that function must not be traced till preemption is disabled.
|
| Now if function tracing is running and we take an interrupt when
| NEED_RESCHED is set, it calls
|
| preempt_schedule_common() (not traced)
|
| But then that calls preempt_disable() (traced)
|
| function tracer calls preempt_disable_notrace() followed by
| preempt_enable_notrace() which will see NEED_RESCHED set, and it will
| call preempt_schedule_notrace(), which stops the recursion, but
| still calls __schedule() here, and that means when we return, we call
| the __schedule() from preempt_schedule_common().
|
| That said, I prefer this patch. Preemption is disabled before calling
| __schedule(), and we get rid of a one round recursion with the
| scheduler.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-28 23:52:36 +07:00
		preempt_enable_no_resched_notrace();
2015-01-23 00:08:04 +07:00

		/*
		 * Check again in case we missed a preemption opportunity
		 * between schedule and now.
		 */
	} while (need_resched());
}

2005-04-17 05:20:36 +07:00
#ifdef CONFIG_PREEMPT
/*
2006-07-10 18:43:52 +07:00
 * this is the entry point to schedule() from in-kernel preemption
2007-12-05 21:46:09 +07:00
 * off of preempt_enable. Kernel preemptions off return from interrupt
2005-04-17 05:20:36 +07:00
 * occur there and call schedule directly.
 */
2014-05-02 05:44:38 +07:00
asmlinkage __visible void __sched notrace preempt_schedule(void)
2005-04-17 05:20:36 +07:00
{
	/*
	 * If there is a non-zero preempt_count or interrupts are disabled,
2007-12-05 21:46:09 +07:00
	 * we do not want to preempt the current task. Just return..
2005-04-17 05:20:36 +07:00
	 */
2013-06-20 04:56:22 +07:00
	if (likely(!preemptible()))
2005-04-17 05:20:36 +07:00
		return;

2015-01-23 00:08:04 +07:00
	preempt_schedule_common();
2005-04-17 05:20:36 +07:00
}
2014-04-17 15:17:05 +07:00
NOKPROBE_SYMBOL(preempt_schedule);
2005-04-17 05:20:36 +07:00
EXPORT_SYMBOL(preempt_schedule);
2014-10-06 03:23:22 +07:00

/**
2015-06-04 22:39:08 +07:00
 * preempt_schedule_notrace - preempt_schedule called by tracing
2014-10-06 03:23:22 +07:00
 *
 * The tracing infrastructure uses preempt_enable_notrace to prevent
 * recursion and tracing preempt enabling caused by the tracing
 * infrastructure itself. But as tracing can happen in areas coming
 * from userspace or just about to enter userspace, a preempt enable
 * can occur before user_exit() is called. This will cause the scheduler
 * to be called when the system is still in usermode.
 *
 * To prevent this, the preempt_enable_notrace will use this function
 * instead of preempt_schedule() to exit user context if needed before
 * calling the scheduler.
 */
2015-06-04 22:39:08 +07:00
asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
2014-10-06 03:23:22 +07:00
{
	enum ctx_state prev_ctx;

	if (likely(!preemptible()))
		return;

	do {
2016-03-21 22:23:39 +07:00
		/*
		 * Because the function tracer can trace preempt_count_sub()
		 * and it also uses preempt_enable/disable_notrace(), if
		 * NEED_RESCHED is set, the preempt_enable_notrace() called
		 * by the function tracer will call this function again and
		 * cause infinite recursion.
		 *
		 * Preemption must be disabled here before the function
		 * tracer can trace. Break up preempt_disable() into two
		 * calls. One to disable preemption without fear of being
		 * traced. The other to still record the preemption latency,
		 * which can also be traced by the function tracer.
		 */
2015-09-28 23:09:19 +07:00
		preempt_disable_notrace();
2016-03-21 22:23:39 +07:00
		preempt_latency_start(1);
2014-10-06 03:23:22 +07:00
		/*
		 * Needs preempt disabled in case user_exit() is traced
		 * and the tracer calls preempt_enable_notrace() causing
		 * an infinite recursion.
		 */
		prev_ctx = exception_enter();
2015-09-28 23:05:34 +07:00
		__schedule(true);
2014-10-06 03:23:22 +07:00
		exception_exit(prev_ctx);

2016-03-21 22:23:39 +07:00
		preempt_latency_stop(1);
2015-09-28 23:09:19 +07:00
		preempt_enable_no_resched_notrace();
2014-10-06 03:23:22 +07:00
	} while (need_resched());
}
2015-06-04 22:39:08 +07:00
EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
2014-10-06 03:23:22 +07:00

2013-11-21 18:41:44 +07:00
#endif /* CONFIG_PREEMPT */
2005-04-17 05:20:36 +07:00

/*
2006-07-10 18:43:52 +07:00
 * this is the entry point to schedule() from kernel preemption
2005-04-17 05:20:36 +07:00
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
2014-05-02 05:44:38 +07:00
asmlinkage __visible void __sched preempt_schedule_irq(void)
2005-04-17 05:20:36 +07:00
{
2013-02-24 18:59:30 +07:00
	enum ctx_state prev_state;
2008-01-26 03:08:33 +07:00

2006-07-10 18:43:52 +07:00
	/* Catch callers which need to be fixed */
2013-08-14 19:55:31 +07:00
	BUG_ON(preempt_count() || !irqs_disabled());
2005-04-17 05:20:36 +07:00

2013-02-24 18:59:30 +07:00
	prev_state = exception_enter();

2007-10-15 22:00:14 +07:00
	do {
2015-09-28 23:09:19 +07:00
		preempt_disable();
2007-10-15 22:00:14 +07:00
		local_irq_enable();
2015-09-28 23:05:34 +07:00
		__schedule(true);
2007-10-15 22:00:14 +07:00
		local_irq_disable();
2015-09-28 23:09:19 +07:00
		sched_preempt_enable_no_resched();
2009-03-06 18:40:20 +07:00
	} while (need_resched());
2013-02-24 18:59:30 +07:00

	exception_exit(prev_state);
2005-04-17 05:20:36 +07:00
}
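
The do/while loop in preempt_schedule_irq() re-checks need_resched() because the count is dropped without a reschedule check, so a request that arrives while the scheduler is running would otherwise be missed. The following userspace model is a sketch of that retry pattern only. The struct cpu_sim type, its late_wakeups field and the sim_* functions are assumptions made for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_sim {
	bool need_resched;  /* models NEED_RESCHED */
	int late_wakeups;   /* requests that will arrive during a schedule call */
	int schedule_calls; /* how many times the scheduler model ran */
	int preempt_count;  /* 0 means preemptible */
};

/* Stand-in for __schedule(): consumes the request, but a new one may arrive. */
static void sim_schedule(struct cpu_sim *c)
{
	c->schedule_calls++;
	c->need_resched = false;
	if (c->late_wakeups > 0) {
		c->late_wakeups--;
		c->need_resched = true; /* arrived while we were scheduling */
	}
}

/* Same loop shape as the function above; returns the number of passes. */
int sim_preempt_schedule_irq(struct cpu_sim *c)
{
	assert(c->preempt_count == 0);

	do {
		c->preempt_count++;  /* disable preemption */
		sim_schedule(c);
		c->preempt_count--;  /* enable, but do not reschedule here */
	} while (c->need_resched);   /* pick up anything that arrived late */

	return c->schedule_calls;
}
```

With two late requests queued, the loop runs three times and leaves need_resched clear; a single unconditional call would have returned with a request still pending.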

2009-09-16 00:14:42 +07:00
int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
2005-09-10 14:26:11 +07:00
			  void *key)
2005-04-17 05:20:36 +07:00
{
2009-09-16 00:14:42 +07:00
	return try_to_wake_up(curr->private, mode, wake_flags);
2005-04-17 05:20:36 +07:00
}
EXPORT_SYMBOL(default_wake_function);

2006-06-27 16:54:51 +07:00
#ifdef CONFIG_RT_MUTEXES

/*
 * rt_mutex_setprio - set the current priority of a task
 * @p: task
 * @prio: prio value (kernel-internal form)
 *
 * This function changes the 'effective' priority of a task. It does
 * not touch ->normal_prio like __setscheduler().
 *
2014-02-08 02:58:42 +07:00
 * Used by the rt_mutex code to implement priority inheritance
 * logic. Call site only calls if the priority of the task changed.
2006-06-27 16:54:51 +07:00
 */
2006-07-03 14:25:41 +07:00
void rt_mutex_setprio(struct task_struct *p, int prio)
2006-06-27 16:54:51 +07:00
{
2016-01-18 21:27:07 +07:00
	int oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE;
2010-02-17 15:05:48 +07:00
	const struct sched_class *prev_class;
2015-08-01 02:28:18 +07:00
	struct rq_flags rf;
	struct rq *rq;
2006-06-27 16:54:51 +07:00

sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
The core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that
each task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
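The EDF and CBS rules described above can be sketched in a few lines of userspace C. This is a minimal illustration, not the code in sched/dl.c: the struct dl_sketch_task type and all sketch_* functions are hypothetical names. The budget handling is a simplified textbook form of CBS that only refills the budget and postpones the deadline. It does not model throttling, and any overrun beyond the remaining budget is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dl_sketch_task {
	uint64_t runtime;      /* budget granted per instance */
	uint64_t rel_deadline; /* relative deadline */
	uint64_t abs_deadline; /* current absolute deadline */
	uint64_t remaining;    /* budget left in the current instance */
};

/* Wraparound-safe "a is earlier than b". */
static int sketch_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* A new instance activates at `now`: absolute = activation + relative. */
void sketch_activate(struct dl_sketch_task *t, uint64_t now)
{
	t->abs_deadline = now + t->rel_deadline;
	t->remaining = t->runtime;
}

/* EDF: return the task with the earliest absolute deadline, NULL if n == 0. */
struct dl_sketch_task *sketch_edf_pick(struct dl_sketch_task *tasks, size_t n)
{
	struct dl_sketch_task *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!best || sketch_time_before(tasks[i].abs_deadline,
						best->abs_deadline))
			best = &tasks[i];
	}
	return best;
}

/*
 * CBS-style accounting: charge `delta` units of execution. When the budget
 * is used up, refill it and push the deadline back, so the task loses its
 * EDF priority to the others. Returns 1 if the budget was exhausted.
 */
int sketch_cbs_charge(struct dl_sketch_task *t, uint64_t delta)
{
	if (delta < t->remaining) {
		t->remaining -= delta;
		return 0;
	}
	t->remaining = t->runtime;
	t->abs_deadline += t->rel_deadline;
	return 1;
}
```

For two tasks activated at time 0 with relative deadlines 8 and 5, sketch_edf_pick() selects the second; once that task exhausts its budget its deadline moves from 5 to 10 and the first task becomes the EDF choice, which is the isolation effect the text attributes to the CBS.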
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
	BUG_ON(prio > MAX_PRIO);
2006-06-27 16:54:51 +07:00

2015-08-01 02:28:18 +07:00
	rq = __task_rq_lock(p, &rf);
2006-06-27 16:54:51 +07:00

2011-06-07 01:07:38 +07:00
	/*
	 * Idle task boosting is a nono in general. There is one
	 * exception, when PREEMPT_RT and NOHZ is active:
	 *
	 * The idle task calls get_next_timer_interrupt() and holds
	 * the timer wheel base->lock on the CPU and another CPU wants
	 * to access the timer (probably to cancel it). We can safely
	 * ignore the boosting request, as the idle CPU runs this code
	 * with interrupts disabled and will complete the lock
	 * protected section without being interrupted. So there is no
	 * real need to boost.
	 */
	if (unlikely(p == rq->idle)) {
		WARN_ON(p != rq->curr);
		WARN_ON(p->pi_blocked_on);
		goto out_unlock;
	}

2010-09-21 02:13:34 +07:00
	trace_sched_pi_setprio(p, prio);
2007-05-09 10:27:06 +07:00
	oldprio = p->prio;
2016-01-18 21:27:07 +07:00

	if (oldprio == prio)
		queue_flag &= ~DEQUEUE_MOVE;

2010-02-17 15:05:48 +07:00
	prev_class = p->sched_class;
2014-08-20 16:47:32 +07:00
	queued = task_on_rq_queued(p);
2007-12-18 21:21:13 +07:00
	running = task_current(rq, p);
2014-08-20 16:47:32 +07:00
	if (queued)
2016-01-18 21:27:07 +07:00
		dequeue_task(rq, p, queue_flag);
2008-03-11 01:01:20 +07:00
	if (running)
2014-09-12 20:41:40 +07:00
		put_prev_task(rq, p);
2007-07-09 23:51:59 +07:00

sched/deadline: Add SCHED_DEADLINE inheritance logic
Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-code is needed, raising anything but trivial issues, that
need (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).
This is under development; in the meantime, as a temporary solution,
what this commit does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and its
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for those replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with earliest deadline.
Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.
We would stress the fact that this is only a surely needed, all but
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc.. are welcome! :-)
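The two rules listed in this commit message can be illustrated with a short userspace sketch. Everything here is an assumption made for illustration: struct pi_sketch_task, sketch_top_waiter() and sketch_on_budget_exhausted() are hypothetical names, and postponing the deadline by the waiter's relative deadline is one plausible reading of "its deadline is postponed", not a statement of what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pi_sketch_params {
	uint64_t runtime;      /* default runtime */
	uint64_t rel_deadline; /* relative deadline */
};

struct pi_sketch_task {
	struct pi_sketch_params own; /* the task's own parameters */
	uint64_t abs_deadline;       /* current absolute deadline */
	uint64_t remaining;          /* budget left */
	bool boosted;                /* holds a pi-lock that has waiters */
};

/* Waiter with the earliest absolute deadline, or NULL if there are none. */
const struct pi_sketch_task *
sketch_top_waiter(const struct pi_sketch_task *w, size_t n)
{
	const struct pi_sketch_task *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!best ||
		    (int64_t)(w[i].abs_deadline - best->abs_deadline) < 0)
			best = &w[i];
	}
	return best;
}

/*
 * Called when `owner` has used up its budget. A boosted owner is not
 * throttled: it is refilled at once using the top waiter's parameters.
 * Returns true if the task ends up throttled.
 */
bool sketch_on_budget_exhausted(struct pi_sketch_task *owner,
				const struct pi_sketch_task *top_waiter)
{
	if (owner->boosted && top_waiter) {
		owner->remaining = top_waiter->own.runtime;
		owner->abs_deadline += top_waiter->own.rel_deadline;
		return false;
	}
	owner->remaining = 0;
	return true;
}
```

The design point shown is that the replenishment borrows the waiter's runtime and relative deadline rather than the owner's own, so the lock owner runs with the urgency of the task it is blocking for as long as it holds the lock.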
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:44 +07:00
	/*
	 * Boosting condition are:
	 * 1. -rt task is running and holds mutex A
	 *      --> -dl task blocks on mutex A
	 *
	 * 2. -dl task is running and holds mutex A
	 *      --> -dl task blocks on mutex A and could preempt the
	 *          running task
	 */
	if (dl_prio(prio)) {
2014-06-06 23:52:06 +07:00
		struct task_struct *pi_task = rt_mutex_get_top_task(p);
		if (!dl_prio(p->normal_prio) ||
		    (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
2013-11-07 20:43:44 +07:00
|
|
|
p->dl.dl_boosted = 1;
|
2016-01-18 21:27:07 +07:00
|
|
|
queue_flag |= ENQUEUE_REPLENISH;
|
2013-11-07 20:43:44 +07:00
|
|
|
} else
|
|
|
|
p->dl.dl_boosted = 0;
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for the
SCHED_DEADLINE implementation.
The core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking whether a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from one another.
The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval within which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant at which a task (or rather,
an instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime in every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with
the computational model sketched above can also use the new
policy effectively.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
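The EDF and CBS rules described in the commit message above can be sketched in user space as follows. This is only an illustrative model under stated assumptions: `struct dl_task_model`, `dl_activate()`, `edf_pick()` and `cbs_charge()` are hypothetical names, not the kernel's data structures or functions, and throttling is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a -deadline task (not a kernel structure). */
struct dl_task_model {
	uint64_t runtime_ns;      /* budget granted to each instance */
	uint64_t rel_deadline_ns; /* relative deadline */
	uint64_t abs_deadline_ns; /* activation instant + relative deadline */
	uint64_t remaining_ns;    /* budget left in the current instance */
	int throttled;            /* set once the budget is exhausted */
};

/* Wraparound-safe "a is earlier than b" comparison on 64-bit clocks. */
static int deadline_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* New instance: refill the budget and compute the absolute deadline. */
static void dl_activate(struct dl_task_model *t, uint64_t now_ns)
{
	assert(t->runtime_ns <= t->rel_deadline_ns);
	t->abs_deadline_ns = now_ns + t->rel_deadline_ns;
	t->remaining_ns = t->runtime_ns;
	t->throttled = 0;
}

/* EDF: index of the unthrottled task with the earliest absolute deadline, or -1. */
static int edf_pick(const struct dl_task_model *tasks, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].throttled)
			continue;
		if (best < 0 || deadline_before(tasks[i].abs_deadline_ns,
						tasks[best].abs_deadline_ns))
			best = (int)i;
	}
	return best;
}

/*
 * CBS-style accounting: charge ran_ns of execution time to the task.
 * Returns 1 (and marks the task throttled) when the budget is used up,
 * so the task cannot run for more than its runtime per instance.
 */
static int cbs_charge(struct dl_task_model *t, uint64_t ran_ns)
{
	if (ran_ns < t->remaining_ns) {
		t->remaining_ns -= ran_ns;
		return 0;
	}
	t->remaining_ns = 0;
	t->throttled = 1;
	return 1;
}
```

A caller would activate each task at its arrival time, run whichever task `edf_pick()` returns, and charge the elapsed time with `cbs_charge()`. A throttled task stays out of the selection until its next activation, which is the bandwidth isolation the message refers to.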
2013-11-28 17:14:43 +07:00
|
|
|
p->sched_class = &dl_sched_class;
|
2013-11-07 20:43:44 +07:00
|
|
|
} else if (rt_prio(prio)) {
|
|
|
|
if (dl_prio(oldprio))
|
|
|
|
p->dl.dl_boosted = 0;
|
|
|
|
if (oldprio < prio)
|
2016-01-18 21:27:07 +07:00
|
|
|
queue_flag |= ENQUEUE_HEAD;
|
2007-07-09 23:51:59 +07:00
|
|
|
p->sched_class = &rt_sched_class;
|
2013-11-07 20:43:44 +07:00
|
|
|
} else {
|
|
|
|
if (dl_prio(oldprio))
|
|
|
|
p->dl.dl_boosted = 0;
|
2015-02-19 07:23:56 +07:00
|
|
|
if (rt_prio(oldprio))
|
|
|
|
p->rt.timeout = 0;
|
2007-07-09 23:51:59 +07:00
|
|
|
p->sched_class = &fair_sched_class;
|
2013-11-07 20:43:44 +07:00
|
|
|
}
|
2007-07-09 23:51:59 +07:00
|
|
|
|
2006-06-27 16:54:51 +07:00
|
|
|
p->prio = prio;
|
|
|
|
|
2008-03-11 01:01:20 +07:00
|
|
|
if (running)
|
|
|
|
p->sched_class->set_curr_task(rq);
|
2014-08-20 16:47:32 +07:00
|
|
|
if (queued)
|
2016-01-18 21:27:07 +07:00
|
|
|
enqueue_task(rq, p, queue_flag);
|
2008-01-26 03:08:22 +07:00
|
|
|
|
2011-01-17 23:03:27 +07:00
|
|
|
check_class_changed(rq, p, prev_class, oldprio);
|
2011-06-07 01:07:38 +07:00
|
|
|
out_unlock:
|
2015-06-11 19:46:39 +07:00
|
|
|
preempt_disable(); /* keep rq from going away on us */
|
2015-08-01 02:28:18 +07:00
|
|
|
__task_rq_unlock(rq, &rf);
|
2015-06-11 19:46:39 +07:00
|
|
|
|
|
|
|
balance_callback(rq);
|
|
|
|
preempt_enable();
|
2006-06-27 16:54:51 +07:00
|
|
|
}
|
|
|
|
#endif
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, this makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling-related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for development and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is identical to that of their
already existing counterparts. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls according to their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:36 +07:00
|
|
|
|
2006-07-03 14:25:41 +07:00
|
|
|
void set_user_nice(struct task_struct *p, long nice)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-08-20 16:47:32 +07:00
|
|
|
int old_prio, delta, queued;
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
2006-07-03 14:25:42 +07:00
|
|
|
struct rq *rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-02-11 14:34:50 +07:00
|
|
|
if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
|
2005-04-17 05:20:36 +07:00
|
|
|
return;
|
|
|
|
/*
|
|
|
|
* We have to be careful: if called from sys_setpriority(),
|
|
|
|
* the task might be in the middle of scheduling on another CPU.
|
|
|
|
*/
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = task_rq_lock(p, &rf);
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* The RT priorities are set via sched_setscheduler(), but we still
|
|
|
|
* allow the 'normal' nice value to be set - but as expected
|
|
|
|
* it won't have any effect on scheduling until the task is
|
2013-11-28 17:14:43 +07:00
|
|
|
* SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2013-11-28 17:14:43 +07:00
|
|
|
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
|
2005-04-17 05:20:36 +07:00
|
|
|
p->static_prio = NICE_TO_PRIO(nice);
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2014-08-20 16:47:32 +07:00
|
|
|
queued = task_on_rq_queued(p);
|
|
|
|
if (queued)
|
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
Mike Meyer reported the following bug:
> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected. Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
> a. thread is running.
> b. thread moves between cgroups, changes scheduling class or priority.
> c. thread sleeps OR
> d. thread involuntarily gives up cpu.
>
> a. implies:
>
> thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
> 1. dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
> delta = 0
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> 2. enqueue_task(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* thread is still on cpu at this point. */
> thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
> dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
>
> /* delta is execution time not run_delay. */
> delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> Since thread was running between enqueue_task(rq, thread) and
> dequeue_task(rq, thread), the delta above is really execution
> time and not run_delay.
>
> d. results in:
>
> __sched_info_switch(thread, next_thread)
>
> sched_info_depart(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* last_queued not updated due to being non-zero */
> return
>
> Since thread was running between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread), the execution time
> between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread) now will become
> associated with run_delay due to when last_queued was last updated.
>
This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.
By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.
It even shrinks the code (due to inlining {en,de}queue_task()):
$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
text data bss dec hex filename
64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o
64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig
Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-30 22:44:13 +07:00
|
|
|
dequeue_task(rq, p, DEQUEUE_SAVE);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
p->static_prio = NICE_TO_PRIO(nice);
|
[PATCH] sched: implement smpnice
Problem:
The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.
For the sake of argument suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now
suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
2 nice==0 hard spinners running. The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each. However, whether this expectation is met is pretty
much down to luck as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced
with loads of 2.0 for each CPU. Two of these distributions involve one
nice==0 and one nice==19 task per CPU and in these circumstances the user's
expectations will be met. The other two distributions both involve both
nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
each task will get 50% of a CPU and the user's expectations will not be
met.
Solution:
The solution to this problem that is implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance. Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example. Suppose (in a slight variation of the
above example) that we have a two CPU system with 4 nice==0 and 4 nice==19
hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
the 4 nice==19 tasks are on the other CPU. The weighted loads for the two
CPUs would be 4.0 and 0.2 respectively and the load balancing code would
move 2 tasks resulting in one CPU with a load of 2.0 and the other with
load of 2.2. If this was considered to be a big enough imbalance to
justify moving a task and that task was moved using the current
move_tasks() then it would move the highest priority task that it found and
this would result in one CPU with a load of 3.0 and the other with a load
of 1.2 which would result in the movement of a task in the opposite
direction and so on -- infinite loop. If, on the other hand, an amount of
load to be moved is calculated from the imbalance (in this case 0.1) and
move_tasks() skips tasks until it finds ones whose contributions to the
weighted load are less than this amount it would move two of the nice==19
tasks resulting in a system with 2 nice==0 and 2 nice==19 on each CPU with
loads of 2.1 for each CPU.
One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations would be mathematically
identical to the current load balancing code.
Notes:
struct task_struct:
has a new field load_weight which (in a trade off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.
struct runqueue:
has a new field raw_weighted_load which is the sum of the load_weight
values for the currently runnable tasks on this run queue. This field
always needs to be updated when nr_running is updated so two new inline
functions inc_nr_running() and dec_nr_running() have been created to make
sure that this happens. This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.
int try_to_wake_up():
in this function the value SCHED_LOAD_SCALE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on. While this would be valid
on a system where the nice values for the runnable tasks were distributed
evenly around zero it will lead to anomalous load balancing if the
distribution is skewed in either direction. To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).
int move_tasks():
The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists. This precluded the simple
overloading of max_nr_move with max_load_move and necessitated the addition
of the latter as an extra argument to the function. The internal
implementation is then modified to move up to max_nr_move tasks and
max_load_move of weighted load. This slightly complicates the code where
move_tasks() is called and if ever active_load_balance() is changed to not
use move_tasks() the implementation of move_tasks() should be simplified
accordingly.
struct sched_group *find_busiest_group():
Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues are created. A similar solution is adopted except
that it is now the average per task contribution to a group's load (as
opposed to a run queue) that is required. As this value is not directly
available from the group it is calculated on the fly as the queues in the
groups are visited when determining the busiest group.
A key change to this function is that it no longer needs to scale down
*imbalance on exit as move_tasks() uses the load in its scaled form.
void set_user_nice():
has been modified to update the task's load_weight field when its nice
value changes, and also to ensure that its run queue's raw_weighted_load field is
updated if it was runnable.
From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
With smpnice, sched groups with the highest priority tasks can mask the imbalance
between the other sched groups within the same domain. This patch fixes some
of the scenarios listed below by not considering the sched groups which are
lightly loaded.
a) on a simple 4-way MP system, if we have one high priority and 4 normal
priority tasks, with smpnice we would like to see the high priority task
scheduled on one cpu, two other cpus getting one normal task each and the
fourth cpu getting the remaining two normal tasks. But with the current
smpnice, the extra normal priority task keeps jumping from one cpu to another
cpu having a normal priority task. This is because of the
busiest_has_loaded_cpus, nr_loaded_cpus logic. We are not including the
cpu with the high priority task in max_load calculations but are including it in
total and avg_load calculations, leading to max_load < avg_load, and load
balance between cpus running normal priority tasks (2 vs 1) will always show
an imbalance of one normal priority task, so the extra normal priority task will
keep moving from one cpu to another cpu having a normal priority task.
b) 4-way system with HT (8 logical processors). Package-P0 T0 has a
highest priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal
priority task each.. P2 and P3 are idle. With this patch, one of the
normal priority tasks on P1 will be moved to P2 or P3..
c) With the current weighted smp nice calculations, it doesn't always make
sense to look at the highest weighted runqueue in the busy group..
Consider a load balance scenario on a DP with HT system, with Package-0
containing one high priority and one low priority, Package-1 containing one
low priority(with other thread being idle).. Package-1 thinks that it need
to take the low priority thread from Package-0. And find_busiest_queue()
returns the cpu thread with highest priority task.. And ultimately(with
help of active load balance) we move high priority task to Package-1. And
same continues with Package-0 now, moving high priority task from package-1
to package-0.. Even without the presence of active load balance, load
balance will fail to balance the above scenario.. Fix find_busiest_queue
to use "imbalance" when it is lightly loaded.
[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
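The weighted-load arithmetic in the smpnice message above can be checked with a small user-space sketch. It assumes the message's own illustrative weights (a nice==0 task counts as 1.00 and a nice==19 task as 0.05), stored in hundredths to stay in integers. `example_weight()`, `weighted_load()` and `load_to_move()` are hypothetical names, and the weights are not the kernel's actual weight table.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative weights in hundredths, taken from the 95%/5% example only. */
#define WEIGHT_NICE_0  100 /* 1.00 */
#define WEIGHT_NICE_19   5 /* 0.05 */

/* Weight of one runnable task; this model only knows nice 0 and nice 19. */
static unsigned int example_weight(int nice)
{
	assert(nice == 0 || nice == 19);
	return nice == 19 ? WEIGHT_NICE_19 : WEIGHT_NICE_0;
}

/* Weighted load of a run queue: the sum of the weights of its runnable tasks. */
static unsigned int weighted_load(const int *nices, size_t n)
{
	unsigned int load = 0;

	for (size_t i = 0; i < n; i++)
		load += example_weight(nices[i]);
	return load;
}

/* Load to move to even out two queues: half of the difference between them. */
static unsigned int load_to_move(unsigned int a, unsigned int b)
{
	return (a > b ? a - b : b - a) / 2;
}
```

With four nice==0 tasks on one CPU and four nice==19 tasks on the other, the loads are 400 and 20 (4.0 and 0.2). For queues at 200 and 220 the amount to move is 10 (0.1), which corresponds to two nice==19 tasks and leaves both CPUs at 210 (2.1), matching the figures in the message.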
2006-06-27 16:54:34 +07:00
|
|
|
set_load_weight(p);
|
2006-06-27 16:54:51 +07:00
|
|
|
old_prio = p->prio;
|
|
|
|
p->prio = effective_prio(p);
|
|
|
|
delta = p->prio - old_prio;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-08-20 16:47:32 +07:00
|
|
|
if (queued) {
|
2015-09-30 22:44:13 +07:00
|
|
|
enqueue_task(rq, p, ENQUEUE_RESTORE);
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2007-05-09 10:27:06 +07:00
|
|
|
* If the task increased its priority or is running and
|
|
|
|
* lowered its priority, then reschedule its CPU:
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2007-05-09 10:27:06 +07:00
|
|
|
if (delta < 0 || (delta > 0 && task_running(rq, p)))
|
2014-06-29 03:03:57 +07:00
|
|
|
resched_curr(rq);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
out_unlock:
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(set_user_nice);
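The run_delay leak described in the commit message quoted inside set_user_nice() can be reproduced with a small user-space model. This is only a sketch under stated assumptions: `struct sched_info_model` and the `model_*` helpers are hypothetical stand-ins for the kernel's sched_info bookkeeping, and time is a plain counter rather than the runqueue clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two sched_info fields involved in the bug. */
struct sched_info_model {
	uint64_t last_queued;	/* 0 means "not waiting on a runqueue" */
	uint64_t run_delay;	/* accumulated time spent waiting to run */
};

/* Models sched_info_queued(): stamp the time unless already stamped. */
static void model_queued(struct sched_info_model *si, uint64_t now)
{
	if (!si->last_queued)
		si->last_queued = now;
}

/* Models sched_info_dequeued(): account the wait, then clear the stamp. */
static void model_dequeued(struct sched_info_model *si, uint64_t now)
{
	uint64_t delta = 0;

	if (si->last_queued)
		delta = now - si->last_queued;
	si->last_queued = 0;
	si->run_delay += delta;
}

/*
 * A running task has an attribute changed at t_change (a dequeue/enqueue
 * pair while it stays on the CPU) and then sleeps at t_sleep.
 * touch_sched_info != 0 models the old behaviour, where the pair updates
 * sched_info; 0 models the fix, where the pair leaves sched_info alone.
 * Returns the run_delay charged to the task.
 */
uint64_t model_change_then_sleep(int touch_sched_info,
				 uint64_t t_change, uint64_t t_sleep)
{
	struct sched_info_model si = { 0, 0 };	/* running: last_queued == 0 */

	assert(t_change > 0 && t_sleep >= t_change);

	if (touch_sched_info) {
		model_dequeued(&si, t_change);	/* step b, part 1 */
		model_queued(&si, t_change);	/* step b, part 2: stamps while on CPU */
	}
	model_dequeued(&si, t_sleep);		/* step c: the task really sleeps */
	return si.run_delay;
}
```

With t_change = 100 and t_sleep = 250, the old behaviour charges 150 time units of execution as run_delay, while the fixed behaviour charges none.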
|
|
|
|
|
2005-05-01 22:59:00 +07:00
|
|
|
/*
|
|
|
|
* can_nice - check if a task can reduce its nice value
|
|
|
|
* @p: task
|
|
|
|
* @nice: nice value
|
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
int can_nice(const struct task_struct *p, const int nice)
|
2005-05-01 22:59:00 +07:00
|
|
|
{
|
2005-08-19 01:24:19 +07:00
|
|
|
/* convert nice value [19,-20] to rlimit style value [1,40] */
|
2014-05-08 16:33:49 +07:00
|
|
|
int nice_rlim = nice_to_rlimit(nice);
|
2006-07-03 14:25:40 +07:00
|
|
|
|
2010-03-06 04:42:54 +07:00
|
|
|
return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
|
2005-05-01 22:59:00 +07:00
|
|
|
capable(CAP_SYS_NICE));
|
|
|
|
}
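The check in can_nice() can be sketched in user space as follows. The `MODEL_*` constants and `model_*` functions are hypothetical names; the conversion formula is an assumption based on the comment above (nice 19 maps to 1, nice -20 maps to 40), and the capability check is reduced to a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NICE	19
#define MODEL_MIN_NICE	(-20)

/* Convert a nice value in [19,-20] to an rlimit-style value in [1,40]. */
long model_nice_to_rlimit(long nice)
{
	assert(nice >= MODEL_MIN_NICE && nice <= MODEL_MAX_NICE);
	return MODEL_MAX_NICE - nice + 1;
}

/*
 * A task may move to the given nice value if the rlimit-style value fits
 * under its RLIMIT_NICE, or if it holds the (modelled) CAP_SYS_NICE.
 */
bool model_can_nice(long nice, long rlimit_nice, bool has_cap_sys_nice)
{
	return model_nice_to_rlimit(nice) <= rlimit_nice || has_cap_sys_nice;
}
```

For example, with an RLIMIT_NICE of 20 an unprivileged task may return to nice 0 (rlimit-style value 20) but not go down to nice -5 (rlimit-style value 25).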
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#ifdef __ARCH_WANT_SYS_NICE
|
|
|
|
|
|
|
|
/*
|
|
|
|
* sys_nice - change the priority of the current process.
|
|
|
|
* @increment: priority increment
|
|
|
|
*
|
|
|
|
* sys_setpriority is a more generic, but much slower function that
|
|
|
|
* does similar things.
|
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE1(nice, int, increment)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:40 +07:00
|
|
|
long nice, retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Setpriority might change our priority at the same moment.
|
|
|
|
* We don't have to worry. Conceptually one call occurs first
|
|
|
|
* and we have a single winner.
|
|
|
|
*/
|
2014-05-08 16:35:15 +07:00
|
|
|
increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);
|
2014-01-28 10:00:45 +07:00
|
|
|
nice = task_nice(current) + increment;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-05-08 16:35:15 +07:00
|
|
|
nice = clamp_val(nice, MIN_NICE, MAX_NICE);
|
2005-05-01 22:59:00 +07:00
|
|
|
if (increment < 0 && !can_nice(current, nice))
|
|
|
|
return -EPERM;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
retval = security_task_setnice(current, nice);
|
|
|
|
if (retval)
|
|
|
|
return retval;
|
|
|
|
|
|
|
|
set_user_nice(current, nice);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif
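The two clamping steps in sys_nice() can be illustrated with a minimal user-space sketch. The names below are hypothetical, and the sketch assumes NICE_WIDTH is the size of the nice range (40). Clamping the increment first keeps the addition from overflowing for extreme arguments; clamping the sum keeps the result inside the valid nice range.

```c
#include <assert.h>

#define MODEL_MAX_NICE		19
#define MODEL_MIN_NICE		(-20)
#define MODEL_NICE_WIDTH	(MODEL_MAX_NICE - MODEL_MIN_NICE + 1)

/* Plain clamp helper standing in for the kernel's clamp()/clamp_val(). */
static long model_clamp(long v, long lo, long hi)
{
	assert(lo <= hi);
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Compute the nice value that a nice(increment) call would ask for. */
long model_target_nice(long current_nice, long increment)
{
	increment = model_clamp(increment, -MODEL_NICE_WIDTH, MODEL_NICE_WIDTH);
	return model_clamp(current_nice + increment,
			   MODEL_MIN_NICE, MODEL_MAX_NICE);
}
```

The permission and security checks that follow in the real function are left out of this sketch.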
|
|
|
|
|
|
|
|
/**
|
|
|
|
* task_prio - return the priority value of a given task.
|
|
|
|
* @p: the task in question.
|
|
|
|
*
|
2013-07-13 01:45:47 +07:00
|
|
|
* Return: The priority value as seen by users in /proc.
|
2005-04-17 05:20:36 +07:00
|
|
|
* RT tasks map to negative values (-100 to -2 with MAX_RT_PRIO == 100).
|
|
|
|
* Normal tasks range from 0 (nice -20) to 39 (nice +19).
|
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
int task_prio(const struct task_struct *p)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return p->prio - MAX_RT_PRIO;
|
|
|
|
}
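The value returned by task_prio() can be worked through with a small model. The constants here are assumptions not shown in this file (MAX_RT_PRIO taken as 100 and the default priority as 120), and the `model_*` helpers are hypothetical.

```c
#include <assert.h>

#define MODEL_MAX_RT_PRIO	100
#define MODEL_DEFAULT_PRIO	(MODEL_MAX_RT_PRIO + 20)

/* Kernel priority of a normal task with the given nice value. */
int model_nice_to_prio(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return nice + MODEL_DEFAULT_PRIO;
}

/* Kernel priority of an RT task with the given rt_priority (1..99). */
int model_rt_to_prio(int rt_priority)
{
	assert(rt_priority >= 1 && rt_priority <= 99);
	return MODEL_MAX_RT_PRIO - 1 - rt_priority;
}

/* Mirrors the body of task_prio(): prio - MAX_RT_PRIO. */
int model_task_prio(int prio)
{
	return prio - MODEL_MAX_RT_PRIO;
}
```

Under these assumptions a nice 0 task reports 20, the nice range maps to 0..39, and RT tasks report values from -2 (rt_priority 1) down to -100 (rt_priority 99).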
|
|
|
|
|
|
|
|
/**
|
|
|
|
* idle_cpu - is a given cpu idle currently?
|
|
|
|
* @cpu: the processor in question.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: 1 if the CPU is currently idle. 0 otherwise.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
int idle_cpu(int cpu)
|
|
|
|
{
|
2011-09-15 20:32:06 +07:00
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
|
|
|
|
if (rq->curr != rq->idle)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (rq->nr_running)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
if (!llist_empty(&rq->wake_list))
|
|
|
|
return 0;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
return 1;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* idle_task - return the idle task for a given cpu.
|
|
|
|
* @cpu: the processor in question.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: The idle task for the cpu @cpu.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *idle_task(int cpu)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return cpu_rq(cpu)->idle;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* find_process_by_pid - find a process with a matching PID value.
|
|
|
|
* @pid: the pid in question.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: The task of @pid, if found. %NULL otherwise.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2007-10-15 22:00:13 +07:00
|
|
|
static struct task_struct *find_process_by_pid(pid_t pid)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2007-10-19 13:40:16 +07:00
|
|
|
return pid ? find_task_by_vpid(pid) : current;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
the SCHED_DEADLINE implementation.
The core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking whether a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval within which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant at which a task (or rather, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime in every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with
the computational model sketched above can also use the new
policy effectively.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
|
|
|
/*
|
|
|
|
* This function initializes the sched_dl_entity of a newly becoming
|
|
|
|
* SCHED_DEADLINE task.
|
|
|
|
*
|
|
|
|
* Only the static values are considered here, the actual runtime and the
|
|
|
|
* absolute deadline will be properly calculated when the task is enqueued
|
|
|
|
* for the first time with its new policy.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
__setparam_dl(struct task_struct *p, const struct sched_attr *attr)
|
|
|
|
{
|
|
|
|
struct sched_dl_entity *dl_se = &p->dl;
|
|
|
|
|
|
|
|
dl_se->dl_runtime = attr->sched_runtime;
|
|
|
|
dl_se->dl_deadline = attr->sched_deadline;
|
2013-11-07 20:43:40 +07:00
|
|
|
dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
|
2013-11-28 17:14:43 +07:00
|
|
|
dl_se->flags = attr->sched_flags;
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls with similar names, equivalent meaning and
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in any root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
|
|
|
dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
|
2015-01-28 21:08:03 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Changing the parameters of a task is 'tricky' and we're not doing
|
|
|
|
* the correct thing -- also see task_dead_dl() and switched_from_dl().
|
|
|
|
*
|
|
|
|
* What we SHOULD do is delay the bandwidth release until the 0-lag
|
|
|
|
* point. This would include retaining the task_struct until that time
|
|
|
|
* and change dl_overflow() to not immediately decrement the current
|
|
|
|
* amount.
|
|
|
|
*
|
|
|
|
* Instead we retain the current runtime/deadline and let the new
|
|
|
|
* parameters take effect after the current reservation period lapses.
|
|
|
|
* This is safe (albeit pessimistic) because the 0-lag point is always
|
|
|
|
* before the current scheduling deadline.
|
|
|
|
*
|
|
|
|
* We can still have temporary overloads because we do not delay the
|
|
|
|
* change in bandwidth until that time; so admission control is
|
|
|
|
* not on the safe side. It does however guarantee tasks will never
|
|
|
|
* consume more than promised.
|
|
|
|
*/
|
2013-11-28 17:14:43 +07:00
|
|
|
}
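The bandwidth bookkeeping in __setparam_dl() and the admission rule from the bandwidth-management commit message can be sketched in user space. Everything named `model_*` or `MODEL_*` is hypothetical; in particular, the 20-bit fixed-point shift is an assumption about how to_ratio() scales runtime / period, and the admission check is a simplified reading of the "M * (runtime / period)" bound rather than the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BW_SHIFT	20	/* assumed fixed-point scale: 1.0 == 1 << 20 */

/* Fixed-point runtime / period ratio; a zero period yields zero bandwidth. */
uint64_t model_to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << MODEL_BW_SHIFT) / period;
}

/*
 * Same effect as "attr->sched_period ?: dl_se->dl_deadline", written with
 * the standard ternary operator: a zero period falls back to the deadline.
 */
uint64_t model_dl_period(uint64_t sched_period, uint64_t sched_deadline)
{
	return sched_period ? sched_period : sched_deadline;
}

/*
 * Admission control for a root domain with `cpus` CPUs: accept the new
 * task only if the already allocated bandwidth plus the new task's
 * bandwidth stays within cpus * cap_bw.
 */
int model_admit(uint64_t cap_bw, unsigned int cpus,
		uint64_t total_bw, uint64_t new_bw)
{
	return total_bw + new_bw <= cap_bw * cpus;
}
```

A task with a runtime of 10 and a period of 100 gets a bandwidth of (10 << 20) / 100 in this representation, that is, one tenth of a CPU.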
|
|
|
|
|
2014-07-23 22:28:26 +07:00
|
|
|
/*
|
|
|
|
* sched_setparam() passes in -1 for its policy, to let the functions
|
|
|
|
* it calls know not to change it.
|
|
|
|
*/
|
|
|
|
#define SETPARAM_POLICY -1
|
|
|
|
|
2014-02-08 02:58:42 +07:00
|
|
|
static void __setscheduler_params(struct task_struct *p,
|
|
|
|
const struct sched_attr *attr)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is identical to that of their
already existing counterparts. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls according to their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:36 +07:00
|
|
|
int policy = attr->sched_policy;
|
|
|
|
|
2014-07-23 22:28:26 +07:00
|
|
|
if (policy == SETPARAM_POLICY)
|
2014-01-15 22:33:20 +07:00
|
|
|
policy = p->policy;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
p->policy = policy;
|
2013-11-07 20:43:36 +07:00
|
|
|
|
2013-11-28 17:14:43 +07:00
|
|
|
if (dl_policy(policy))
|
|
|
|
__setparam_dl(p, attr);
|
2014-01-15 22:33:20 +07:00
|
|
|
else if (fair_policy(policy))
|
2013-11-07 20:43:36 +07:00
|
|
|
p->static_prio = NICE_TO_PRIO(attr->sched_nice);
|
|
|
|
|
2014-01-15 22:33:20 +07:00
|
|
|
/*
|
|
|
|
* __sched_setscheduler() ensures attr->sched_priority == 0 when
|
|
|
|
* !rt_policy. Always setting this ensures that things like
|
|
|
|
* getparam()/getattr() don't report silly values for !rt tasks.
|
|
|
|
*/
|
|
|
|
p->rt_priority = attr->sched_priority;
|
sched: Fix broken setscheduler()
I decided to run my tests on linux-next, and my wakeup_rt tracer was
broken. After running a bisect, I found that the problem commit was:
linux-next commit c365c292d059
"sched: Consider pi boosting in setscheduler()"
And the reason the wake_rt tracer test was failing, was because it had
no RT task to trace. I first noticed this when running with
sched_switch event and saw that my RT task still had normal SCHED_OTHER
priority. Looking at the problem commit, I found:
- p->normal_prio = normal_prio(p);
- p->prio = rt_mutex_getprio(p);
With no
+ p->normal_prio = normal_prio(p);
+ p->prio = rt_mutex_getprio(p);
Reading what the commit is supposed to do, I realize that the p->prio
can't be set if the task is boosted with a higher prio, but the
p->normal_prio still needs to be set regardless; otherwise, when the
task is deboosted, it won't get the new priority.
The p->prio has to be set before "check_class_changed()" is called,
otherwise the class won't be changed.
Also added a fix to newprio to include a check for deadline policy that
was missing. This change was suggested by Juri Lelli.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-12 06:24:20 +07:00
|
|
|
p->normal_prio = normal_prio(p);
|
2014-02-08 02:58:42 +07:00
|
|
|
set_load_weight(p);
|
|
|
|
}
|
2014-01-15 22:33:20 +07:00
|
|
|
|
2014-02-08 02:58:42 +07:00
|
|
|
/* Actually do priority change: must hold pi & rq lock. */
|
|
|
|
static void __setscheduler(struct rq *rq, struct task_struct *p,
|
2015-05-06 00:49:49 +07:00
|
|
|
const struct sched_attr *attr, bool keep_boost)
|
2014-02-08 02:58:42 +07:00
|
|
|
{
|
|
|
|
__setscheduler_params(p, attr);
|
2013-11-07 20:43:36 +07:00
|
|
|
|
2014-03-12 06:24:20 +07:00
|
|
|
/*
|
2015-05-06 00:49:49 +07:00
|
|
|
* Keep a potential priority boosting if called from
|
|
|
|
* sched_setscheduler().
|
2014-03-12 06:24:20 +07:00
|
|
|
*/
|
2015-05-06 00:49:49 +07:00
|
|
|
if (keep_boost)
|
|
|
|
p->prio = rt_mutex_get_effective_prio(p, normal_prio(p));
|
|
|
|
else
|
|
|
|
p->prio = normal_prio(p);
|
2014-03-12 06:24:20 +07:00
|
|
|
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that
each task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with
the computational model sketched above can also use the new policy
effectively.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
|
|
|
if (dl_prio(p->prio))
|
|
|
|
p->sched_class = &dl_sched_class;
|
|
|
|
else if (rt_prio(p->prio))
|
2009-11-11 02:12:01 +07:00
|
|
|
p->sched_class = &rt_sched_class;
|
|
|
|
else
|
|
|
|
p->sched_class = &fair_sched_class;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
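As a rough illustration of the class selection that closes the function above, the following userspace sketch maps a kernel priority value to a scheduling class. The enum, the `sketch_` names, and the boundary values `MAX_DL_PRIO = 0` and `MAX_RT_PRIO = 100` are assumptions made for this example; they do not appear in the lines shown here.

```c
/* Assumed priority-range boundaries (not taken from the listing above). */
#define MAX_DL_PRIO 0
#define MAX_RT_PRIO 100

/* Hypothetical stand-ins for dl_sched_class, rt_sched_class, fair_sched_class. */
enum sketch_class { SKETCH_DL, SKETCH_RT, SKETCH_FAIR };

/* Deadline tasks are assumed to sit below every RT priority. */
static int sketch_dl_prio(int prio)
{
	return prio < MAX_DL_PRIO;
}

/* RT tasks are assumed to occupy the range below MAX_RT_PRIO. */
static int sketch_rt_prio(int prio)
{
	return prio < MAX_RT_PRIO;
}

/* Same if / else-if / else ladder as the listing: deadline, then RT, then fair. */
static enum sketch_class sketch_pick_class(int prio)
{
	if (sketch_dl_prio(prio))
		return SKETCH_DL;
	else if (sketch_rt_prio(prio))
		return SKETCH_RT;
	else
		return SKETCH_FAIR;
}
```

The order of the tests matters in this sketch: a deadline priority also satisfies the RT range check, so the deadline test has to come first.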
|
2013-11-28 17:14:43 +07:00
|
|
|
|
|
|
|
static void
|
|
|
|
__getparam_dl(struct task_struct *p, struct sched_attr *attr)
|
|
|
|
{
|
|
|
|
struct sched_dl_entity *dl_se = &p->dl;
|
|
|
|
|
|
|
|
attr->sched_priority = p->rt_priority;
|
|
|
|
attr->sched_runtime = dl_se->dl_runtime;
|
|
|
|
attr->sched_deadline = dl_se->dl_deadline;
|
2013-11-07 20:43:40 +07:00
|
|
|
attr->sched_period = dl_se->dl_period;
|
2013-11-28 17:14:43 +07:00
|
|
|
attr->sched_flags = dl_se->flags;
|
|
|
|
}
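The runtime, deadline, and period values reported by __getparam_dl() feed the EDF rule described in the SCHED_DEADLINE commit message: the absolute deadline is the activation time plus the relative deadline, and the task with the smallest absolute deadline runs first. A minimal userspace sketch of that rule follows. The struct and the `sketch_` functions are hypothetical, and the plain `<` comparison ignores clock wrap-around, which a real implementation would have to handle.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task state for the sketch, all times in nanoseconds. */
struct sketch_dl_task {
	uint64_t activation;   /* instant at which the current instance activated */
	uint64_t rel_deadline; /* relative deadline of the task */
};

/* Absolute deadline = activation instant + relative deadline. */
static uint64_t sketch_abs_deadline(const struct sketch_dl_task *t)
{
	return t->activation + t->rel_deadline;
}

/*
 * Return the index of the task with the earliest absolute deadline,
 * or -1 when the array is empty. Ties keep the lowest index.
 */
static int sketch_edf_pick(const struct sketch_dl_task *tasks, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (best < 0 ||
		    sketch_abs_deadline(&tasks[i]) < sketch_abs_deadline(&tasks[best]))
			best = (int)i;
	}
	return best;
}
```

A linear scan keeps the example short; it says nothing about the data structure the scheduler itself uses to find the earliest deadline.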
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function validates the new parameters of a -deadline task.
|
|
|
|
* We require the deadline to be nonzero and greater than or equal
|
2013-11-07 20:43:40 +07:00
|
|
|
* to the runtime, and the period to be either zero or
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls with similar names, equivalent meaning and
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
|
|
|
* greater than or equal to the deadline. Furthermore, we have to be sure that
|
2014-05-13 19:11:31 +07:00
|
|
|
* user parameters are above the internal resolution of 1us (we
|
|
|
|
* check sched_runtime only since it is always the smaller one) and
|
|
|
|
* below 2^63 ns (we have to check both sched_deadline and
|
|
|
|
* sched_period, as the latter can be zero).
|
2013-11-28 17:14:43 +07:00
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
__checkparam_dl(const struct sched_attr *attr)
|
|
|
|
{
|
2014-05-13 19:11:31 +07:00
|
|
|
/* deadline != 0 */
|
|
|
|
if (attr->sched_deadline == 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since we truncate DL_SCALE bits, make sure we're at least
|
|
|
|
* that big.
|
|
|
|
*/
|
|
|
|
if (attr->sched_runtime < (1ULL << DL_SCALE))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since we use the MSB for wrap-around and sign issues, make
|
|
|
|
* sure it's not set (mind that period can be equal to zero).
|
|
|
|
*/
|
|
|
|
if (attr->sched_deadline & (1ULL << 63) ||
|
|
|
|
attr->sched_period & (1ULL << 63))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* runtime <= deadline <= period (if period != 0) */
|
|
|
|
if ((attr->sched_period != 0 &&
|
|
|
|
attr->sched_period < attr->sched_deadline) ||
|
|
|
|
attr->sched_deadline < attr->sched_runtime)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
2013-11-28 17:14:43 +07:00
|
|
|
}
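The checks in __checkparam_dl() can be tried out in userspace with the sketch below. It mirrors the four tests in the same order, but `struct sketch_attr`, the `sketch_` names, and the value `SKETCH_DL_SCALE = 10` (about 1us at nanosecond granularity) are assumptions made for this example, not definitions from the listing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed scale: 2^10 ns is roughly the 1us resolution the comment mentions. */
#define SKETCH_DL_SCALE 10

/* Hypothetical subset of struct sched_attr, all values in nanoseconds. */
struct sketch_attr {
	uint64_t runtime;
	uint64_t deadline;
	uint64_t period; /* 0 is allowed */
};

static bool sketch_checkparam_dl(const struct sketch_attr *a)
{
	/* deadline != 0 */
	if (a->deadline == 0)
		return false;

	/* runtime must survive truncation of the low SKETCH_DL_SCALE bits */
	if (a->runtime < (1ULL << SKETCH_DL_SCALE))
		return false;

	/* the most significant bit is reserved, so it must be clear */
	if ((a->deadline & (1ULL << 63)) || (a->period & (1ULL << 63)))
		return false;

	/* runtime <= deadline <= period (if period != 0) */
	if ((a->period != 0 && a->period < a->deadline) ||
	    a->deadline < a->runtime)
		return false;

	return true;
}
```

For instance, 10 ms of runtime with a 30 ms deadline and a 100 ms period passes all four tests, while a 500 ns runtime is rejected by the resolution check.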
|
|
|
|
|
2008-11-14 06:39:19 +07:00
|
|
|
/*
|
|
|
|
* Check that the target process has a UID that matches the current process's
|
|
|
|
*/
|
|
|
|
static bool check_same_owner(struct task_struct *p)
|
|
|
|
{
|
|
|
|
const struct cred *cred = current_cred(), *pcred;
|
|
|
|
bool match;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
pcred = __task_cred(p);
|
2012-02-03 09:54:02 +07:00
|
|
|
match = (uid_eq(cred->euid, pcred->euid) ||
|
|
|
|
uid_eq(cred->euid, pcred->uid));
|
2008-11-14 06:39:19 +07:00
|
|
|
rcu_read_unlock();
|
|
|
|
return match;
|
|
|
|
}
|
|
|
|
|
2014-11-26 07:44:04 +07:00
|
|
|
static bool dl_param_changed(struct task_struct *p,
|
|
|
|
const struct sched_attr *attr)
|
|
|
|
{
|
|
|
|
struct sched_dl_entity *dl_se = &p->dl;
|
|
|
|
|
|
|
|
if (dl_se->dl_runtime != attr->sched_runtime ||
|
|
|
|
dl_se->dl_deadline != attr->sched_deadline ||
|
|
|
|
dl_se->dl_period != attr->sched_period ||
|
|
|
|
dl_se->flags != attr->sched_flags)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
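The bandwidth-management commit message quoted earlier bounds the total -deadline bandwidth of a root_domain with M CPUs by M * (sched_dl_runtime_us / sched_dl_period_us). The sketch below shows one way such an admission test could be written with integer arithmetic. The 20-bit fixed-point representation, the `sketch_` names, and the choice to accept a task set that lands exactly on the cap are all assumptions of this example, not statements about the kernel's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed fixed-point precision for bandwidth ratios. */
#define SKETCH_BW_SHIFT 20

/* runtime / period as a fixed-point fraction; a zero period yields 0. */
static uint64_t sketch_to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << SKETCH_BW_SHIFT) / period;
}

/*
 * Admit a new task with bandwidth new_bw when the bandwidth already
 * allocated (total_bw) plus new_bw does not exceed
 * cpus * (cap_runtime_us / cap_period_us).
 */
static bool sketch_dl_admit(uint64_t total_bw, uint64_t new_bw,
			    unsigned int cpus,
			    uint64_t cap_runtime_us, uint64_t cap_period_us)
{
	uint64_t cap = cpus * sketch_to_ratio(cap_period_us, cap_runtime_us);

	return total_bw + new_bw <= cap;
}
```

With a hypothetical cap of 950000 us over 1000000 us on two CPUs, a task that needs 10 ms every 100 ms is admitted on an otherwise empty root_domain, because its ratio is far below twice the per-CPU cap.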
|
|
|
|
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is identical to that of their
already existing counterparts. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls according to their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:36 +07:00
|
|
|
static int __sched_setscheduler(struct task_struct *p,
|
|
|
|
const struct sched_attr *attr,
|
2015-06-11 19:46:38 +07:00
|
|
|
bool user, bool pi)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-03-12 06:24:20 +07:00
|
|
|
int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
|
|
|
|
MAX_RT_PRIO - 1 - attr->sched_priority;
|
2014-08-20 16:47:32 +07:00
|
|
|
int retval, oldprio, oldpolicy = -1, queued, running;
|
2015-05-06 00:49:49 +07:00
|
|
|
int new_effective_prio, policy = attr->sched_policy;
|
2010-02-17 15:05:48 +07:00
|
|
|
const struct sched_class *prev_class;
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
2009-06-15 22:17:47 +07:00
|
|
|
int reset_on_fork;
|
2016-01-18 21:27:07 +07:00
|
|
|
int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq *rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-06-27 16:54:44 +07:00
|
|
|
/* may grab non-irq protected spin_locks */
|
|
|
|
BUG_ON(in_interrupt());
|
2005-04-17 05:20:36 +07:00
|
|
|
recheck:
|
|
|
|
/* double check policy once rq lock held */
|
2009-06-15 22:17:47 +07:00
|
|
|
if (policy < 0) {
|
|
|
|
reset_on_fork = p->sched_reset_on_fork;
|
2005-04-17 05:20:36 +07:00
|
|
|
policy = oldpolicy = p->policy;
|
2009-06-15 22:17:47 +07:00
|
|
|
} else {
|
2014-01-15 23:05:04 +07:00
|
|
|
reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK);
|
2009-06-15 22:17:47 +07:00
|
|
|
|
2015-09-09 22:00:41 +07:00
|
|
|
if (!valid_policy(policy))
|
2009-06-15 22:17:47 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2014-01-15 23:05:04 +07:00
|
|
|
if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Valid priorities for SCHED_FIFO and SCHED_RR are
|
2007-07-09 23:51:59 +07:00
|
|
|
* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
|
|
|
|
* SCHED_BATCH and SCHED_IDLE is 0.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2014-01-15 23:15:13 +07:00
|
|
|
if ((p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
|
2013-11-07 20:43:36 +07:00
|
|
|
(!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
|
2005-04-17 05:20:36 +07:00
|
|
|
return -EINVAL;
|
2013-11-28 17:14:43 +07:00
|
|
|
if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
|
|
|
|
(rt_policy(policy) != (attr->sched_priority != 0)))
|
2005-04-17 05:20:36 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
[PATCH] Changing RT priority without CAP_SYS_NICE
Presently, a process without the capability CAP_SYS_NICE cannot change
its own policy, which is OK.
But it also cannot decrease its RT priority (if scheduled with policy
SCHED_RR or SCHED_FIFO), which is what this patch changes.
The rationale is the same as for the nice value: a process should be
able to request a lower priority for itself. Increasing the priority is
still not allowed.
This is useful, for example, if you give a multithreaded user process an
RT priority, and the process would like to organize its internal threads
using priorities as well. Then you can give the process the highest
priority needed, N, and the process starts its threads with lower
priorities: N-1, N-2...
The POSIX standard says that the permissions are implementation specific,
so I think we can do that.
In a sense, it makes the permissions consistent whatever the policy is:
with this patch, processes scheduled with SCHED_FIFO, SCHED_RR and
SCHED_OTHER can all decrease their priority.
From: Ingo Molnar <mingo@elte.hu>
cleaned up and merged to -mm.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-26 04:57:32 +07:00
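The rule described above (lowering an RT priority is allowed without CAP_SYS_NICE, raising it is not) can be modelled as a small pure function. This is a sketch under assumptions: `rt_prio_change_check()` is a hypothetical name, and it mirrors only the priority comparison against the current priority and RLIMIT_RTPRIO that appears in the blamed code of this function, not the rest of the permission logic.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of the unprivileged RT priority check.
 * Returns 0 if the change would be permitted, -EPERM otherwise.
 */
static int rt_prio_change_check(unsigned int cur_prio,
				unsigned int new_prio,
				unsigned long rlim_rtprio)
{
	/* Assumption of this sketch: priorities lie in the 0..99 range. */
	assert(cur_prio <= 99 && new_prio <= 99);

	/* Raising is refused only when it also exceeds the resource limit. */
	if (new_prio > cur_prio && new_prio > rlim_rtprio)
		return -EPERM;
	return 0;
}
```

With a limit of 0, a task at priority 50 may move to 40 but not to 60; with a limit of 70, the move to 60 is accepted as well.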
|
|
|
/*
|
|
|
|
* Allow unprivileged RT tasks to decrease priority:
|
|
|
|
*/
|
2008-06-23 10:55:38 +07:00
|
|
|
if (user && !capable(CAP_SYS_NICE)) {
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is identical to that of their
already existing counterparts. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls according to their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:36 +07:00
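As a rough illustration of the extended parameters ABI described above, the sketch below fills a local mirror of the original 48-byte attribute layout and passes it to the raw syscall. The names `struct sched_attr_v0`, `dl_attr_init()` and `set_attr()` are hypothetical and exist only for this example; the field order is the one the kernel documents for the first version of struct sched_attr, and the syscall is invoked through `syscall()` on the assumption that the C library offers no wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of the first sched_attr layout (hypothetical type name). */
struct sched_attr_v0 {
	uint32_t size;           /* size of this structure, for extensibility */
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;     /* used by the fair policies */
	uint32_t sched_priority; /* used by SCHED_FIFO / SCHED_RR */
	uint64_t sched_runtime;  /* the three SCHED_DEADLINE parameters (ns) */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

#define MODEL_SCHED_DEADLINE 6 /* numeric value of SCHED_DEADLINE on Linux */

/* Fill the attribute block for a -deadline task. */
static void dl_attr_init(struct sched_attr_v0 *a, uint64_t runtime_ns,
			 uint64_t deadline_ns, uint64_t period_ns)
{
	assert(a != NULL);
	memset(a, 0, sizeof(*a));
	a->size = sizeof(*a);
	a->sched_policy = MODEL_SCHED_DEADLINE;
	a->sched_runtime = runtime_ns;
	a->sched_deadline = deadline_ns;
	a->sched_period = period_ns;
}

/* Thin wrapper: returns 0 on success, -1 with errno set on failure. */
long set_attr(pid_t pid, const struct sched_attr_v0 *a)
{
#ifdef SYS_sched_setattr
	return syscall(SYS_sched_setattr, pid, a, 0U);
#else
	(void)pid;
	(void)a;
	errno = ENOSYS;
	return -1;
#endif
}
```

Calling `set_attr(0, &a)` targets the calling thread; without sufficient privilege the call is expected to fail with EPERM, which matches the policy check shown in this function.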
|
|
|
if (fair_policy(policy)) {
|
2014-01-28 10:00:45 +07:00
|
|
|
if (attr->sched_nice < task_nice(p) &&
|
2014-01-16 23:54:25 +07:00
|
|
|
!can_nice(p, attr->sched_nice))
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
2013-11-07 20:43:36 +07:00
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
if (rt_policy(policy)) {
|
2010-06-11 06:09:44 +07:00
|
|
|
unsigned long rlim_rtprio =
|
|
|
|
task_rlimit(p, RLIMIT_RTPRIO);
|
2006-09-29 16:00:50 +07:00
|
|
|
|
|
|
|
/* can't set/change the rt policy */
|
|
|
|
if (policy != p->policy && !rlim_rtprio)
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
/* can't increase priority */
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
2013-11-07 20:43:36 +07:00
|
|
|
if (attr->sched_priority > p->rt_priority &&
|
|
|
|
attr->sched_priority > rlim_rtprio)
|
2006-09-29 16:00:50 +07:00
|
|
|
return -EPERM;
|
|
|
|
}
|
2011-02-18 06:37:07 +07:00
|
|
|
|
2014-03-03 18:09:21 +07:00
|
|
|
/*
|
|
|
|
* Can't set/change SCHED_DEADLINE policy at all for now
|
|
|
|
* (safest behavior); in the future we would like to allow
|
|
|
|
* unprivileged DL tasks to increase their relative deadline
|
|
|
|
* or reduce their runtime (both ways reducing utilization)
|
|
|
|
*/
|
|
|
|
if (dl_policy(policy))
|
|
|
|
return -EPERM;
|
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
2011-02-18 06:37:07 +07:00
|
|
|
* Treat SCHED_IDLE as nice 20. Only allow a switch to
|
|
|
|
* SCHED_NORMAL if the RLIMIT_NICE would normally permit it.
|
2007-07-09 23:51:59 +07:00
|
|
|
*/
|
2015-09-09 22:00:41 +07:00
|
|
|
if (idle_policy(p->policy) && !idle_policy(policy)) {
|
2014-01-28 10:00:45 +07:00
|
|
|
if (!can_nice(p, task_nice(p)))
|
2011-02-18 06:37:07 +07:00
|
|
|
return -EPERM;
|
|
|
|
}
|
2006-09-29 16:00:48 +07:00
|
|
|
|
[PATCH] Changing RT priority without CAP_SYS_NICE
2005-06-26 04:57:32 +07:00
|
|
|
/* can't change other user's priorities */
|
2008-11-14 06:39:19 +07:00
|
|
|
if (!check_same_owner(p))
|
[PATCH] Changing RT priority without CAP_SYS_NICE
2005-06-26 04:57:32 +07:00
|
|
|
return -EPERM;
|
2009-06-15 22:17:47 +07:00
|
|
|
|
|
|
|
/* Normal users shall not reset the sched_reset_on_fork flag */
|
|
|
|
if (p->sched_reset_on_fork && !reset_on_fork)
|
|
|
|
return -EPERM;
|
[PATCH] Changing RT priority without CAP_SYS_NICE
2005-06-26 04:57:32 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-08-03 23:33:03 +07:00
|
|
|
if (user) {
|
2010-10-15 02:21:18 +07:00
|
|
|
retval = security_task_setscheduler(p);
|
2008-08-03 23:33:03 +07:00
|
|
|
if (retval)
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2006-06-27 16:54:51 +07:00
|
|
|
/*
|
|
|
|
* make sure no PI-waiters arrive (or leave) while we are
|
|
|
|
* changing the priority of the task:
|
2011-04-05 22:23:51 +07:00
|
|
|
*
|
2011-03-31 08:57:33 +07:00
|
|
|
* To be able to change p->policy safely, the appropriate
|
2005-04-17 05:20:36 +07:00
|
|
|
* runqueue lock must be held.
|
|
|
|
*/
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = task_rq_lock(p, &rf);
|
2010-06-08 16:40:42 +07:00
|
|
|
|
2010-09-22 18:53:15 +07:00
|
|
|
/*
|
|
|
|
* Changing the policy of the stop threads is a very bad idea
|
|
|
|
*/
|
|
|
|
if (p == rq->stop) {
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2010-09-22 18:53:15 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks
sched_setscheduler() (in sched.c) is called in order to change the
scheduling policy and/or the real-time priority of a task. Thus,
if we find out that neither of those is actually being modified, it
is possible to return earlier and save the overhead of a full
deactivate+activate cycle of the task in question.
Besides that, if we have more than one SCHED_FIFO task with the same
priority on the same rq (which means they share the same priority queue),
having one of them change its position in the priority queue because of
a sched_setscheduler() call (as happens by means of the deactivate+activate)
that does not actually change the priority violates POSIX, which states,
for SCHED_FIFO:
"If a thread whose policy or priority has been modified by
pthread_setschedprio() is a running thread or is runnable, the effect on
its position in the thread list depends on the direction of the
modification, as follows: a. <...> b. If the priority is unchanged, the
thread does not change position in the thread list. c. <...>"
http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_08.html
(ed: And the POSIX specification here does, briefly and somewhat unexpectedly,
match what common sense tells us as well. )
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1300971618.3960.82.camel@Palantir>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-24 20:00:18 +07:00
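The early-return idea from the commit message above can be sketched as a predicate that reports whether a request changes anything at all. This is a simplified model under assumptions: `struct model_task`, `enum model_policy` and `needs_requeue()` are hypothetical names, and only the fair and real-time cases are covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_policy { MODEL_NORMAL, MODEL_FIFO, MODEL_RR };

/* Hypothetical snapshot of a task's current scheduling parameters. */
struct model_task {
	enum model_policy policy;
	int nice;        /* meaningful for MODEL_NORMAL */
	int rt_priority; /* meaningful for MODEL_FIFO / MODEL_RR */
};

static bool is_rt(enum model_policy p)
{
	return p == MODEL_FIFO || p == MODEL_RR;
}

/*
 * Return true only if the request modifies the policy or the parameter
 * relevant to that policy; otherwise the caller can return early and
 * leave the task's queue position untouched.
 */
static bool needs_requeue(const struct model_task *t,
			  enum model_policy policy, int nice, int rt_priority)
{
	assert(t != NULL);
	if (policy != t->policy)
		return true;
	if (is_rt(policy))
		return rt_priority != t->rt_priority;
	return nice != t->nice;
}
```

When the predicate is false, skipping the deactivate+activate cycle keeps a SCHED_FIFO task at its position in the priority queue, which is the behavior the quoted POSIX text asks for.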
|
|
|
/*
|
2014-02-08 02:58:40 +07:00
|
|
|
* If not changing anything there's no need to proceed further,
|
|
|
|
* but store a possible modification of reset_on_fork.
|
sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks
2011-03-24 20:00:18 +07:00
|
|
|
*/
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
2013-11-07 20:43:36 +07:00
|
|
|
if (unlikely(policy == p->policy)) {
|
2014-01-28 10:00:45 +07:00
|
|
|
if (fair_policy(policy) && attr->sched_nice != task_nice(p))
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
2013-11-07 20:43:36 +07:00
|
|
|
goto change;
|
|
|
|
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
|
|
|
|
goto change;
|
2014-11-26 07:44:04 +07:00
|
|
|
if (dl_policy(policy) && dl_param_changed(p, attr))
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
the SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant at which a task (or,
better, an instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that
each task runs for at most its runtime in every time interval of
(relative) deadline length, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with
the computational model sketched above can also use the new policy
effectively.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
|
|
|
goto change;
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
2013-11-07 20:43:36 +07:00
|
|
|
|
2014-02-08 02:58:40 +07:00
|
|
|
p->sched_reset_on_fork = reset_on_fork;
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks
2011-03-24 20:00:18 +07:00
|
|
|
return 0;
|
|
|
|
}
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
2013-11-07 20:43:36 +07:00
|
|
|
change:
|
sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks
2011-03-24 20:00:18 +07:00
|
|
|
|
2010-06-08 16:40:42 +07:00
|
|
|
if (user) {
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones don't!), and thus we don't need a higher-level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and -deadline
tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
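The admission bound M * (sched_dl_runtime_us / sched_dl_period_us) from the message above can be worked through numerically. The sketch below is a user-space illustration only: the helper names, the 20-bit fixed-point shift and the choice to admit a task that lands exactly on the cap are assumptions made for this example, and the numeric values are illustrative rather than kernel defaults.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point shift for bandwidth fractions (an assumption of this sketch). */
#define DEMO_BW_SHIFT 20

/* runtime/period as a fixed-point fraction; 0 when period is 0. */
static uint64_t demo_to_ratio(uint64_t period_us, uint64_t runtime_us)
{
	if (period_us == 0)
		return 0;
	return (runtime_us << DEMO_BW_SHIFT) / period_us;
}

/*
 * Returns 1 if admitting a task with bandwidth new_bw would push the
 * root_domain past cpus * cap_bw. Landing exactly on the cap is
 * treated as admissible here (an assumption).
 */
static int demo_would_overflow(unsigned int cpus, uint64_t cap_bw,
			       uint64_t total_bw, uint64_t new_bw)
{
	return cap_bw * cpus < total_bw + new_bw;
}

/* Counts how many identical tasks fit under the cap. */
static unsigned int demo_max_admitted(unsigned int cpus, uint64_t cap_bw,
				      uint64_t task_bw)
{
	uint64_t total = 0;
	unsigned int n = 0;

	assert(task_bw > 0);	/* otherwise the loop would never end */
	while (!demo_would_overflow(cpus, cap_bw, total, task_bw)) {
		total += task_bw;
		n++;
	}
	return n;
}
```

With M = 4 CPUs, a per-CPU cap of 950000 / 1000000 and tasks that each need 10000 us every 100000 us (10% of a CPU), `demo_max_admitted()` returns 38: the tasks together use 3.8 CPUs, matching 4 * 0.95.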
2013-11-07 20:43:45 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2010-06-08 16:40:42 +07:00
|
|
|
/*
|
|
|
|
* Do not allow realtime tasks into groups that have no runtime
|
|
|
|
* assigned.
|
|
|
|
*/
|
|
|
|
if (rt_bandwidth_enabled() && rt_policy(policy) &&
|
2011-01-13 10:54:50 +07:00
|
|
|
task_group(p)->rt_bandwidth.rt_runtime == 0 &&
|
|
|
|
!task_group_is_autogroup(task_group(p))) {
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2010-06-08 16:40:42 +07:00
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
#endif
|
2013-11-07 20:43:45 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
if (dl_bandwidth_enabled() && dl_policy(policy)) {
|
|
|
|
cpumask_t *span = rq->rd->span;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't allow tasks with an affinity mask smaller than
|
|
|
|
* the entire root_domain to become SCHED_DEADLINE. We
|
|
|
|
* will also fail if there's no bandwidth available.
|
|
|
|
*/
|
2013-12-17 16:03:34 +07:00
|
|
|
if (!cpumask_subset(span, &p->cpus_allowed) ||
|
|
|
|
rq->rd->dl_bw.bw == 0) {
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2013-11-07 20:43:45 +07:00
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
2010-06-08 16:40:42 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* recheck policy now with rq lock held */
|
|
|
|
if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
|
|
|
|
policy = oldpolicy = -1;
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2005-04-17 05:20:36 +07:00
|
|
|
goto recheck;
|
|
|
|
}
|
2013-11-07 20:43:45 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If setscheduling to SCHED_DEADLINE (or changing the parameters
|
|
|
|
* of a SCHED_DEADLINE task) we need to check if enough bandwidth
|
|
|
|
* is available.
|
|
|
|
*/
|
2013-12-17 16:03:34 +07:00
|
|
|
if ((dl_policy(policy) || dl_task(p)) && dl_overflow(p, policy, attr)) {
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2013-11-07 20:43:45 +07:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
2014-02-08 02:58:42 +07:00
|
|
|
p->sched_reset_on_fork = reset_on_fork;
|
|
|
|
oldprio = p->prio;
|
|
|
|
|
2015-06-11 19:46:38 +07:00
|
|
|
if (pi) {
|
|
|
|
/*
|
|
|
|
* Take priority boosted tasks into account. If the new
|
|
|
|
* effective priority is unchanged, we just store the new
|
|
|
|
* normal parameters and do not touch the scheduler class and
|
|
|
|
* the runqueue. This will be done when the task deboosts
|
|
|
|
* itself.
|
|
|
|
*/
|
|
|
|
new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
|
2016-01-18 21:27:07 +07:00
|
|
|
if (new_effective_prio == oldprio)
|
|
|
|
queue_flags &= ~DEQUEUE_MOVE;
|
2014-02-08 02:58:42 +07:00
|
|
|
}
|
|
|
|
|
2014-08-20 16:47:32 +07:00
|
|
|
queued = task_on_rq_queued(p);
|
2007-12-18 21:21:13 +07:00
|
|
|
running = task_current(rq, p);
|
2014-08-20 16:47:32 +07:00
|
|
|
if (queued)
|
2016-01-18 21:27:07 +07:00
|
|
|
dequeue_task(rq, p, queue_flags);
|
2008-03-11 01:01:20 +07:00
|
|
|
if (running)
|
2014-09-12 20:41:40 +07:00
|
|
|
put_prev_task(rq, p);
|
2007-10-15 22:00:08 +07:00
|
|
|
|
2010-02-17 15:05:48 +07:00
|
|
|
prev_class = p->sched_class;
|
2015-06-11 19:46:38 +07:00
|
|
|
__setscheduler(rq, p, attr, pi);
|
2007-10-15 22:00:08 +07:00
|
|
|
|
2008-03-11 01:01:20 +07:00
|
|
|
if (running)
|
|
|
|
p->sched_class->set_curr_task(rq);
|
2014-08-20 16:47:32 +07:00
|
|
|
if (queued) {
|
2014-02-08 02:58:41 +07:00
|
|
|
/*
|
|
|
|
* We enqueue to tail when the priority of a task is
|
|
|
|
* increased (user space view).
|
|
|
|
*/
|
2016-01-18 21:27:07 +07:00
|
|
|
if (oldprio < p->prio)
|
|
|
|
queue_flags |= ENQUEUE_HEAD;
|
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
Mike Meyer reported the following bug:
> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected. Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
> a. thread is running.
> b. thread moves between cgroups, changes scheduling class or priority.
> c. thread sleeps OR
> d. thread involuntarily gives up cpu.
>
> a. implies:
>
> thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
> 1. dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
> delta = 0
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> 2. enqueue_task(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* thread is still on cpu at this point. */
> thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
> dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
>
> /* delta is execution time not run_delay. */
> delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> Since thread was running between enqueue_task(rq, thread) and
> dequeue_task(rq, thread), the delta above is really execution
> time and not run_delay.
>
> d. results in:
>
> __sched_info_switch(thread, next_thread)
>
> sched_info_depart(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* last_queued not updated due to being non-zero */
> return
>
> Since thread was running between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread), the execution time
> between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread) now will become
> associated with run_delay due to when last_queued was last updated.
>
This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.
By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.
It even shrinks the code (due to inlining {en,de}queue_task()):
$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
text data bss dec hex filename
64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o
64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig
Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
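The sequence in the bug report above (steps a, b and c) can be replayed with a small model. This is a simplified user-space sketch under stated assumptions: `struct demo_info` and the `demo_*` helpers are hypothetical stand-ins that follow the behaviour quoted in the report (a zero `last_queued` yields a zero delta, and queueing only records a timestamp when `last_queued` is zero), and the clock values are arbitrary.

```c
#include <assert.h>

struct demo_info {
	unsigned long long last_queued;	/* 0 means "not waiting on a runqueue" */
	unsigned long long run_delay;	/* accumulated time spent waiting */
};

static void demo_info_queued(struct demo_info *t, unsigned long long now)
{
	if (!t->last_queued)
		t->last_queued = now;
}

static void demo_info_dequeued(struct demo_info *t, unsigned long long now)
{
	unsigned long long delta = 0;

	if (t->last_queued) {
		assert(now >= t->last_queued);
		delta = now - t->last_queued;
	}
	t->last_queued = 0;
	t->run_delay += delta;
}

/*
 * Step b: a running thread is dequeued and re-enqueued (for example on
 * a priority change). track_info == 1 models the old behaviour, 0 the
 * behaviour after the fix, where the accounting hooks are skipped.
 */
static void demo_requeue_running(struct demo_info *t,
				 unsigned long long now, int track_info)
{
	if (track_info) {
		demo_info_dequeued(t, now);
		demo_info_queued(t, now);
	}
}

static unsigned long long demo_simulate(int track_info)
{
	struct demo_info t = { 0, 0 };		/* a. thread is running */

	demo_requeue_running(&t, 100, track_info);	/* b. at clock 100 */
	demo_info_dequeued(&t, 150);			/* c. sleeps at clock 150 */
	return t.run_delay;
}
```

In this model `demo_simulate(1)` returns 50, so 50 units of execution time are booked as run_delay, while `demo_simulate(0)` returns 0.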
2015-09-30 22:44:13 +07:00
|
|
|
|
2016-01-18 21:27:07 +07:00
|
|
|
enqueue_task(rq, p, queue_flags);
|
2014-02-08 02:58:41 +07:00
|
|
|
}
|
2008-01-26 03:08:22 +07:00
|
|
|
|
2011-01-17 23:03:27 +07:00
|
|
|
check_class_changed(rq, p, prev_class, oldprio);
|
2015-06-11 19:46:39 +07:00
|
|
|
preempt_disable(); /* avoid rq from going away on us */
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2006-06-27 16:54:51 +07:00
|
|
|
|
2015-06-11 19:46:38 +07:00
|
|
|
if (pi)
|
|
|
|
rt_mutex_adjust_pi(p);
|
2006-06-27 16:55:02 +07:00
|
|
|
|
2015-06-11 19:46:39 +07:00
|
|
|
/*
|
|
|
|
* Run balance callbacks after we've adjusted the PI chain.
|
|
|
|
*/
|
|
|
|
balance_callback(rq);
|
|
|
|
preempt_enable();
|
2006-06-27 16:55:02 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2008-06-23 10:55:38 +07:00
|
|
|
|
2014-01-15 23:05:04 +07:00
|
|
|
static int _sched_setscheduler(struct task_struct *p, int policy,
|
|
|
|
const struct sched_param *param, bool check)
|
|
|
|
{
|
|
|
|
struct sched_attr attr = {
|
|
|
|
.sched_policy = policy,
|
|
|
|
.sched_priority = param->sched_priority,
|
|
|
|
.sched_nice = PRIO_TO_NICE(p->static_prio),
|
|
|
|
};
|
|
|
|
|
2014-07-23 22:28:26 +07:00
|
|
|
/* Fixup the legacy SCHED_RESET_ON_FORK hack. */
|
|
|
|
if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
|
2014-01-15 23:05:04 +07:00
|
|
|
attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
|
|
|
|
policy &= ~SCHED_RESET_ON_FORK;
|
|
|
|
attr.sched_policy = policy;
|
|
|
|
}
|
|
|
|
|
2015-06-11 19:46:38 +07:00
|
|
|
return __sched_setscheduler(p, &attr, check, true);
|
2014-01-15 23:05:04 +07:00
|
|
|
}
|
2008-06-23 10:55:38 +07:00
|
|
|
/**
|
|
|
|
* sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
|
|
|
|
* @p: the task in question.
|
|
|
|
* @policy: new policy.
|
|
|
|
* @param: structure containing the new RT priority.
|
|
|
|
*
|
2013-07-13 01:45:47 +07:00
|
|
|
* Return: 0 on success. An error code otherwise.
|
|
|
|
*
|
2008-06-23 10:55:38 +07:00
|
|
|
* NOTE that the task may be already dead.
|
|
|
|
*/
|
|
|
|
int sched_setscheduler(struct task_struct *p, int policy,
|
2010-10-21 06:01:12 +07:00
|
|
|
const struct sched_param *param)
|
2008-06-23 10:55:38 +07:00
|
|
|
{
|
2014-01-15 23:05:04 +07:00
|
|
|
return _sched_setscheduler(p, policy, param, true);
|
2008-06-23 10:55:38 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
EXPORT_SYMBOL_GPL(sched_setscheduler);
|
|
|
|
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls according to their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
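One way an extensible parameter struct can avoid the ABI break mentioned above is to carry its own size, which is what the size handling in sched_copy_attr() in this file relies on. The following is a user-space sketch of that negotiation, not the kernel routine: `struct demo_attr`, its fields, the constants and the use of -E2BIG for every size error are assumptions for illustration, and the caller is assumed to pass a buffer at least as long as the size it declares.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical versioned struct: "flags" was added after version 0. */
struct demo_attr {
	uint32_t size;
	uint32_t policy;
	uint64_t flags;
};

#define DEMO_ATTR_SIZE_VER0	8	/* size + policy only */
#define DEMO_ATTR_SIZE_MAX	4096	/* anything larger is rejected */

/* Copies a caller-supplied attr of unknown vintage into *attr. */
static int demo_copy_attr(const unsigned char *src, struct demo_attr *attr)
{
	uint32_t size;
	size_t i;

	assert(src && attr);

	/* Zero first, so a short copy leaves newer fields at 0. */
	memset(attr, 0, sizeof(*attr));
	memcpy(&size, src, sizeof(size));

	if (size > DEMO_ATTR_SIZE_MAX)
		return -E2BIG;
	if (size == 0)			/* caller never set a size */
		size = DEMO_ATTR_SIZE_VER0;
	if (size < DEMO_ATTR_SIZE_VER0)
		return -E2BIG;

	/* A larger struct is fine only if the part we don't know is zero. */
	if (size > sizeof(*attr)) {
		for (i = sizeof(*attr); i < size; i++) {
			if (src[i])
				return -E2BIG;
		}
		size = sizeof(*attr);
	}

	memcpy(attr, src, size);
	return 0;
}
```

An old caller that declares 8 bytes gets `flags` left at 0, a newer caller that declares 24 bytes is accepted as long as the extra 8 bytes are zero, and a non-zero byte in the unknown tail is rejected.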
2013-11-07 20:43:36 +07:00
|
|
|
int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
|
|
|
|
{
|
2015-06-11 19:46:38 +07:00
|
|
|
return __sched_setscheduler(p, attr, true, true);
|
2013-11-07 20:43:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(sched_setattr);
|
|
|
|
|
2008-06-23 10:55:38 +07:00
|
|
|
/**
|
|
|
|
* sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
|
|
|
|
* @p: the task in question.
|
|
|
|
* @policy: new policy.
|
|
|
|
* @param: structure containing the new RT priority.
|
|
|
|
*
|
|
|
|
* Just like sched_setscheduler, only don't bother checking if the
|
|
|
|
* current context has permission. For example, this is needed in
|
|
|
|
* stop_machine(): we create temporary high priority worker threads,
|
|
|
|
* but our caller might not have that capability.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: 0 on success. An error code otherwise.
|
2008-06-23 10:55:38 +07:00
|
|
|
*/
|
|
|
|
int sched_setscheduler_nocheck(struct task_struct *p, int policy,
|
2010-10-21 06:01:12 +07:00
|
|
|
const struct sched_param *param)
|
2008-06-23 10:55:38 +07:00
|
|
|
{
|
2014-01-15 23:05:04 +07:00
|
|
|
return _sched_setscheduler(p, policy, param, false);
|
2008-06-23 10:55:38 +07:00
|
|
|
}
|
2015-09-02 15:28:44 +07:00
|
|
|
EXPORT_SYMBOL_GPL(sched_setscheduler_nocheck);
|
2008-06-23 10:55:38 +07:00
|
|
|
|
2005-09-10 14:26:11 +07:00
|
|
|
static int
|
|
|
|
do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct sched_param lparam;
|
|
|
|
struct task_struct *p;
|
2006-07-03 14:25:41 +07:00
|
|
|
int retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
if (!param || pid < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
|
|
|
|
return -EFAULT;
|
2006-09-29 16:00:48 +07:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
retval = -ESRCH;
|
2005-04-17 05:20:36 +07:00
|
|
|
p = find_process_by_pid(pid);
|
2006-09-29 16:00:48 +07:00
|
|
|
if (p != NULL)
|
|
|
|
retval = sched_setscheduler(p, policy, &lparam);
|
|
|
|
rcu_read_unlock();
|
2006-07-03 14:25:41 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2013-11-07 20:43:36 +07:00
|
|
|
/*
|
|
|
|
* Mimics kernel/events/core.c perf_copy_attr().
|
|
|
|
*/
|
|
|
|
static int sched_copy_attr(struct sched_attr __user *uattr,
|
|
|
|
struct sched_attr *attr)
|
|
|
|
{
|
|
|
|
u32 size;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* zero the full structure, so that a short copy will be nice.
|
|
|
|
*/
|
|
|
|
memset(attr, 0, sizeof(*attr));
|
|
|
|
|
|
|
|
ret = get_user(size, &uattr->size);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (size > PAGE_SIZE) /* silly large */
|
|
|
|
goto err_size;
|
|
|
|
|
|
|
|
if (!size) /* abi compat */
|
|
|
|
size = SCHED_ATTR_SIZE_VER0;
|
|
|
|
|
|
|
|
if (size < SCHED_ATTR_SIZE_VER0)
|
|
|
|
goto err_size;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're handed a bigger struct than we know of,
|
|
|
|
* ensure all the unknown bits are 0 - i.e. new
|
|
|
|
* user-space does not rely on any kernel feature
|
|
|
|
* extensions we don't know about yet.
|
|
|
|
*/
|
|
|
|
if (size > sizeof(*attr)) {
|
|
|
|
unsigned char __user *addr;
|
|
|
|
unsigned char __user *end;
|
|
|
|
unsigned char val;
|
|
|
|
|
|
|
|
addr = (void __user *)uattr + sizeof(*attr);
|
|
|
|
end = (void __user *)uattr + size;
|
|
|
|
|
|
|
|
for (; addr < end; addr++) {
|
|
|
|
ret = get_user(val, addr);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
if (val)
|
|
|
|
goto err_size;
|
|
|
|
}
|
|
|
|
size = sizeof(*attr);
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = copy_from_user(attr, uattr, size);
|
|
|
|
if (ret)
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* XXX: do we want to be lenient like existing syscalls; or do we want
|
|
|
|
* to be strict and return an error on out-of-bounds values?
|
|
|
|
*/
|
2014-02-11 14:34:50 +07:00
|
|
|
attr->sched_nice = clamp(attr->sched_nice, MIN_NICE, MAX_NICE);
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:36 +07:00
|
|
|
|
2014-05-09 21:54:28 +07:00
|
|
|
return 0;
|
2013-11-07 20:43:36 +07:00
|
|
|
|
|
|
|
err_size:
|
|
|
|
put_user(sizeof(*attr), &uattr->size);
|
2014-05-09 21:54:28 +07:00
|
|
|
return -E2BIG;
|
2013-11-07 20:43:36 +07:00
|
|
|
}
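The size negotiation performed by sched_copy_attr() above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `demo_attr`, `demo_copy_attr()`, `DEMO_SIZE_VER0` and `DEMO_MAX_SIZE` are hypothetical names, an ordinary byte buffer stands in for the `__user` pointer, and `memcpy()` stands in for `get_user()`/`copy_from_user()`. It is not the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SIZE_VER0 48u   /* hypothetical: smallest layout accepted */
#define DEMO_MAX_SIZE  4096u /* hypothetical stand-in for PAGE_SIZE */

/* Hypothetical "known" layout; the size field comes first. */
struct demo_attr {
	uint32_t size;
	uint32_t policy;
	uint64_t flags;
	int32_t  nice;
	uint32_t priority;
	uint64_t runtime;
	uint64_t deadline;
	uint64_t period;
};

/*
 * Copy a caller-supplied attribute blob into *attr.
 * Returns 0 on success, -E2BIG on a size mismatch, -EFAULT if the
 * buffer is shorter than the size it claims.
 */
static int demo_copy_attr(const unsigned char *ubuf, size_t ubuf_len,
			  struct demo_attr *attr)
{
	uint32_t size;
	size_t i;

	if (ubuf_len < sizeof(size))
		return -EFAULT;

	/* Zero everything first so a short copy leaves defined values. */
	memset(attr, 0, sizeof(*attr));
	memcpy(&size, ubuf, sizeof(size));

	if (size > DEMO_MAX_SIZE)	/* unreasonably large */
		return -E2BIG;
	if (!size)			/* 0 means "oldest layout" */
		size = DEMO_SIZE_VER0;
	if (size < DEMO_SIZE_VER0)
		return -E2BIG;
	if (size > ubuf_len)
		return -EFAULT;

	/*
	 * A larger struct than we know is fine only if every byte
	 * beyond our layout is zero.
	 */
	if (size > sizeof(*attr)) {
		for (i = sizeof(*attr); i < size; i++)
			if (ubuf[i])
				return -E2BIG;
		size = sizeof(*attr);
	}

	memcpy(attr, ubuf, size);
	return 0;
}
```

The point of the design is that a newer caller can pass a longer struct to an older receiver as long as it does not actually use any field the receiver cannot interpret.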
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/**
|
|
|
|
* sys_sched_setscheduler - set/change the scheduler policy and RT priority
|
|
|
|
* @pid: the pid in question.
|
|
|
|
* @policy: new policy.
|
|
|
|
* @param: structure containing the new RT priority.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: 0 on success. An error code otherwise.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
|
|
|
|
struct sched_param __user *, param)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-01-19 08:43:03 +07:00
|
|
|
/* negative values for policy are not valid */
|
|
|
|
if (policy < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return do_sched_setscheduler(pid, policy, param);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sys_sched_setparam - set/change the RT priority of a thread
|
|
|
|
* @pid: the pid in question.
|
|
|
|
* @param: structure containing the new RT priority.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: 0 on success. An error code otherwise.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-07-23 22:28:26 +07:00
|
|
|
return do_sched_setscheduler(pid, SETPARAM_POLICY, param);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2013-11-07 20:43:36 +07:00
|
|
|
/**
|
|
|
|
* sys_sched_setattr - same as above, but with extended sched_attr
|
|
|
|
* @pid: the pid in question.
|
2014-01-14 22:10:39 +07:00
|
|
|
* @uattr: structure containing the extended parameters.
|
2014-04-17 23:59:15 +07:00
|
|
|
* @flags: for future extension.
|
2013-11-07 20:43:36 +07:00
|
|
|
*/
|
2014-02-14 23:19:29 +07:00
|
|
|
SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
|
|
|
|
unsigned int, flags)
|
2013-11-07 20:43:36 +07:00
|
|
|
{
|
|
|
|
struct sched_attr attr;
|
|
|
|
struct task_struct *p;
|
|
|
|
int retval;
|
|
|
|
|
2014-02-14 23:19:29 +07:00
|
|
|
if (!uattr || pid < 0 || flags)
|
2013-11-07 20:43:36 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2014-05-09 21:54:15 +07:00
|
|
|
retval = sched_copy_attr(uattr, &attr);
|
|
|
|
if (retval)
|
|
|
|
return retval;
|
2013-11-07 20:43:36 +07:00
|
|
|
|
2014-06-03 03:38:34 +07:00
|
|
|
if ((int)attr.sched_policy < 0)
|
2014-05-09 15:49:03 +07:00
|
|
|
return -EINVAL;
|
2013-11-07 20:43:36 +07:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
retval = -ESRCH;
|
|
|
|
p = find_process_by_pid(pid);
|
|
|
|
if (p != NULL)
|
|
|
|
retval = sched_setattr(p, &attr);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/**
|
|
|
|
* sys_sched_getscheduler - get the policy (scheduling class) of a thread
|
|
|
|
* @pid: the pid in question.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: On success, the policy of the thread. Otherwise, a negative error
|
|
|
|
* code.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *p;
|
2007-10-15 22:00:14 +07:00
|
|
|
int retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
if (pid < 0)
|
2007-10-15 22:00:14 +07:00
|
|
|
return -EINVAL;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
retval = -ESRCH;
|
2009-12-09 17:14:58 +07:00
|
|
|
rcu_read_lock();
|
2005-04-17 05:20:36 +07:00
|
|
|
p = find_process_by_pid(pid);
|
|
|
|
if (p) {
|
|
|
|
retval = security_task_getscheduler(p);
|
|
|
|
if (!retval)
|
2009-06-15 22:17:47 +07:00
|
|
|
retval = p->policy
|
|
|
|
| (p->sched_reset_on_fork ? SCHED_RESET_ON_FORK : 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2009-12-09 17:14:58 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
return retval;
|
|
|
|
}
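Because sys_sched_getscheduler() above ORs the reset-on-fork flag into the returned policy, a caller has to mask it off before comparing against a policy number. A minimal sketch, assuming the flag value 0x40000000; the `demo_*` helpers are hypothetical names and not part of any library. A negative return is an error code and must be checked before decoding.

```c
#include <stdbool.h>

/* Assumed value of SCHED_RESET_ON_FORK; defined locally for the demo. */
#define DEMO_SCHED_RESET_ON_FORK 0x40000000

/* Policy number with the reset-on-fork flag stripped. */
static int demo_policy(int ret)
{
	return ret & ~DEMO_SCHED_RESET_ON_FORK;
}

/* True if the reset-on-fork flag is set in the return value. */
static bool demo_reset_on_fork(int ret)
{
	return (ret & DEMO_SCHED_RESET_ON_FORK) != 0;
}
```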
|
|
|
|
|
|
|
|
/**
|
2009-06-15 22:17:47 +07:00
|
|
|
* sys_sched_getparam - get the RT priority of a thread
|
2005-04-17 05:20:36 +07:00
|
|
|
* @pid: the pid in question.
|
|
|
|
* @param: structure containing the RT priority.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: On success, 0 and the RT priority is in @param. Otherwise, an error
|
|
|
|
* code.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-05-13 03:50:34 +07:00
|
|
|
struct sched_param lp = { .sched_priority = 0 };
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *p;
|
2007-10-15 22:00:14 +07:00
|
|
|
int retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
if (!param || pid < 0)
|
2007-10-15 22:00:14 +07:00
|
|
|
return -EINVAL;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-12-09 17:14:58 +07:00
|
|
|
rcu_read_lock();
|
2005-04-17 05:20:36 +07:00
|
|
|
p = find_process_by_pid(pid);
|
|
|
|
retval = -ESRCH;
|
|
|
|
if (!p)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
retval = security_task_getscheduler(p);
|
|
|
|
if (retval)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2014-05-13 03:50:34 +07:00
|
|
|
if (task_has_rt_policy(p))
|
|
|
|
lp.sched_priority = p->rt_priority;
|
2009-12-09 17:14:58 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* copy_to_user() might sleep, so it must run only after rcu_read_unlock().
|
|
|
|
*/
|
|
|
|
retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
|
|
|
|
|
|
|
|
return retval;
|
|
|
|
|
|
|
|
out_unlock:
|
2009-12-09 17:14:58 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2013-11-07 20:43:36 +07:00
|
|
|
static int sched_read_attr(struct sched_attr __user *uattr,
|
|
|
|
struct sched_attr *attr,
|
|
|
|
unsigned int usize)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!access_ok(VERIFY_WRITE, uattr, usize))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're handed a smaller struct than we know of,
|
|
|
|
* ensure all the unknown bits are 0 - i.e. old
|
|
|
|
* user-space does not get incomplete information.
|
|
|
|
*/
|
|
|
|
if (usize < sizeof(*attr)) {
|
|
|
|
unsigned char *addr;
|
|
|
|
unsigned char *end;
|
|
|
|
|
|
|
|
addr = (void *)attr + usize;
|
|
|
|
end = (void *)attr + sizeof(*attr);
|
|
|
|
|
|
|
|
for (; addr < end; addr++) {
|
|
|
|
if (*addr)
|
2014-05-09 21:54:33 +07:00
|
|
|
return -EFBIG;
|
2013-11-07 20:43:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
attr->size = usize;
|
|
|
|
}
|
|
|
|
|
2014-02-17 04:24:17 +07:00
|
|
|
ret = copy_to_user(uattr, attr, attr->size);
|
2013-11-07 20:43:36 +07:00
|
|
|
if (ret)
|
|
|
|
return -EFAULT;
|
|
|
|
|
2014-05-09 21:54:33 +07:00
|
|
|
return 0;
|
2013-11-07 20:43:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
the SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking whether a task belongs to the new
policy are also added where they are needed.
Adds a scheduling class, in sched/dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant at which a task (or
rather, an instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that
each task runs for at most its runtime in every time interval of
(relative) deadline length, avoiding any interference between
different tasks (bandwidth isolation).
Thanks to this feature, even tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
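The EDF rule described in the message above (absolute deadline = activation time + relative deadline, smallest absolute deadline runs first) can be sketched in a few lines. This is an illustrative model only: the names `demo_dl_task`, `demo_abs_deadline()`, `demo_time_before()` and `demo_edf_pick()` are hypothetical, and the linear scan stands in for whatever ordered structure a real scheduler would use. CBS budget enforcement is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task state for the current instance. */
struct demo_dl_task {
	uint64_t activation_ns;   /* time at which the instance activated */
	uint64_t rel_deadline_ns; /* relative deadline of the task */
};

/* Absolute deadline = activation instant + relative deadline. */
static uint64_t demo_abs_deadline(const struct demo_dl_task *t)
{
	return t->activation_ns + t->rel_deadline_ns;
}

/* "a is earlier than b", tolerant of 64-bit clock wraparound. */
static int demo_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Return the index of the task with the smallest absolute deadline.
 * Ties keep the lowest index. */
static size_t demo_edf_pick(const struct demo_dl_task *tasks, size_t n)
{
	size_t best = 0;

	assert(tasks != NULL && n > 0);

	for (size_t i = 1; i < n; i++) {
		if (demo_time_before(demo_abs_deadline(&tasks[i]),
				     demo_abs_deadline(&tasks[best])))
			best = i;
	}
	return best;
}
```

For tasks activated at 100, 120 and 90 ns with relative deadlines of 50, 10 and 70 ns, the absolute deadlines are 150, 130 and 160 ns, so the second task is selected even though it activated last.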
2013-11-28 17:14:43 +07:00
|
|
|
* sys_sched_getattr - similar to sched_getparam, but with sched_attr
|
2013-11-07 20:43:36 +07:00
|
|
|
* @pid: the pid in question.
|
2014-01-14 22:10:39 +07:00
|
|
|
* @uattr: structure containing the extended parameters.
|
2013-11-07 20:43:36 +07:00
|
|
|
* @size: sizeof(attr) for fwd/bwd comp.
|
2014-04-17 23:59:15 +07:00
|
|
|
* @flags: for future extension.
|
2013-11-07 20:43:36 +07:00
|
|
|
*/
|
2014-02-14 23:19:29 +07:00
|
|
|
SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
|
|
|
|
unsigned int, size, unsigned int, flags)
|
2013-11-07 20:43:36 +07:00
|
|
|
{
|
|
|
|
struct sched_attr attr = {
|
|
|
|
.size = sizeof(struct sched_attr),
|
|
|
|
};
|
|
|
|
struct task_struct *p;
|
|
|
|
int retval;
|
|
|
|
|
|
|
|
if (!uattr || pid < 0 || size > PAGE_SIZE ||
|
2014-02-14 23:19:29 +07:00
|
|
|
size < SCHED_ATTR_SIZE_VER0 || flags)
|
2013-11-07 20:43:36 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
p = find_process_by_pid(pid);
|
|
|
|
retval = -ESRCH;
|
|
|
|
if (!p)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
retval = security_task_getscheduler(p);
|
|
|
|
if (retval)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
attr.sched_policy = p->policy;
|
2014-01-15 23:05:04 +07:00
|
|
|
if (p->sched_reset_on_fork)
|
|
|
|
attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
|
2013-11-28 17:14:43 +07:00
|
|
|
if (task_has_dl_policy(p))
|
|
|
|
__getparam_dl(p, &attr);
|
|
|
|
else if (task_has_rt_policy(p))
|
2013-11-07 20:43:36 +07:00
|
|
|
attr.sched_priority = p->rt_priority;
|
|
|
|
else
|
2014-01-28 10:00:45 +07:00
|
|
|
attr.sched_nice = task_nice(p);
|
2013-11-07 20:43:36 +07:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
retval = sched_read_attr(uattr, &attr, size);
|
|
|
|
return retval;
|
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return retval;
|
|
|
|
}
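The argument check at the top of sched_getattr() above can be modelled in user space to see which inputs are rejected. This is a sketch under assumptions: `demo_getattr_check_args()` is a hypothetical stand-in, `DEMO_PAGE_SIZE` assumes 4 KiB pages, and `DEMO_SCHED_ATTR_SIZE_VER0` assumes the first published sched_attr layout is 48 bytes.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE            4096u /* assumption: 4 KiB pages */
#define DEMO_SCHED_ATTR_SIZE_VER0 48u   /* assumption: size of the first layout */

/* Mirror of the syscall's up-front validation: the buffer must exist, the
 * pid must be non-negative, the caller's size must be at least the first
 * layout and at most one page, and no flags are defined yet. */
static int demo_getattr_check_args(const void *uattr, int pid,
				   unsigned int size, unsigned int flags)
{
	if (!uattr || pid < 0 || size > DEMO_PAGE_SIZE ||
	    size < DEMO_SCHED_ATTR_SIZE_VER0 || flags)
		return -EINVAL;
	return 0;
}
```

Rejecting any non-zero `flags` now is what keeps the argument usable for future extension: existing callers pass zero, so a later kernel can assign meaning to individual bits without changing their behaviour.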
|
|
|
|
|
2008-11-24 23:05:14 +07:00
|
|
|
long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-11-24 23:05:11 +07:00
|
|
|
cpumask_var_t cpus_allowed, new_mask;
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *p;
|
|
|
|
int retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-12-09 17:15:01 +07:00
|
|
|
rcu_read_lock();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
p = find_process_by_pid(pid);
|
|
|
|
if (!p) {
|
2009-12-09 17:15:01 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
return -ESRCH;
|
|
|
|
}
|
|
|
|
|
2009-12-09 17:15:01 +07:00
|
|
|
/* Prevent p going away */
|
2005-04-17 05:20:36 +07:00
|
|
|
get_task_struct(p);
|
2009-12-09 17:15:01 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2013-03-20 03:45:20 +07:00
|
|
|
if (p->flags & PF_NO_SETAFFINITY) {
|
|
|
|
retval = -EINVAL;
|
|
|
|
goto out_put_task;
|
|
|
|
}
|
2008-11-24 23:05:11 +07:00
|
|
|
if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
|
|
|
|
retval = -ENOMEM;
|
|
|
|
goto out_put_task;
|
|
|
|
}
|
|
|
|
if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
|
|
|
|
retval = -ENOMEM;
|
|
|
|
goto out_free_cpus_allowed;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
retval = -EPERM;
|
2012-07-26 19:05:21 +07:00
|
|
|
if (!check_same_owner(p)) {
|
|
|
|
rcu_read_lock();
|
|
|
|
if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
|
|
|
|
rcu_read_unlock();
|
2014-09-23 01:36:30 +07:00
|
|
|
goto out_free_new_mask;
|
2012-07-26 19:05:21 +07:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-10-15 02:21:18 +07:00
|
|
|
retval = security_task_setscheduler(p);
|
2006-06-23 16:03:59 +07:00
|
|
|
if (retval)
|
2014-09-23 01:36:30 +07:00
|
|
|
goto out_free_new_mask;
|
2006-06-23 16:03:59 +07:00
|
|
|
|
2013-12-17 16:03:34 +07:00
|
|
|
|
|
|
|
cpuset_cpus_allowed(p, cpus_allowed);
|
|
|
|
cpumask_and(new_mask, in_mask, cpus_allowed);
|
|
|
|
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the
available CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls are added, with similar names, equivalent meaning and
the same usage paradigm.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and -deadline tasks
stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
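The cap M * (sched_dl_runtime_us / sched_dl_period_us) from the message above can be turned into a small admission test. The sketch below is a user-space model, not the kernel's code: `demo_to_ratio()`, `demo_dl_params` and `demo_dl_admit()` are hypothetical names, the 20-bit fixed-point scale is an illustrative choice to avoid floating point, and treating a total exactly equal to the cap as admissible is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BW_SHIFT 20 /* illustrative fixed-point scale: 1.0 == 1 << 20 */

/* runtime / period as a fixed-point fraction (rounded down). */
static uint64_t demo_to_ratio(uint64_t period_us, uint64_t runtime_us)
{
	assert(period_us > 0);
	return (runtime_us << DEMO_BW_SHIFT) / period_us;
}

struct demo_dl_params {
	uint64_t runtime_us;
	uint64_t period_us;
};

/* Return 1 if the summed bandwidth of all tasks does not exceed
 * cpus * (dl_runtime_us / dl_period_us) for one root_domain, else 0. */
static int demo_dl_admit(const struct demo_dl_params *tasks, size_t n,
			 unsigned int cpus,
			 uint64_t dl_runtime_us, uint64_t dl_period_us)
{
	uint64_t cap = (uint64_t)cpus * demo_to_ratio(dl_period_us, dl_runtime_us);
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += demo_to_ratio(tasks[i].period_us, tasks[i].runtime_us);

	return total <= cap;
}
```

With two CPUs and illustrative settings of 950000 us runtime over a 1000000 us period, the cap is 1.9 CPUs' worth of bandwidth: two tasks that each need 90% of a CPU fit (1.8), while a third such task would push the total to 2.7 and be refused.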
2013-11-07 20:43:45 +07:00
|
|
|
/*
|
|
|
|
* Since bandwidth control happens on root_domain basis,
|
|
|
|
* if admission test is enabled, we only admit -deadline
|
|
|
|
* tasks allowed to run on all the CPUs in the task's
|
|
|
|
* root_domain.
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_SMP
|
2014-09-23 01:36:36 +07:00
|
|
|
if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
|
|
|
|
rcu_read_lock();
|
|
|
|
if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
|
2013-11-07 20:43:45 +07:00
|
|
|
retval = -EBUSY;
|
2014-09-23 01:36:36 +07:00
|
|
|
rcu_read_unlock();
|
2014-09-23 01:36:30 +07:00
|
|
|
goto out_free_new_mask;
|
2013-11-07 20:43:45 +07:00
|
|
|
}
|
2014-09-23 01:36:36 +07:00
|
|
|
rcu_read_unlock();
|
2013-11-07 20:43:45 +07:00
|
|
|
}
|
|
|
|
#endif
|
2010-10-18 02:46:10 +07:00
|
|
|
again:
|
2015-05-15 22:43:34 +07:00
|
|
|
retval = __set_cpus_allowed_ptr(p, new_mask, true);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-19 13:40:22 +07:00
|
|
|
if (!retval) {
|
2008-11-24 23:05:11 +07:00
|
|
|
cpuset_cpus_allowed(p, cpus_allowed);
|
|
|
|
if (!cpumask_subset(new_mask, cpus_allowed)) {
|
2007-10-19 13:40:22 +07:00
|
|
|
/*
|
|
|
|
* We must have raced with a concurrent cpuset
|
|
|
|
* update. Just reset the cpus_allowed to the
|
|
|
|
* cpuset's cpus_allowed
|
|
|
|
*/
|
2008-11-24 23:05:11 +07:00
|
|
|
cpumask_copy(new_mask, cpus_allowed);
|
2007-10-19 13:40:22 +07:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
}
|
2014-09-23 01:36:30 +07:00
|
|
|
out_free_new_mask:
|
2008-11-24 23:05:11 +07:00
|
|
|
free_cpumask_var(new_mask);
|
|
|
|
out_free_cpus_allowed:
|
|
|
|
free_cpumask_var(cpus_allowed);
|
|
|
|
out_put_task:
|
2005-04-17 05:20:36 +07:00
|
|
|
put_task_struct(p);
|
|
|
|
return retval;
|
|
|
|
}
|
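The SCHED_DEADLINE bandwidth commit message quoted in this listing states the admission rule as a per-root_domain cap of M * (sched_dl_runtime_us / sched_dl_period_us). The following is a minimal user-space sketch of that arithmetic only. The names `demo_dl_task` and `demo_dl_admit` are hypothetical, floating point is used purely for readability, and admitting a task set that lands exactly on the cap is an assumption; none of this is the kernel's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one -deadline task: runtime over period. */
struct demo_dl_task {
	uint64_t runtime_us;
	uint64_t period_us;
};

/*
 * Return true if the summed bandwidth of all tasks fits under
 * m_cpus * (dl_runtime_us / dl_period_us), the cap quoted in the
 * commit message. Treating "exactly at the cap" as admissible is an
 * assumption of this sketch.
 */
static bool demo_dl_admit(const struct demo_dl_task *tasks, size_t n,
			  unsigned int m_cpus,
			  uint64_t dl_runtime_us, uint64_t dl_period_us)
{
	double total = 0.0;
	double limit;
	size_t i;

	if (dl_period_us == 0)
		return false;	/* no meaningful cap without a period */
	limit = (double)m_cpus * ((double)dl_runtime_us / (double)dl_period_us);

	for (i = 0; i < n; i++) {
		if (tasks[i].period_us == 0)
			return false;	/* reject malformed task parameters */
		total += (double)tasks[i].runtime_us / (double)tasks[i].period_us;
	}
	return total <= limit;
}
```

With two CPUs and a 950000/1000000 setting the cap is 1.9, so three tasks at 10% each fit, while four tasks at 50% each do not.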
|
|
|
|
|
|
|
static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
|
2008-11-24 23:05:14 +07:00
|
|
|
struct cpumask *new_mask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-11-24 23:05:14 +07:00
|
|
|
if (len < cpumask_size())
|
|
|
|
cpumask_clear(new_mask);
|
|
|
|
else if (len > cpumask_size())
|
|
|
|
len = cpumask_size();
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
|
|
|
|
}
|
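The length handling in get_user_cpu_mask() above can be modelled in user space. This is only a sketch: `DEMO_MASK_BYTES` is a hypothetical stand-in for cpumask_size(), `demo_get_mask` is an invented name, and memcpy() replaces copy_from_user(), so the -EFAULT path is not modelled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for cpumask_size(): bytes in the kernel-side mask. */
#define DEMO_MASK_BYTES 16

/*
 * Copy a caller-supplied mask of len bytes into a fixed-size destination.
 * Returns the number of bytes actually copied.
 */
static size_t demo_get_mask(const unsigned char *user, size_t len,
			    unsigned char dst[DEMO_MASK_BYTES])
{
	if (len < DEMO_MASK_BYTES)
		memset(dst, 0, DEMO_MASK_BYTES);	/* short input: clear the whole mask first */
	else if (len > DEMO_MASK_BYTES)
		len = DEMO_MASK_BYTES;			/* long input: ignore the extra bytes */

	memcpy(dst, user, len);
	return len;
}
```

A short buffer therefore leaves the uncopied tail of the mask zeroed, and an oversized buffer is silently truncated to the mask size.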
|
|
|
|
|
|
|
/**
|
|
|
|
* sys_sched_setaffinity - set the cpu affinity of a process
|
|
|
|
* @pid: pid of the process
|
|
|
|
* @len: length in bytes of the bitmask pointed to by user_mask_ptr
|
|
|
|
* @user_mask_ptr: user-space pointer to the new cpu mask
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: 0 on success. An error code otherwise.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
|
|
|
|
unsigned long __user *, user_mask_ptr)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-11-24 23:05:11 +07:00
|
|
|
cpumask_var_t new_mask;
|
2005-04-17 05:20:36 +07:00
|
|
|
int retval;
|
|
|
|
|
2008-11-24 23:05:11 +07:00
|
|
|
if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
|
|
|
|
return -ENOMEM;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-11-24 23:05:11 +07:00
|
|
|
retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
|
|
|
|
if (retval == 0)
|
|
|
|
retval = sched_setaffinity(pid, new_mask);
|
|
|
|
free_cpumask_var(new_mask);
|
|
|
|
return retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2008-11-24 23:05:14 +07:00
|
|
|
long sched_getaffinity(pid_t pid, struct cpumask *mask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *p;
|
2009-12-09 03:24:16 +07:00
|
|
|
unsigned long flags;
|
2005-04-17 05:20:36 +07:00
|
|
|
int retval;
|
|
|
|
|
2009-12-09 17:15:01 +07:00
|
|
|
rcu_read_lock();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
retval = -ESRCH;
|
|
|
|
p = find_process_by_pid(pid);
|
|
|
|
if (!p)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2006-06-23 16:03:59 +07:00
|
|
|
retval = security_task_getscheduler(p);
|
|
|
|
if (retval)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2011-04-05 22:23:45 +07:00
|
|
|
raw_spin_lock_irqsave(&p->pi_lock, flags);
|
2013-10-11 19:38:20 +07:00
|
|
|
cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);
|
2011-04-05 22:23:45 +07:00
|
|
|
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
out_unlock:
|
2009-12-09 17:15:01 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-08-09 16:16:46 +07:00
|
|
|
return retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sys_sched_getaffinity - get the cpu affinity of a process
|
|
|
|
* @pid: pid of the process
|
|
|
|
* @len: length in bytes of the bitmask pointed to by user_mask_ptr
|
|
|
|
* @user_mask_ptr: user-space pointer to hold the current cpu mask
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
2016-06-27 04:13:23 +07:00
|
|
|
* Return: size of CPU mask copied to user_mask_ptr on success. An
|
|
|
|
* error code otherwise.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
|
|
|
|
unsigned long __user *, user_mask_ptr)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int ret;
|
2008-11-24 23:05:11 +07:00
|
|
|
cpumask_var_t mask;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-04-06 14:02:19 +07:00
|
|
|
if ((len * BITS_PER_BYTE) < nr_cpu_ids)
|
sched: sched_getaffinity(): Allow less than NR_CPUS length
[ Note, this commit changes the syscall ABI for systems with > 1024 CPUs. ]
Recently, some distros decided to use NR_CPUS=4096 for mysterious reasons.
Unfortunately, the glibc sched interface has the following definition:
# define __CPU_SETSIZE 1024
# define __NCPUBITS (8 * sizeof (__cpu_mask))
typedef unsigned long int __cpu_mask;
typedef struct
{
__cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
} cpu_set_t;
This means that, if NR_CPUS is bigger than 1024, cpu_set_t creates an
ABI issue ...
More recently, Sharyathi Nagesh reported that the following test program
hits a mysterious syscall failure:
-----------------------------------------------------------------------
#define _GNU_SOURCE
#include<stdio.h>
#include<errno.h>
#include<sched.h>
int main()
{
cpu_set_t set;
if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0)
printf("\n Call is failing with:%d", errno);
}
-----------------------------------------------------------------------
This happens because the kernel assumes that the len argument of
sched_getaffinity() is bigger than NR_CPUS, which no longer holds.
Now we are faced with the following annoying dilemma, due to
the limitations of the glibc interface built in years ago:
(1) if we change glibc's __CPU_SETSIZE definition, we lose
binary compatibility of _all_ applications.
(2) if we don't change it, we also lose binary compatibility of
Sharyathi's use case.
So, I would propose to change the rule for the len argument of
sched_getaffinity().
Old:
len should be bigger than NR_CPUS
New:
len should be bigger than maximum possible cpu id
This creates the following behavior:
(A) On a real 4096-CPU machine, the above test program still
returns -EINVAL.
(B) With NR_CPUS=4096 but a machine that has fewer than 1024 CPUs (almost
all machines in the world), the above can run successfully.
Fortunately, BIG SGI machines are mainly used for HPC use cases. This means
they can rebuild their programs.
IOW we hope they are not annoyed by this issue ...
Reported-by: Sharyathi Nagesh <sharyath@in.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-12 14:15:36 +07:00
|
|
|
return -EINVAL;
|
|
|
|
if (len & (sizeof(unsigned long)-1))
|
2005-04-17 05:20:36 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2008-11-24 23:05:11 +07:00
|
|
|
if (!alloc_cpumask_var(&mask, GFP_KERNEL))
|
|
|
|
return -ENOMEM;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-11-24 23:05:11 +07:00
|
|
|
ret = sched_getaffinity(pid, mask);
|
|
|
|
if (ret == 0) {
|
2010-03-17 07:36:58 +07:00
|
|
|
size_t retlen = min_t(size_t, len, cpumask_size());
|
2010-03-12 14:15:36 +07:00
|
|
|
|
|
|
|
if (copy_to_user(user_mask_ptr, mask, retlen))
|
2008-11-24 23:05:11 +07:00
|
|
|
ret = -EFAULT;
|
|
|
|
else
|
2010-03-12 14:15:36 +07:00
|
|
|
ret = retlen;
|
2008-11-24 23:05:11 +07:00
|
|
|
}
|
|
|
|
free_cpumask_var(mask);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-11-24 23:05:11 +07:00
|
|
|
return ret;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
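The length checks in the sched_getaffinity() syscall above can be sketched as a plain function. This is a user-space model under stated assumptions: `demo_getaffinity_len` is a hypothetical name, and `nr_cpu_ids` and `mask_bytes` (standing in for cpumask_size()) are passed as parameters, whereas the kernel reads them from its own state.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Return the number of bytes that would be copied to user space,
 * or -EINVAL if len fails either check.
 */
static long demo_getaffinity_len(size_t len, unsigned int nr_cpu_ids,
				 size_t mask_bytes)
{
	if (len * CHAR_BIT < nr_cpu_ids)
		return -EINVAL;		/* buffer cannot cover every possible CPU id */
	if (len & (sizeof(unsigned long) - 1))
		return -EINVAL;		/* len must be a multiple of the word size */

	return (long)(len < mask_bytes ? len : mask_bytes);
}
```

This mirrors cases (A) and (B) from the commit message: a 128-byte cpu_set_t is accepted when only 64 CPU ids are possible, even if the kernel mask is 512 bytes (NR_CPUS=4096), but is rejected when 4096 CPU ids are possible.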
|
|
|
|
|
|
|
/**
|
|
|
|
* sys_sched_yield - yield the current processor to other threads.
|
|
|
|
*
|
2007-07-09 23:51:59 +07:00
|
|
|
* This function yields the current CPU to other tasks. If there are no
|
|
|
|
* other threads running on this CPU then this function will return.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: 0.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE0(sched_yield)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:42 +07:00
|
|
|
struct rq *rq = this_rq_lock();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-15 22:00:12 +07:00
|
|
|
schedstat_inc(rq, yld_count);
|
2007-10-15 22:00:08 +07:00
|
|
|
current->sched_class->yield_task(rq);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since we are going to call schedule() anyway, there's
|
|
|
|
* no need to preempt or enable interrupts:
|
|
|
|
*/
|
|
|
|
__release(rq->lock);
|
2006-07-03 14:24:54 +07:00
|
|
|
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
|
2009-12-04 02:55:53 +07:00
|
|
|
do_raw_spin_unlock(&rq->lock);
|
2011-03-21 19:32:17 +07:00
|
|
|
sched_preempt_enable_no_resched();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
schedule();
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-01-26 03:08:28 +07:00
|
|
|
int __sched _cond_resched(void)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2015-07-15 16:52:04 +07:00
|
|
|
if (should_resched(0)) {
|
2015-01-23 00:08:04 +07:00
|
|
|
preempt_schedule_common();
|
2005-04-17 05:20:36 +07:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2008-01-26 03:08:28 +07:00
|
|
|
EXPORT_SYMBOL(_cond_resched);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
2009-07-16 20:44:29 +07:00
|
|
|
* __cond_resched_lock() - if a reschedule is pending, drop the given lock,
|
2005-04-17 05:20:36 +07:00
|
|
|
* call schedule, and on return reacquire the lock.
|
|
|
|
*
|
2007-12-05 21:46:09 +07:00
|
|
|
* This works OK both with and without CONFIG_PREEMPT. We do strange low-level
|
2005-04-17 05:20:36 +07:00
|
|
|
* operations here to prevent schedule() from being called twice (once via
|
|
|
|
* spin_unlock(), once by hand).
|
|
|
|
*/
|
2009-07-16 20:44:29 +07:00
|
|
|
int __cond_resched_lock(spinlock_t *lock)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2015-07-15 16:52:04 +07:00
|
|
|
int resched = should_resched(PREEMPT_LOCK_OFFSET);
|
2005-06-14 05:52:32 +07:00
|
|
|
int ret = 0;
|
|
|
|
|
2009-07-21 00:16:29 +07:00
|
|
|
lockdep_assert_held(lock);
|
|
|
|
|
2014-06-21 06:49:01 +07:00
|
|
|
if (spin_needbreak(lock) || resched) {
|
2005-04-17 05:20:36 +07:00
|
|
|
spin_unlock(lock);
|
2009-07-10 19:57:57 +07:00
|
|
|
if (resched)
|
2015-01-23 00:08:04 +07:00
|
|
|
preempt_schedule_common();
|
2008-01-30 19:31:20 +07:00
|
|
|
else
|
|
|
|
cpu_relax();
|
2005-06-14 05:52:32 +07:00
|
|
|
ret = 1;
|
2005-04-17 05:20:36 +07:00
|
|
|
spin_lock(lock);
|
|
|
|
}
|
2005-06-14 05:52:32 +07:00
|
|
|
return ret;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2009-07-16 20:44:29 +07:00
|
|
|
EXPORT_SYMBOL(__cond_resched_lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-07-16 20:44:29 +07:00
|
|
|
int __sched __cond_resched_softirq(void)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
BUG_ON(!in_softirq());
|
|
|
|
|
2015-07-15 16:52:04 +07:00
|
|
|
if (should_resched(SOFTIRQ_DISABLE_OFFSET)) {
|
2007-05-24 03:58:18 +07:00
|
|
|
local_bh_enable();
|
2015-01-23 00:08:04 +07:00
|
|
|
preempt_schedule_common();
|
2005-04-17 05:20:36 +07:00
|
|
|
local_bh_disable();
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2009-07-16 20:44:29 +07:00
|
|
|
EXPORT_SYMBOL(__cond_resched_softirq);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/**
|
|
|
|
* yield - yield the current processor to other threads.
|
|
|
|
*
|
2012-03-07 00:54:26 +07:00
|
|
|
* Do not ever use this function, there's a 99% chance you're doing it wrong.
|
|
|
|
*
|
|
|
|
* The scheduler is at all times free to pick the calling task as the most
|
|
|
|
* eligible task to run, if removing the yield() call from your code breaks
|
|
|
|
* it, it's already broken.
|
|
|
|
*
|
|
|
|
* Typical broken usage is:
|
|
|
|
*
|
|
|
|
* while (!event)
|
|
|
|
* yield();
|
|
|
|
*
|
|
|
|
* where one assumes that yield() will let 'the other' process run that will
|
|
|
|
* make event true. If the current task is a SCHED_FIFO task that will never
|
|
|
|
* happen. Never use yield() as a progress guarantee!!
|
|
|
|
*
|
|
|
|
* If you want to use yield() to wait for something, use wait_event().
|
|
|
|
* If you want to use yield() to be 'nice' for others, use cond_resched().
|
|
|
|
* If you still want to use yield(), do not!
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
void __sched yield(void)
|
|
|
|
{
|
|
|
|
set_current_state(TASK_RUNNING);
|
|
|
|
sys_sched_yield();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(yield);
|
|
|
|
|
2011-02-01 21:50:51 +07:00
|
|
|
/**
|
|
|
|
* yield_to - yield the current processor to another thread in
|
|
|
|
* your thread group, or accelerate that thread toward the
|
|
|
|
* processor it's on.
|
2011-03-18 23:34:53 +07:00
|
|
|
* @p: target task
|
|
|
|
* @preempt: whether task preemption is allowed or not
|
2011-02-01 21:50:51 +07:00
|
|
|
*
|
|
|
|
* It's the caller's job to ensure that the target task struct
|
|
|
|
* can't go away on us before we can do any checks.
|
|
|
|
*
|
2013-07-13 01:45:47 +07:00
|
|
|
* Return:
|
2013-01-22 14:39:13 +07:00
|
|
|
* true (>0) if we indeed boosted the target task.
|
|
|
|
* false (0) if we failed to boost the target.
|
|
|
|
* -ESRCH if there's no task to yield to.
|
2011-02-01 21:50:51 +07:00
|
|
|
*/
|
2014-05-23 17:20:42 +07:00
|
|
|
int __sched yield_to(struct task_struct *p, bool preempt)
|
2011-02-01 21:50:51 +07:00
|
|
|
{
|
|
|
|
struct task_struct *curr = current;
|
|
|
|
struct rq *rq, *p_rq;
|
|
|
|
unsigned long flags;
|
2013-02-05 18:37:51 +07:00
|
|
|
int yielded = 0;
|
2011-02-01 21:50:51 +07:00
|
|
|
|
|
|
|
local_irq_save(flags);
|
|
|
|
rq = this_rq();
|
|
|
|
|
|
|
|
again:
|
|
|
|
p_rq = task_rq(p);
|
2013-01-22 14:39:13 +07:00
|
|
|
/*
|
|
|
|
* If we're the only runnable task on the rq and target rq also
|
|
|
|
* has only one task, there's absolutely no point in yielding.
|
|
|
|
*/
|
|
|
|
if (rq->nr_running == 1 && p_rq->nr_running == 1) {
|
|
|
|
yielded = -ESRCH;
|
|
|
|
goto out_irq;
|
|
|
|
}
|
|
|
|
|
2011-02-01 21:50:51 +07:00
|
|
|
double_rq_lock(rq, p_rq);
|
2013-11-23 16:38:01 +07:00
|
|
|
if (task_rq(p) != p_rq) {
|
2011-02-01 21:50:51 +07:00
|
|
|
double_rq_unlock(rq, p_rq);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!curr->sched_class->yield_to_task)
|
2013-01-22 14:39:13 +07:00
|
|
|
goto out_unlock;
|
2011-02-01 21:50:51 +07:00
|
|
|
|
|
|
|
if (curr->sched_class != p->sched_class)
|
2013-01-22 14:39:13 +07:00
|
|
|
goto out_unlock;
|
2011-02-01 21:50:51 +07:00
|
|
|
|
|
|
|
if (task_running(p_rq, p) || p->state)
|
2013-01-22 14:39:13 +07:00
|
|
|
goto out_unlock;
|
2011-02-01 21:50:51 +07:00
|
|
|
|
|
|
|
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
|
2011-03-02 07:28:21 +07:00
|
|
|
if (yielded) {
|
2011-02-01 21:50:51 +07:00
|
|
|
schedstat_inc(rq, yld_count);
|
2011-03-02 07:28:21 +07:00
|
|
|
/*
|
|
|
|
* Make p's CPU reschedule; pick_next_entity takes care of
|
|
|
|
* fairness.
|
|
|
|
*/
|
|
|
|
if (preempt && rq != p_rq)
|
2014-06-29 03:03:57 +07:00
|
|
|
resched_curr(p_rq);
|
2011-03-02 07:28:21 +07:00
|
|
|
}
|
2011-02-01 21:50:51 +07:00
|
|
|
|
2013-01-22 14:39:13 +07:00
|
|
|
out_unlock:
|
2011-02-01 21:50:51 +07:00
|
|
|
double_rq_unlock(rq, p_rq);
|
2013-01-22 14:39:13 +07:00
|
|
|
out_irq:
|
2011-02-01 21:50:51 +07:00
|
|
|
local_irq_restore(flags);
|
|
|
|
|
2013-01-22 14:39:13 +07:00
|
|
|
if (yielded > 0)
|
2011-02-01 21:50:51 +07:00
|
|
|
schedule();
|
|
|
|
|
|
|
|
return yielded;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(yield_to);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2007-12-05 21:46:09 +07:00
|
|
|
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
|
2005-04-17 05:20:36 +07:00
|
|
|
* that process accounting knows that this is a task in IO wait state.
|
|
|
|
*/
|
|
|
|
long __sched io_schedule_timeout(long timeout)
|
|
|
|
{
|
sched: Prevent recursion in io_schedule()
io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.
Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.
This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().
Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.
So:
- use ->in_iowait to detect recursion. Set it earlier, and restore
it to the old value.
- move the call to "raw_rq" after the call to blk_flush_plug().
As this is some sort of per-cpu thing, we want some chance that
we are on the right CPU
- When io_schedule() is called recursively, use blk_schedule_flush_plug()
which cannot further recurse.
- as this makes io_schedule() a lot more complex and as io_schedule()
must match io_schedule_timeout(), put all the changes in io_schedule_timeout()
and make io_schedule a simple wrapper for that.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-13 11:49:17 +07:00
|
|
|
int old_iowait = current->in_iowait;
|
|
|
|
struct rq *rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
long ret;
|
|
|
|
|
2015-02-13 11:49:17 +07:00
|
|
|
current->in_iowait = 1;
|
2015-05-09 00:51:29 +07:00
|
|
|
blk_schedule_flush_plug(current);
|
2015-02-13 11:49:17 +07:00
|
|
|
|
2006-07-14 14:24:37 +07:00
|
|
|
delayacct_blkio_start();
|
2015-02-13 11:49:17 +07:00
|
|
|
rq = raw_rq();
|
2005-04-17 05:20:36 +07:00
|
|
|
atomic_inc(&rq->nr_iowait);
|
|
|
|
ret = schedule_timeout(timeout);
|
2015-02-13 11:49:17 +07:00
|
|
|
current->in_iowait = old_iowait;
|
2005-04-17 05:20:36 +07:00
|
|
|
atomic_dec(&rq->nr_iowait);
|
2006-07-14 14:24:37 +07:00
|
|
|
delayacct_blkio_end();
|
2015-02-13 11:49:17 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return ret;
|
|
|
|
}
|
2015-02-13 11:49:17 +07:00
|
|
|
EXPORT_SYMBOL(io_schedule_timeout);
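The recursion guard described in the "sched: Prevent recursion in io_schedule()" commit message can be sketched in user space as follows. This is a minimal model, not the kernel code: `struct demo_task`, `demo_flush()` and `demo_io_wait()` are hypothetical names, and the two counters merely stand in for the inline flush (`blk_flush_plug()`) and the handed-off flush (`blk_schedule_flush_plug()`). It only illustrates the save, set-early, restore pattern on `in_iowait`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-task state (not task_struct). */
struct demo_task {
	bool in_iowait;       /* models current->in_iowait */
	int direct_flushes;   /* inline flushes, like blk_flush_plug() */
	int deferred_flushes; /* handed-off flushes, like blk_schedule_flush_plug() */
};

static void demo_io_wait(struct demo_task *t, int nest);

/* Flush pending I/O; an inline flush may re-enter the wait path. */
static void demo_flush(struct demo_task *t, bool recursing, int nest)
{
	if (recursing) {
		t->deferred_flushes++; /* hand the work off: cannot recurse further */
		return;
	}
	t->direct_flushes++;
	if (nest > 0)
		demo_io_wait(t, nest - 1); /* simulate an allocation that waits on I/O */
}

/* Models the guarded wait: save the flag, set it early, restore it on exit. */
static void demo_io_wait(struct demo_task *t, int nest)
{
	bool old_iowait;

	assert(t && nest >= 0);
	old_iowait = t->in_iowait;
	t->in_iowait = true;
	demo_flush(t, old_iowait, nest);
	/* the real function would sleep here */
	t->in_iowait = old_iowait;
}
```

Calling `demo_io_wait()` on a zero-initialised task with any positive `nest` performs one inline flush; the nested call sees the flag already set and defers instead, so the recursion depth is bounded at one extra level.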
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/**
|
|
|
|
* sys_sched_get_priority_max - return maximum RT priority.
|
|
|
|
* @policy: scheduling class.
|
|
|
|
*
|
2013-07-13 01:45:47 +07:00
|
|
|
* Return: On success, this syscall returns the maximum
|
|
|
|
* rt_priority that can be used by a given scheduling class.
|
|
|
|
* On failure, a negative error code is returned.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
switch (policy) {
|
|
|
|
case SCHED_FIFO:
|
|
|
|
case SCHED_RR:
|
|
|
|
ret = MAX_USER_RT_PRIO-1;
|
|
|
|
break;
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
The core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking whether a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with
the computational model sketched above can also effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
|
|
|
case SCHED_DEADLINE:
|
2005-04-17 05:20:36 +07:00
|
|
|
case SCHED_NORMAL:
|
2006-01-15 04:20:41 +07:00
|
|
|
case SCHED_BATCH:
|
2007-07-09 23:51:59 +07:00
|
|
|
case SCHED_IDLE:
|
2005-04-17 05:20:36 +07:00
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
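The SCHED_DEADLINE commit message quoted in the function above defines the absolute deadline as activation time plus relative deadline, and says EDF runs the task with the smallest absolute deadline first. A minimal sketch of just that selection rule is given here. `struct demo_dl_task`, `demo_abs_deadline()` and `demo_edf_pick()` are hypothetical names introduced for illustration; the CBS runtime enforcement is deliberately not modelled, and the kernel's real implementation is not a linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task record; time units are arbitrary. */
struct demo_dl_task {
	uint64_t activation;   /* time at which the current instance activated */
	uint64_t rel_deadline; /* the task's relative deadline */
};

/* Absolute deadline = activation instant + relative deadline. */
static uint64_t demo_abs_deadline(const struct demo_dl_task *t)
{
	return t->activation + t->rel_deadline;
}

/* Returns the index of the task with the earliest absolute deadline,
 * or -1 when the array is empty. Ties go to the lower index. */
static int demo_edf_pick(const struct demo_dl_task *tasks, size_t n)
{
	int best = -1;
	size_t i;

	assert(tasks || n == 0);
	for (i = 0; i < n; i++) {
		if (best < 0 ||
		    demo_abs_deadline(&tasks[i]) < demo_abs_deadline(&tasks[best]))
			best = (int)i;
	}
	return best;
}
```

With three tasks activated at times 0, 10 and 20 and relative deadlines 100, 50 and 30, the absolute deadlines are 100, 60 and 50, so the third task is picked even though it activated last.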
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sys_sched_get_priority_min - return minimum RT priority.
|
|
|
|
* @policy: scheduling class.
|
|
|
|
*
|
2013-07-13 01:45:47 +07:00
|
|
|
* Return: On success, this syscall returns the minimum
|
|
|
|
* rt_priority that can be used by a given scheduling class.
|
|
|
|
* On failure, a negative error code is returned.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:08 +07:00
|
|
|
SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
switch (policy) {
|
|
|
|
case SCHED_FIFO:
|
|
|
|
case SCHED_RR:
|
|
|
|
ret = 1;
|
|
|
|
break;
|
2013-11-28 17:14:43 +07:00
|
|
|
case SCHED_DEADLINE:
|
2005-04-17 05:20:36 +07:00
|
|
|
case SCHED_NORMAL:
|
2006-01-15 04:20:41 +07:00
|
|
|
case SCHED_BATCH:
|
2007-07-09 23:51:59 +07:00
|
|
|
case SCHED_IDLE:
|
2005-04-17 05:20:36 +07:00
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
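From user space, the two syscalls above are normally reached through the POSIX wrappers `sched_get_priority_min()` and `sched_get_priority_max()` declared in `<sched.h>`. The helper below is a small sketch of querying both ends of the range for one policy; `struct demo_prio_range` and `demo_query_prio_range()` are hypothetical names, and the exact values returned depend on the running kernel.

```c
#include <assert.h>
#include <sched.h>

/* Hypothetical container for the valid static priority range of a policy. */
struct demo_prio_range {
	int min;
	int max;
};

/* Fills *out and returns 0 on success; returns -1 if either query fails
 * (errno is then set by the failing wrapper). */
static int demo_query_prio_range(int policy, struct demo_prio_range *out)
{
	int lo, hi;

	assert(out);
	lo = sched_get_priority_min(policy);
	hi = sched_get_priority_max(policy);
	if (lo < 0 || hi < 0)
		return -1;
	out->min = lo;
	out->max = hi;
	return 0;
}
```

Going by the switch statements above, a kernel built from this source would be expected to report 1 to `MAX_USER_RT_PRIO-1` for `SCHED_FIFO` and `SCHED_RR`, and 0 to 0 for the non-realtime policies.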
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sys_sched_rr_get_interval - return the default timeslice of a process.
|
|
|
|
* @pid: pid of the process.
|
|
|
|
* @interval: userspace pointer to the timeslice value.
|
|
|
|
*
|
|
|
|
* This syscall writes the default timeslice value of a given process
|
|
|
|
* into the user-space timespec buffer. A value of '0' means infinity.
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: On success, 0 and the timeslice is in @interval. Otherwise,
|
|
|
|
* an error code.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-01-14 20:14:10 +07:00
|
|
|
SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
|
2009-01-14 20:14:09 +07:00
|
|
|
struct timespec __user *, interval)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *p;
|
2007-10-15 22:00:13 +07:00
|
|
|
unsigned int time_slice;
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
|
|
|
struct timespec t;
|
2009-12-09 15:32:03 +07:00
|
|
|
struct rq *rq;
|
2007-10-15 22:00:14 +07:00
|
|
|
int retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
if (pid < 0)
|
2007-10-15 22:00:14 +07:00
|
|
|
return -EINVAL;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
retval = -ESRCH;
|
2009-12-09 17:15:11 +07:00
|
|
|
rcu_read_lock();
|
2005-04-17 05:20:36 +07:00
|
|
|
p = find_process_by_pid(pid);
|
|
|
|
if (!p)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
retval = security_task_getscheduler(p);
|
|
|
|
if (retval)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = task_rq_lock(p, &rf);
|
2014-01-27 17:54:13 +07:00
|
|
|
time_slice = 0;
|
|
|
|
if (p->sched_class->get_rr_interval)
|
|
|
|
time_slice = p->sched_class->get_rr_interval(rq, p);
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2007-10-15 22:00:13 +07:00
|
|
|
|
2009-12-09 17:15:11 +07:00
|
|
|
rcu_read_unlock();
|
2007-10-15 22:00:13 +07:00
|
|
|
jiffies_to_timespec(time_slice, &t);
|
2005-04-17 05:20:36 +07:00
|
|
|
retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
|
|
|
|
return retval;
|
2007-10-15 22:00:14 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
out_unlock:
|
2009-12-09 17:15:11 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2008-05-13 02:20:41 +07:00
|
|
|
static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
|
2006-07-03 14:25:41 +07:00
|
|
|
|
2008-01-26 03:08:02 +07:00
|
|
|
void sched_show_task(struct task_struct *p)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
unsigned long free = 0;
|
2012-11-08 04:35:32 +07:00
|
|
|
int ppid;
|
2014-12-05 19:22:22 +07:00
|
|
|
unsigned long state = p->state;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-12-05 19:22:22 +07:00
|
|
|
if (state)
|
|
|
|
state = __ffs(state) + 1;
|
2010-11-20 09:08:51 +07:00
|
|
|
printk(KERN_INFO "%-15.15s %c", p->comm,
|
2006-07-10 18:43:52 +07:00
|
|
|
state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?');
|
2007-07-12 02:21:47 +07:00
|
|
|
#if BITS_PER_LONG == 32
|
2005-04-17 05:20:36 +07:00
|
|
|
if (state == TASK_RUNNING)
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT " running ");
|
2005-04-17 05:20:36 +07:00
|
|
|
else
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT " %08lx ", thread_saved_pc(p));
|
2005-04-17 05:20:36 +07:00
|
|
|
#else
|
|
|
|
if (state == TASK_RUNNING)
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT " running task ");
|
2005-04-17 05:20:36 +07:00
|
|
|
else
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT " %016lx ", thread_saved_pc(p));
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_DEBUG_STACK_USAGE
|
2008-04-23 04:38:23 +07:00
|
|
|
free = stack_not_used(p);
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
2014-12-11 06:45:21 +07:00
|
|
|
ppid = 0;
|
2012-11-08 04:35:32 +07:00
|
|
|
rcu_read_lock();
|
2014-12-11 06:45:21 +07:00
|
|
|
if (pid_alive(p))
|
|
|
|
ppid = task_pid_nr(rcu_dereference(p->real_parent));
|
2012-11-08 04:35:32 +07:00
|
|
|
rcu_read_unlock();
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free,
|
2012-11-08 04:35:32 +07:00
|
|
|
task_pid_nr(p), ppid,
|
2009-05-04 15:38:05 +07:00
|
|
|
(unsigned long)task_thread_info(p)->flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2013-05-01 05:27:22 +07:00
|
|
|
print_worker_info(KERN_INFO, p);
|
2008-01-26 03:08:34 +07:00
|
|
|
show_stack(p, NULL);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
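The state letter printed by sched_show_task() comes from converting the state bitmask to an index: 0 stays 0, otherwise the index is the position of the lowest set bit plus one. The sketch below reproduces that mapping in portable C. `demo_state_char()` is a hypothetical name, and the string `"RSDTtXZxKWPNn"` is an assumption about what `TASK_STATE_TO_CHAR_STR` expands to; the real value depends on the kernel version.

```c
#include <assert.h>

/* Maps a task-state bitmask to a one-letter code.
 * Assumption: the letter table below mirrors TASK_STATE_TO_CHAR_STR. */
static char demo_state_char(unsigned long state)
{
	static const char names[] = "RSDTtXZxKWPNn";
	unsigned long idx = 0;

	if (state) {
		/* find the lowest set bit, as __ffs() does in the kernel */
		while (!(state & 1UL)) {
			state >>= 1;
			idx++;
		}
		idx++; /* index 0 is reserved for "no bits set" (running) */
	}
	/* out-of-range indexes print '?', same as the bounds check above */
	return idx < sizeof(names) - 1 ? names[idx] : '?';
}
```

Only the lowest set bit matters, so a mask with several bits set maps to the letter of its least significant bit.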
|
|
|
|
|
2006-12-07 11:35:59 +07:00
|
|
|
void show_state_filter(unsigned long state_filter)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *g, *p;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-12 02:21:47 +07:00
|
|
|
#if BITS_PER_LONG == 32
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_INFO
|
|
|
|
" task PC stack pid father\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
#else
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_INFO
|
|
|
|
" task PC stack pid father\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
2011-07-18 01:47:54 +07:00
|
|
|
rcu_read_lock();
|
2014-08-14 02:19:53 +07:00
|
|
|
for_each_process_thread(g, p) {
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* reset the NMI-timeout, listing all tasks on a slow
|
2011-03-31 08:57:33 +07:00
|
|
|
* console might take a lot of time:
|
2016-06-09 19:20:05 +07:00
|
|
|
* Also, reset softlockup watchdogs on all CPUs, because
|
|
|
|
* another CPU might be blocked waiting for us to process
|
|
|
|
* an IPI.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
touch_nmi_watchdog();
|
2016-06-09 19:20:05 +07:00
|
|
|
touch_all_softlockup_watchdogs();
|
2007-04-26 10:50:03 +07:00
|
|
|
if (!state_filter || (p->state & state_filter))
|
2008-01-26 03:08:02 +07:00
|
|
|
sched_show_task(p);
|
2014-08-14 02:19:53 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
2016-04-04 20:42:02 +07:00
|
|
|
if (!state_filter)
|
|
|
|
sysrq_sched_debug_show();
|
2007-07-09 23:51:59 +07:00
|
|
|
#endif
|
2011-07-18 01:47:54 +07:00
|
|
|
rcu_read_unlock();
|
2006-12-07 11:35:59 +07:00
|
|
|
/*
|
|
|
|
* Only show locks if all tasks are dumped:
|
|
|
|
*/
|
2009-11-25 20:23:41 +07:00
|
|
|
if (!state_filter)
|
2006-12-07 11:35:59 +07:00
|
|
|
debug_show_all_locks();
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2013-06-20 01:53:51 +07:00
|
|
|
void init_idle_bootup_task(struct task_struct *idle)
|
2007-07-09 23:51:58 +07:00
|
|
|
{
|
2007-07-09 23:51:59 +07:00
|
|
|
idle->sched_class = &idle_sched_class;
|
2007-07-09 23:51:58 +07:00
|
|
|
}
|
|
|
|
|
2005-06-28 21:40:42 +07:00
|
|
|
/**
|
|
|
|
* init_idle - set up an idle thread for a given CPU
|
|
|
|
* @idle: task in question
|
|
|
|
* @cpu: cpu the idle task belongs to
|
|
|
|
*
|
|
|
|
* NOTE: this function does not set the idle thread's NEED_RESCHED
|
|
|
|
* flag, to make booting more robust.
|
|
|
|
*/
|
2013-06-20 01:53:51 +07:00
|
|
|
void init_idle(struct task_struct *idle, int cpu)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:42 +07:00
|
|
|
struct rq *rq = cpu_rq(cpu);
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long flags;
|
|
|
|
|
2015-05-15 22:43:34 +07:00
|
|
|
raw_spin_lock_irqsave(&idle->pi_lock, flags);
|
|
|
|
raw_spin_lock(&rq->lock);
|
2008-11-13 02:05:50 +07:00
|
|
|
|
2013-10-07 17:29:26 +07:00
|
|
|
__sched_fork(0, idle);
|
2009-12-17 00:04:35 +07:00
|
|
|
idle->state = TASK_RUNNING;
|
2007-07-09 23:51:59 +07:00
|
|
|
idle->se.exec_start = sched_clock();
|
|
|
|
|
2016-03-10 05:08:18 +07:00
|
|
|
kasan_unpoison_task_stack(idle);
|
|
|
|
|
2015-08-14 04:09:29 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
/*
|
|
|
|
* It's possible that init_idle() gets called multiple times on a task,
|
|
|
|
* in that case do_set_cpus_allowed() will not do the right thing.
|
|
|
|
*
|
|
|
|
* And since this is boot we can forgo the serialization.
|
|
|
|
*/
|
|
|
|
set_cpus_allowed_common(idle, cpumask_of(cpu));
|
|
|
|
#endif
|
2010-09-16 22:50:31 +07:00
|
|
|
/*
|
|
|
|
* We're having a chicken and egg problem, even though we are
|
|
|
|
* holding rq->lock, the cpu isn't yet set to this cpu so the
|
|
|
|
* lockdep check in task_group() will fail.
|
|
|
|
*
|
|
|
|
* Similar case to sched_fork(). / Alternatively we could
|
|
|
|
* use task_rq_lock() here and obtain the other rq->lock.
|
|
|
|
*
|
|
|
|
* Silence PROVE_RCU
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
2007-07-09 23:51:59 +07:00
|
|
|
__set_task_cpu(idle, cpu);
|
2010-09-16 22:50:31 +07:00
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
rq->curr = rq->idle = idle;
|
2014-08-20 16:47:32 +07:00
|
|
|
idle->on_rq = TASK_ON_RQ_QUEUED;
|
2015-08-14 04:09:29 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2011-04-05 22:23:40 +07:00
|
|
|
idle->on_cpu = 1;
|
2005-06-26 04:57:23 +07:00
|
|
|
#endif
|
2015-05-15 22:43:34 +07:00
|
|
|
raw_spin_unlock(&rq->lock);
|
|
|
|
raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/* Set the preempt count _outside_ the spinlocks! */
|
2013-08-14 19:55:46 +07:00
|
|
|
init_idle_preempt_count(idle, cpu);
|
2008-08-04 13:54:26 +07:00
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
|
|
|
* The idle tasks have their own, simple scheduling class:
|
|
|
|
*/
|
|
|
|
idle->sched_class = &idle_sched_class;
|
ftrace: Fix memory leak with function graph and cpu hotplug
When the function graph tracer starts, it needs to make a special
stack for each task to save the real return values of the tasks.
All running tasks have this stack created, as well as any new
tasks.
On CPU hot plug, the new idle task will allocate a stack as well
when init_idle() is called. The problem is that cpu hotplug does
not create a new idle_task. Instead it uses the idle task that
existed when the cpu went down.
ftrace_graph_init_task() will add a new ret_stack to the task
that is given to it. Because a clone will make the task
have a stack of its parent it does not check if the task's
ret_stack is already NULL or not. When the CPU hotplug code
starts a CPU up again, it will allocate a new stack even
though one already existed for it.
The solution is to treat the idle_task specially. In fact, the
function_graph code already does, just not at init_idle().
Instead of using ftrace_graph_init_task() for the idle task,
which expects the task to be a clone, have a
separate ftrace_graph_init_idle_task(). Also, we will create a
per_cpu ret_stack that is used by the idle task. When we call
ftrace_graph_init_idle_task() it will check if the idle task's
ret_stack is NULL, if it is, then it will assign it the per_cpu
ret_stack.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-11 09:26:13 +07:00
|
|
|
ftrace_graph_init_idle_task(idle, cpu);
|
2013-05-16 03:16:32 +07:00
|
|
|
vtime_init_idle(idle, cpu);
|
2015-08-14 04:09:29 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2011-10-27 04:14:16 +07:00
|
|
|
sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
|
|
|
|
#endif
|
2007-11-10 04:39:38 +07:00
|
|
|
}
|
|
|
|
|
2014-10-07 15:52:11 +07:00
|
|
|
int cpuset_cpumask_can_shrink(const struct cpumask *cur,
|
|
|
|
const struct cpumask *trial)
|
|
|
|
{
|
|
|
|
int ret = 1, trial_cpus;
|
|
|
|
struct dl_bw *cur_dl_b;
|
|
|
|
unsigned long flags;
|
|
|
|
|
2015-01-28 10:53:55 +07:00
|
|
|
if (!cpumask_weight(cur))
|
|
|
|
return ret;
|
|
|
|
|
2014-10-28 18:54:46 +07:00
|
|
|
rcu_read_lock_sched();
|
2014-10-07 15:52:11 +07:00
|
|
|
cur_dl_b = dl_bw_of(cpumask_any(cur));
|
|
|
|
trial_cpus = cpumask_weight(trial);
|
|
|
|
|
|
|
|
raw_spin_lock_irqsave(&cur_dl_b->lock, flags);
|
|
|
|
if (cur_dl_b->bw != -1 &&
|
|
|
|
cur_dl_b->bw * trial_cpus < cur_dl_b->total_bw)
|
|
|
|
ret = 0;
|
|
|
|
raw_spin_unlock_irqrestore(&cur_dl_b->lock, flags);
|
2014-10-28 18:54:46 +07:00
|
|
|
rcu_read_unlock_sched();
|
2014-10-07 15:52:11 +07:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
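The admission test in cpuset_cpumask_can_shrink() reduces to one comparison: the per-CPU bandwidth limit multiplied by the number of CPUs in the trial mask must still cover the bandwidth already allocated. A stand-alone sketch of that arithmetic follows. `demo_can_shrink()` and `DEMO_BW_UNLIMITED` are hypothetical names, the units are arbitrary, and locking and overflow handling are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Models the "bw == -1" case, meaning no bandwidth limit is enforced. */
#define DEMO_BW_UNLIMITED UINT64_MAX

/* Returns 1 if the CPU set may shrink to trial_cpus CPUs, 0 otherwise. */
static int demo_can_shrink(uint64_t bw_per_cpu, uint64_t total_bw,
			   unsigned int trial_cpus)
{
	if (bw_per_cpu == DEMO_BW_UNLIMITED)
		return 1;
	/* remaining capacity must cover what is already allocated */
	return bw_per_cpu * trial_cpus >= total_bw;
}
```

For instance, with a limit of 950 units per CPU and 1500 units already allocated, shrinking to two CPUs leaves 1900 units and is allowed, while shrinking to one CPU leaves 950 and is refused.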
|
|
|
|
|
2014-09-19 16:22:40 +07:00
|
|
|
int task_can_attach(struct task_struct *p,
|
|
|
|
const struct cpumask *cs_cpus_allowed)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Kthreads which disallow setaffinity shouldn't be moved
|
|
|
|
* to a new cpuset; we don't want to change their cpu
|
|
|
|
* affinity and isolating such threads by their set of
|
|
|
|
* allowed nodes is unnecessary. Thus, cpusets are not
|
|
|
|
* applicable for such threads. This prevents checking for
|
|
|
|
* success of set_cpus_allowed_ptr() on all attached tasks
|
|
|
|
* before cpus_allowed may be changed.
|
|
|
|
*/
|
|
|
|
if (p->flags & PF_NO_SETAFFINITY) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span,
|
|
|
|
cs_cpus_allowed)) {
|
|
|
|
unsigned int dest_cpu = cpumask_any_and(cpu_active_mask,
|
|
|
|
cs_cpus_allowed);
|
2014-10-28 18:54:46 +07:00
|
|
|
struct dl_bw *dl_b;
|
2014-09-19 16:22:40 +07:00
|
|
|
bool overflow;
|
|
|
|
int cpus;
|
|
|
|
unsigned long flags;
|
|
|
|
|
2014-10-28 18:54:46 +07:00
|
|
|
rcu_read_lock_sched();
|
|
|
|
dl_b = dl_bw_of(dest_cpu);
|
2014-09-19 16:22:40 +07:00
|
|
|
raw_spin_lock_irqsave(&dl_b->lock, flags);
|
|
|
|
cpus = dl_bw_cpus(dest_cpu);
|
|
|
|
overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
|
|
|
|
if (overflow)
|
|
|
|
ret = -EBUSY;
|
|
|
|
else {
|
|
|
|
/*
|
|
|
|
* We reserve space for this task in the destination
|
|
|
|
* root_domain, as we can't fail after this point.
|
|
|
|
* We will free resources in the source root_domain
|
|
|
|
* later on (see set_cpus_allowed_dl()).
|
|
|
|
*/
|
|
|
|
__dl_add(dl_b, p->dl.dl_bw);
|
|
|
|
}
|
|
|
|
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
|
2014-10-28 18:54:46 +07:00
|
|
|
rcu_read_unlock_sched();
|
2014-09-19 16:22:40 +07:00
|
|
|
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
|
2016-03-10 18:54:10 +07:00
|
|
|
static bool sched_smp_initialized __read_mostly;
|
|
|
|
|
2013-10-07 17:29:02 +07:00
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
|
|
|
/* Migrate current task p to target_cpu */
|
|
|
|
int migrate_task_to(struct task_struct *p, int target_cpu)
|
|
|
|
{
|
|
|
|
struct migration_arg arg = { p, target_cpu };
|
|
|
|
int curr_cpu = task_cpu(p);
|
|
|
|
|
|
|
|
if (curr_cpu == target_cpu)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* TODO: This is not properly updating schedstats */
|
|
|
|
|
2014-01-22 06:51:03 +07:00
|
|
|
trace_sched_move_numa(p, curr_cpu, target_cpu);
|
2013-10-07 17:29:02 +07:00
|
|
|
return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
|
|
|
|
}
|
2013-10-07 17:29:33 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Requeue a task on a given node and accurately track the number of NUMA
|
|
|
|
* tasks on the runqueues
|
|
|
|
*/
|
|
|
|
void sched_setnuma(struct task_struct *p, int nid)
|
|
|
|
{
|
2014-08-20 16:47:32 +07:00
|
|
|
bool queued, running;
|
2015-08-01 02:28:18 +07:00
|
|
|
struct rq_flags rf;
|
|
|
|
struct rq *rq;
|
2013-10-07 17:29:33 +07:00
|
|
|
|
2015-08-01 02:28:18 +07:00
|
|
|
rq = task_rq_lock(p, &rf);
|
2014-08-20 16:47:32 +07:00
|
|
|
queued = task_on_rq_queued(p);
|
2013-10-07 17:29:33 +07:00
|
|
|
running = task_current(rq, p);
|
|
|
|
|
2014-08-20 16:47:32 +07:00
|
|
|
if (queued)
|
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
Mike Meyer reported the following bug:
> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected. Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
> a. thread is running.
> b. thread moves between cgroups, changes scheduling class or priority.
> c. thread sleeps OR
> d. thread involuntarily gives up cpu.
>
> a. implies:
>
> thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
> 1. dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
> delta = 0
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> 2. enqueue_task(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* thread is still on cpu at this point. */
> thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
> dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
>
> /* delta is execution time not run_delay. */
> delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> Since thread was running between enqueue_task(rq, thread) and
> dequeue_task(rq, thread), the delta above is really execution
> time and not run_delay.
>
> d. results in:
>
> __sched_info_switch(thread, next_thread)
>
> sched_info_depart(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* last_queued not updated due to being non-zero */
> return
>
> Since thread was running between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread), the execution time
> between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread) now will become
> associated with run_delay due to when last_queued was last updated.
>
This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.
By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.
It even shrinks the code (due to inlining {en,de}queue_task()):
$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
text data bss dec hex filename
64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o
64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig
Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-30 22:44:13 +07:00
|
|
|
dequeue_task(rq, p, DEQUEUE_SAVE);
|
2013-10-07 17:29:33 +07:00
|
|
|
if (running)
|
2014-09-12 20:41:40 +07:00
|
|
|
put_prev_task(rq, p);
|
2013-10-07 17:29:33 +07:00
|
|
|
|
|
|
|
p->numa_preferred_nid = nid;
|
|
|
|
|
|
|
|
if (running)
|
|
|
|
p->sched_class->set_curr_task(rq);
|
2014-08-20 16:47:32 +07:00
|
|
|
if (queued)
|
2015-09-30 22:44:13 +07:00
|
|
|
enqueue_task(rq, p, ENQUEUE_RESTORE);
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, p, &rf);
|
2013-10-07 17:29:33 +07:00
|
|
|
}
|
2015-06-11 19:46:50 +07:00
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
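The run_delay leak described in the commit message above, and the effect of passing DEQUEUE_SAVE / ENQUEUE_RESTORE, can be replayed in a small user-space model. This is only a sketch: the `toy_*` names, the single `TOY_REQUEUE` flag (standing in for both kernel flags), and the clock values are assumptions made for illustration, not kernel code.

```c
#include <stdint.h>

/* Toy model of the two sched_info fields discussed above (not the kernel struct). */
struct toy_sched_info {
	uint64_t last_queued;	/* clock when the task began waiting, 0 if not waiting */
	uint64_t run_delay;	/* total time spent waiting for a CPU */
};

/* Hypothetical flag modelling both DEQUEUE_SAVE and ENQUEUE_RESTORE. */
#define TOY_REQUEUE	0x1

static void toy_dequeue(struct toy_sched_info *si, uint64_t now, int flags)
{
	if (flags & TOY_REQUEUE)
		return;		/* temporary dequeue: keep accounting state */
	if (si->last_queued)
		si->run_delay += now - si->last_queued;
	si->last_queued = 0;
}

static void toy_enqueue(struct toy_sched_info *si, uint64_t now, int flags)
{
	if (flags & TOY_REQUEUE)
		return;		/* matching re-enqueue: keep accounting state */
	if (!si->last_queued)
		si->last_queued = now;
}

/*
 * Replay steps a-c: a running task is requeued at t=100 (step b) and
 * goes to sleep at t=150 (step c). Returns the accumulated run_delay.
 */
static uint64_t toy_replay(int requeue_flags)
{
	struct toy_sched_info si = { .last_queued = 0, .run_delay = 0 };

	toy_dequeue(&si, 100, requeue_flags);	/* step b, dequeue half */
	toy_enqueue(&si, 100, requeue_flags);	/* step b, enqueue half */
	toy_dequeue(&si, 150, 0);		/* step c: the task sleeps */
	return si.run_delay;
}
```

With `toy_replay(0)` the 50 ticks of execution time are booked as run_delay; with `toy_replay(TOY_REQUEUE)` the accounting state is untouched by the requeue and run_delay stays at zero.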
|
2007-10-17 13:30:56 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#ifdef CONFIG_HOTPLUG_CPU
|
2006-12-10 17:20:11 +07:00
|
|
|
/*
|
2010-11-14 01:32:29 +07:00
|
|
|
* Ensures that the idle task is using init_mm right before its cpu goes
|
|
|
|
* offline.
|
2006-12-10 17:20:11 +07:00
|
|
|
*/
|
2010-11-14 01:32:29 +07:00
|
|
|
void idle_task_exit(void)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2010-11-14 01:32:29 +07:00
|
|
|
struct mm_struct *mm = current->active_mm;
|
2008-11-24 23:05:11 +07:00
|
|
|
|
2010-11-14 01:32:29 +07:00
|
|
|
BUG_ON(cpu_online(smp_processor_id()));
|
2008-11-24 23:05:11 +07:00
|
|
|
|
2012-10-26 22:17:44 +07:00
|
|
|
if (mm != &init_mm) {
|
2016-04-26 23:39:06 +07:00
|
|
|
switch_mm_irqs_off(mm, &init_mm, current);
|
2012-10-26 22:17:44 +07:00
|
|
|
finish_arch_post_lock_switch();
|
|
|
|
}
|
2010-11-14 01:32:29 +07:00
|
|
|
mmdrop(mm);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2012-08-20 16:26:57 +07:00
|
|
|
* Since this CPU is going 'away' for a while, fold any nr_active delta
|
|
|
|
* we might have. Assumes we're called after migrate_tasks() so that the
|
2016-07-12 23:33:56 +07:00
|
|
|
* nr_active count is stable. We need to take the teardown thread which
|
|
|
|
* is calling this into account, so we hand in adjust = 1 to the load
|
|
|
|
* calculation.
|
2012-08-20 16:26:57 +07:00
|
|
|
*
|
|
|
|
* Also see the comment "Global load-average calculations".
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2012-08-20 16:26:57 +07:00
|
|
|
static void calc_load_migrate(struct rq *rq)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2016-07-12 23:33:56 +07:00
|
|
|
long delta = calc_load_fold_active(rq, 1);
|
2012-08-20 16:26:57 +07:00
|
|
|
if (delta)
|
|
|
|
atomic_long_add(delta, &calc_load_tasks);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2014-02-12 16:49:30 +07:00
|
|
|
static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct sched_class fake_sched_class = {
|
|
|
|
.put_prev_task = put_prev_task_fake,
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct task_struct fake_task = {
|
|
|
|
/*
|
|
|
|
* Avoid pull_{rt,dl}_task()
|
|
|
|
*/
|
|
|
|
.prio = MAX_PRIO + 1,
|
|
|
|
.sched_class = &fake_sched_class,
|
|
|
|
};
|
|
|
|
|
2006-07-03 14:25:40 +07:00
|
|
|
/*
|
2010-11-14 01:32:29 +07:00
|
|
|
* Migrate all tasks from the rq, sleeping tasks will be migrated by
|
|
|
|
* try_to_wake_up()->select_task_rq().
|
|
|
|
*
|
|
|
|
 * Called with rq->lock held even though we're in stop_machine() and
|
|
|
|
* there's no concurrency possible, we hold the required locks anyway
|
|
|
|
* because of lock validation efforts.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2015-06-11 19:46:51 +07:00
|
|
|
static void migrate_tasks(struct rq *dead_rq)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2015-06-11 19:46:51 +07:00
|
|
|
struct rq *rq = dead_rq;
|
2010-11-14 01:32:29 +07:00
|
|
|
struct task_struct *next, *stop = rq->stop;
|
2015-08-02 00:25:08 +07:00
|
|
|
struct pin_cookie cookie;
|
2010-11-14 01:32:29 +07:00
|
|
|
int dest_cpu;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
2010-11-14 01:32:29 +07:00
|
|
|
* Fudge the rq selection such that the below task selection loop
|
|
|
|
* doesn't get stuck on the currently eligible stop task.
|
|
|
|
*
|
|
|
|
* We're currently inside stop_machine() and the rq is either stuck
|
|
|
|
* in the stop_machine_cpu_stop() loop, or we're executing this code,
|
|
|
|
* either way we should never end up calling schedule() until we're
|
|
|
|
* done here.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2010-11-14 01:32:29 +07:00
|
|
|
rq->stop = NULL;
|
2006-07-03 14:25:40 +07:00
|
|
|
|
2013-04-12 06:50:58 +07:00
|
|
|
/*
|
|
|
|
* put_prev_task() and pick_next_task() sched
|
|
|
|
 * class methods both need to have an up-to-date
|
|
|
|
* value of rq->clock[_task]
|
|
|
|
*/
|
|
|
|
update_rq_clock(rq);
|
|
|
|
|
2015-06-11 19:46:51 +07:00
|
|
|
for (;;) {
|
2010-11-14 01:32:29 +07:00
|
|
|
/*
|
|
|
|
* There's this thread running, bail when that's the only
|
|
|
|
* remaining thread.
|
|
|
|
*/
|
|
|
|
if (rq->nr_running == 1)
|
2007-07-09 23:51:59 +07:00
|
|
|
break;
|
2010-11-14 01:32:29 +07:00
|
|
|
|
2015-06-11 19:46:54 +07:00
|
|
|
/*
|
2015-08-28 13:55:56 +07:00
|
|
|
* pick_next_task assumes pinned rq->lock.
|
2015-06-11 19:46:54 +07:00
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
cookie = lockdep_pin_lock(&rq->lock);
|
|
|
|
next = pick_next_task(rq, &fake_task, cookie);
|
2010-11-14 01:32:29 +07:00
|
|
|
BUG_ON(!next);
|
2008-06-29 05:16:56 +07:00
|
|
|
next->sched_class->put_prev_task(rq, next);
|
2007-07-26 18:40:43 +07:00
|
|
|
|
2015-08-28 13:55:56 +07:00
|
|
|
/*
|
|
|
|
* Rules for changing task_struct::cpus_allowed are holding
|
|
|
|
* both pi_lock and rq->lock, such that holding either
|
|
|
|
* stabilizes the mask.
|
|
|
|
*
|
|
|
|
 * Dropping rq->lock is not quite as disastrous as it usually is
|
|
|
|
* because !cpu_active at this point, which means load-balance
|
|
|
|
* will not interfere. Also, stop-machine.
|
|
|
|
*/
|
2015-08-02 00:25:08 +07:00
|
|
|
lockdep_unpin_lock(&rq->lock, cookie);
|
2015-08-28 13:55:56 +07:00
|
|
|
raw_spin_unlock(&rq->lock);
|
|
|
|
raw_spin_lock(&next->pi_lock);
|
|
|
|
raw_spin_lock(&rq->lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since we're inside stop-machine, _nothing_ should have
|
|
|
|
* changed the task, WARN if weird stuff happened, because in
|
|
|
|
* that case the above rq->lock drop is a fail too.
|
|
|
|
*/
|
|
|
|
if (WARN_ON(task_rq(next) != rq || !task_on_rq_queued(next))) {
|
|
|
|
raw_spin_unlock(&next->pi_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2010-11-14 01:32:29 +07:00
|
|
|
/* Find suitable destination for @next, with force if needed. */
|
2015-06-11 19:46:51 +07:00
|
|
|
dest_cpu = select_fallback_rq(dead_rq->cpu, next);
|
2010-11-14 01:32:29 +07:00
|
|
|
|
2015-06-11 19:46:51 +07:00
|
|
|
rq = __migrate_task(rq, next, dest_cpu);
|
|
|
|
if (rq != dead_rq) {
|
|
|
|
raw_spin_unlock(&rq->lock);
|
|
|
|
rq = dead_rq;
|
|
|
|
raw_spin_lock(&rq->lock);
|
|
|
|
}
|
2015-08-28 13:55:56 +07:00
|
|
|
raw_spin_unlock(&next->pi_lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2009-04-11 15:43:41 +07:00
|
|
|
|
2010-11-14 01:32:29 +07:00
|
|
|
rq->stop = stop;
|
2009-04-11 15:43:41 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif /* CONFIG_HOTPLUG_CPU */
|
|
|
|
|
2008-06-05 02:04:05 +07:00
|
|
|
static void set_rq_online(struct rq *rq)
|
|
|
|
{
|
|
|
|
if (!rq->online) {
|
|
|
|
const struct sched_class *class;
|
|
|
|
|
2008-11-24 23:05:05 +07:00
|
|
|
cpumask_set_cpu(rq->cpu, rq->rd->online);
|
2008-06-05 02:04:05 +07:00
|
|
|
rq->online = 1;
|
|
|
|
|
|
|
|
for_each_class(class) {
|
|
|
|
if (class->rq_online)
|
|
|
|
class->rq_online(rq);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void set_rq_offline(struct rq *rq)
|
|
|
|
{
|
|
|
|
if (rq->online) {
|
|
|
|
const struct sched_class *class;
|
|
|
|
|
|
|
|
for_each_class(class) {
|
|
|
|
if (class->rq_offline)
|
|
|
|
class->rq_offline(rq);
|
|
|
|
}
|
|
|
|
|
2008-11-24 23:05:05 +07:00
|
|
|
cpumask_clear_cpu(rq->cpu, rq->rd->online);
|
2008-06-05 02:04:05 +07:00
|
|
|
rq->online = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-03-10 18:54:09 +07:00
|
|
|
static void set_cpu_rq_start_time(unsigned int cpu)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2010-05-06 23:49:21 +07:00
|
|
|
struct rq *rq = cpu_rq(cpu);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-05-09 01:47:39 +07:00
|
|
|
rq->age_stamp = sched_clock_cpu(cpu);
|
|
|
|
}
|
|
|
|
|
2011-04-07 19:09:58 +07:00
|
|
|
static cpumask_var_t sched_domains_tmpmask; /* sched_domains_mutex */
|
|
|
|
|
2007-10-15 22:00:13 +07:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
2007-10-24 23:23:48 +07:00
|
|
|
|
2012-06-01 02:20:16 +07:00
|
|
|
static __read_mostly int sched_debug_enabled;
|
2009-11-18 07:22:15 +07:00
|
|
|
|
2012-06-01 02:20:16 +07:00
|
|
|
static int __init sched_debug_setup(char *str)
|
2009-11-18 07:22:15 +07:00
|
|
|
{
|
2012-06-01 02:20:16 +07:00
|
|
|
sched_debug_enabled = 1;
|
2009-11-18 07:22:15 +07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2012-06-01 02:20:16 +07:00
|
|
|
early_param("sched_debug", sched_debug_setup);
|
|
|
|
|
|
|
|
static inline bool sched_debug(void)
|
|
|
|
{
|
|
|
|
return sched_debug_enabled;
|
|
|
|
}
|
2009-11-18 07:22:15 +07:00
|
|
|
|
2008-04-05 08:11:11 +07:00
|
|
|
static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
|
2008-11-24 23:05:14 +07:00
|
|
|
struct cpumask *groupmask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2007-10-24 23:23:48 +07:00
|
|
|
struct sched_group *group = sd->groups;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-11-24 23:05:14 +07:00
|
|
|
cpumask_clear(groupmask);
|
2007-10-24 23:23:48 +07:00
|
|
|
|
|
|
|
printk(KERN_DEBUG "%*s domain %d: ", level, "", level);
|
|
|
|
|
|
|
|
if (!(sd->flags & SD_LOAD_BALANCE)) {
|
2009-12-20 20:23:57 +07:00
|
|
|
printk("does not load-balance\n");
|
2007-10-24 23:23:48 +07:00
|
|
|
if (sd->parent)
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain"
|
|
|
|
" has parent");
|
2007-10-24 23:23:48 +07:00
|
|
|
return -1;
|
2005-06-26 04:57:24 +07:00
|
|
|
}
|
|
|
|
|
2015-02-14 05:37:28 +07:00
|
|
|
printk(KERN_CONT "span %*pbl level %s\n",
|
|
|
|
cpumask_pr_args(sched_domain_span(sd)), sd->name);
|
2007-10-24 23:23:48 +07:00
|
|
|
|
2008-11-24 23:05:04 +07:00
|
|
|
if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) {
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_ERR "ERROR: domain->span does not contain "
|
|
|
|
"CPU%d\n", cpu);
|
2007-10-24 23:23:48 +07:00
|
|
|
}
|
2008-11-24 23:05:04 +07:00
|
|
|
if (!cpumask_test_cpu(cpu, sched_group_cpus(group))) {
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_ERR "ERROR: domain->groups does not contain"
|
|
|
|
" CPU%d\n", cpu);
|
2007-10-24 23:23:48 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-24 23:23:48 +07:00
|
|
|
printk(KERN_DEBUG "%*s groups:", level + 1, "");
|
2005-04-17 05:20:36 +07:00
|
|
|
do {
|
2007-10-24 23:23:48 +07:00
|
|
|
if (!group) {
|
2009-12-20 20:23:57 +07:00
|
|
|
printk("\n");
|
|
|
|
printk(KERN_ERR "ERROR: group is NULL\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2008-11-24 23:05:04 +07:00
|
|
|
if (!cpumask_weight(sched_group_cpus(group))) {
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT "\n");
|
|
|
|
printk(KERN_ERR "ERROR: empty group\n");
|
2007-10-24 23:23:48 +07:00
|
|
|
break;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
if (!(sd->flags & SD_OVERLAP) &&
|
|
|
|
cpumask_intersects(groupmask, sched_group_cpus(group))) {
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT "\n");
|
|
|
|
printk(KERN_ERR "ERROR: repeated CPUs\n");
|
2007-10-24 23:23:48 +07:00
|
|
|
break;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-11-24 23:05:04 +07:00
|
|
|
cpumask_or(groupmask, groupmask, sched_group_cpus(group));
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-02-14 05:37:28 +07:00
|
|
|
printk(KERN_CONT " %*pbl",
|
|
|
|
cpumask_pr_args(sched_group_cpus(group)));
|
2014-05-27 05:19:39 +07:00
|
|
|
if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
|
2014-05-27 05:19:37 +07:00
|
|
|
printk(KERN_CONT " (cpu_capacity = %d)",
|
|
|
|
group->sgc->capacity);
|
2009-04-14 10:39:36 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-24 23:23:48 +07:00
|
|
|
group = group->next;
|
|
|
|
} while (group != sd->groups);
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_CONT "\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-11-24 23:05:04 +07:00
|
|
|
if (!cpumask_equal(sched_domain_span(sd), groupmask))
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_ERR "ERROR: groups don't span domain->span\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-11-24 23:05:04 +07:00
|
|
|
if (sd->parent &&
|
|
|
|
!cpumask_subset(groupmask, sched_domain_span(sd->parent)))
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_ERR "ERROR: parent span is not a superset "
|
|
|
|
"of domain->span\n");
|
2007-10-24 23:23:48 +07:00
|
|
|
return 0;
|
|
|
|
}
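The group checks performed by sched_domain_debug_one() above (no NULL or empty group, no CPU listed twice, groups together covering the domain span) can be sketched in user space with plain bitmasks. The `toy_group` type and the 64-bit mask are assumptions for illustration, and the sketch always applies the overlap check, whereas the original skips it for SD_OVERLAP domains.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sched_group: a CPU bitmask in a circular list. */
struct toy_group {
	uint64_t cpus;
	struct toy_group *next;
};

/*
 * Returns true when every group is non-empty, no CPU appears in two
 * groups, and the union of all groups equals 'span'.
 */
static bool toy_groups_valid(const struct toy_group *first, uint64_t span)
{
	const struct toy_group *g = first;
	uint64_t seen = 0;

	do {
		if (!g || !g->cpus)
			return false;	/* NULL or empty group */
		if (seen & g->cpus)
			return false;	/* repeated CPUs */
		seen |= g->cpus;
		g = g->next;
	} while (g != first);

	return seen == span;		/* groups must span the domain */
}
```

The do/while shape mirrors the original loop: the list is circular, so the walk ends when it arrives back at the first group.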
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-24 23:23:48 +07:00
|
|
|
static void sched_domain_debug(struct sched_domain *sd, int cpu)
|
|
|
|
{
|
|
|
|
int level = 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-06-01 02:20:16 +07:00
|
|
|
if (!sched_debug_enabled)
|
2009-11-18 07:22:15 +07:00
|
|
|
return;
|
|
|
|
|
2007-10-24 23:23:48 +07:00
|
|
|
if (!sd) {
|
|
|
|
printk(KERN_DEBUG "CPU%d attaching NULL sched-domain.\n", cpu);
|
|
|
|
return;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-24 23:23:48 +07:00
|
|
|
printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu);
|
|
|
|
|
|
|
|
for (;;) {
|
2011-04-07 19:09:58 +07:00
|
|
|
if (sched_domain_debug_one(sd, cpu, level, sched_domains_tmpmask))
|
2007-10-24 23:23:48 +07:00
|
|
|
break;
|
2005-04-17 05:20:36 +07:00
|
|
|
level++;
|
|
|
|
sd = sd->parent;
|
2006-12-10 17:20:38 +07:00
|
|
|
if (!sd)
|
2007-10-24 23:23:48 +07:00
|
|
|
break;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2008-05-30 19:23:45 +07:00
|
|
|
#else /* !CONFIG_SCHED_DEBUG */
|
2006-07-03 14:25:40 +07:00
|
|
|
# define sched_domain_debug(sd, cpu) do { } while (0)
|
2012-06-01 02:20:16 +07:00
|
|
|
static inline bool sched_debug(void)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_SCHED_DEBUG */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2005-06-26 04:57:33 +07:00
|
|
|
static int sd_degenerate(struct sched_domain *sd)
|
2005-06-26 04:57:25 +07:00
|
|
|
{
|
2008-11-24 23:05:04 +07:00
|
|
|
if (cpumask_weight(sched_domain_span(sd)) == 1)
|
2005-06-26 04:57:25 +07:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* Following flags need at least 2 groups */
|
|
|
|
if (sd->flags & (SD_LOAD_BALANCE |
|
|
|
|
SD_BALANCE_NEWIDLE |
|
|
|
|
SD_BALANCE_FORK |
|
2006-10-03 15:14:09 +07:00
|
|
|
SD_BALANCE_EXEC |
|
2014-05-28 00:50:41 +07:00
|
|
|
SD_SHARE_CPUCAPACITY |
|
2014-04-11 16:44:40 +07:00
|
|
|
SD_SHARE_PKG_RESOURCES |
|
|
|
|
SD_SHARE_POWERDOMAIN)) {
|
2005-06-26 04:57:25 +07:00
|
|
|
if (sd->groups != sd->groups->next)
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Following flags don't use groups */
|
sched: Merge select_task_rq_fair() and sched_balance_self()
The problem with wake_idle() is that it doesn't respect things like
cpu_power, which means it doesn't deal well with SMT nor the recent
RT interaction.
To cure this, it needs to do what sched_balance_self() does, which
leads to the possibility of merging select_task_rq_fair() and
sched_balance_self().
Modify sched_balance_self() to:
- update_shares() when walking up the domain tree,
(it only called it for the top domain, but it should
have done this anyway), which allows us to remove
this ugly bit from try_to_wake_up().
- do wake_affine() on the smallest domain that contains
both this (the waking) and the prev (the wakee) cpu for
WAKE invocations.
Then use the top-down balance steps it had to replace wake_idle().
This leads to the disappearance of SD_WAKE_BALANCE and
SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE.
SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.
Touch all topology bits to replace the old with new SD flags --
platforms might need re-tuning, enabling SD_BALANCE_WAKE
conditionally on a NUMA distance seems like a good additional
feature, magny-core and small nehalem systems would want this
enabled, systems with slow interconnects would not.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-10 18:50:02 +07:00
|
|
|
if (sd->flags & (SD_WAKE_AFFINE))
|
2005-06-26 04:57:25 +07:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2006-07-03 14:25:40 +07:00
|
|
|
static int
|
|
|
|
sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
|
2005-06-26 04:57:25 +07:00
|
|
|
{
|
|
|
|
unsigned long cflags = sd->flags, pflags = parent->flags;
|
|
|
|
|
|
|
|
if (sd_degenerate(parent))
|
|
|
|
return 1;
|
|
|
|
|
2008-11-24 23:05:04 +07:00
|
|
|
if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))
|
2005-06-26 04:57:25 +07:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Flags needing groups don't count if only 1 group in parent */
|
|
|
|
if (parent->groups == parent->groups->next) {
|
|
|
|
pflags &= ~(SD_LOAD_BALANCE |
|
|
|
|
SD_BALANCE_NEWIDLE |
|
|
|
|
SD_BALANCE_FORK |
|
2006-10-03 15:14:09 +07:00
|
|
|
SD_BALANCE_EXEC |
|
2014-05-28 00:50:41 +07:00
|
|
|
SD_SHARE_CPUCAPACITY |
|
sched/fair: Fix the sd_parent_degenerate() code
I found that on my WSM box I had a redundant domain:
[ 0.949769] CPU0 attaching sched-domain:
[ 0.953765] domain 0: span 0,12 level SIBLING
[ 0.958335] groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[ 0.964548] domain 1: span 0-5,12-17 level MC
[ 0.969206] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[ 0.984993] domain 2: span 0-5,12-17 level CPU
[ 0.989822] groups: 0-5,12-17 (cpu_power = 7055)
[ 0.995049] domain 3: span 0-23 level NUMA
[ 0.999620] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
Note how domain 2 has only a single group and spans the same CPUs as
domain 1. We should not keep such domains and do in fact have code to
prune these.
It turns out that the 'new' SD_PREFER_SIBLING flag causes this, it
makes sd_parent_degenerate() fail on the CPU domain. We can easily
fix this by 'ignoring' the SD_PREFER_SIBLING bit and transferring it
to whatever domain ends up covering the span.
With this patch the domains now look like this:
[ 0.950419] CPU0 attaching sched-domain:
[ 0.954454] domain 0: span 0,12 level SIBLING
[ 0.959039] groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[ 0.965271] domain 1: span 0-5,12-17 level MC
[ 0.969936] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[ 0.985737] domain 2: span 0-23 level NUMA
[ 0.990231] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ys201g4jwukj0h8xcamakxq1@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-19 21:57:04 +07:00
|
|
|
SD_SHARE_PKG_RESOURCES |
|
2014-04-11 16:44:40 +07:00
|
|
|
SD_PREFER_SIBLING |
|
|
|
|
SD_SHARE_POWERDOMAIN);
|
2008-12-08 09:47:37 +07:00
|
|
|
if (nr_node_ids == 1)
|
|
|
|
pflags &= ~SD_SERIALIZE;
|
2005-06-26 04:57:25 +07:00
|
|
|
}
|
|
|
|
if (~cflags & pflags)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
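The core of sd_parent_degenerate() above is the `~cflags & pflags` test: the parent is redundant only if it carries no flag that the child lacks, after flags that need two or more groups have been masked out of a single-group parent. A minimal sketch follows, assuming made-up flag bits (`TOY_*`) and booleans in place of the cpumask and group-list comparisons.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; the kernel's SD_* values differ. */
#define TOY_LOAD_BALANCE	0x1UL
#define TOY_PREFER_SIBLING	0x2UL
#define TOY_WAKE_AFFINE		0x4UL

/* Flags that are meaningless when the parent has only one group. */
#define TOY_NEEDS_GROUPS	(TOY_LOAD_BALANCE | TOY_PREFER_SIBLING)

static bool toy_parent_degenerate(unsigned long cflags, unsigned long pflags,
				  bool same_span, bool parent_single_group)
{
	if (!same_span)
		return false;	/* parent covers different CPUs: keep it */

	if (parent_single_group)
		pflags &= ~TOY_NEEDS_GROUPS;

	/* Any remaining parent flag that the child lacks keeps the parent. */
	return !(~cflags & pflags);
}
```

This also shows why the SD_PREFER_SIBLING fix in the commit message works: once that bit is in the masked set, a single-group parent that differs from its child only by that bit is reported as degenerate.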
|
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
static void free_rootdomain(struct rcu_head *rcu)
|
2008-11-24 23:05:05 +07:00
|
|
|
{
|
2011-04-07 19:09:50 +07:00
|
|
|
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
|
2009-11-16 16:28:09 +07:00
|
|
|
|
2008-11-24 23:05:13 +07:00
|
|
|
cpupri_cleanup(&rd->cpupri);
|
2013-11-07 20:43:47 +07:00
|
|
|
cpudl_cleanup(&rd->cpudl);
|
sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when it is the case.
Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.
The very same approach used in sched_rt is utilised:
- -deadline tasks are kept into CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
* affinity/cpusets settings of all the -deadline tasks is
always respected.
Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.
To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.
In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:38 +07:00
|
|
|
free_cpumask_var(rd->dlo_mask);
|
2008-11-24 23:05:05 +07:00
|
|
|
free_cpumask_var(rd->rto_mask);
|
|
|
|
free_cpumask_var(rd->online);
|
|
|
|
free_cpumask_var(rd->span);
|
|
|
|
kfree(rd);
|
|
|
|
}
|
|
|
|
|
2008-01-26 03:08:18 +07:00
|
|
|
static void rq_attach_root(struct rq *rq, struct root_domain *rd)
|
|
|
|
{
|
2009-02-12 17:35:40 +07:00
|
|
|
struct root_domain *old_rd = NULL;
|
2008-01-26 03:08:18 +07:00
|
|
|
unsigned long flags;
|
|
|
|
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_lock_irqsave(&rq->lock, flags);
|
2008-01-26 03:08:18 +07:00
|
|
|
|
|
|
|
if (rq->rd) {
|
2009-02-12 17:35:40 +07:00
|
|
|
old_rd = rq->rd;
|
2008-01-26 03:08:18 +07:00
|
|
|
|
2008-11-24 23:05:05 +07:00
|
|
|
if (cpumask_test_cpu(rq->cpu, old_rd->online))
|
2008-06-05 02:04:05 +07:00
|
|
|
set_rq_offline(rq);
|
2008-01-26 03:08:18 +07:00
|
|
|
|
2008-11-24 23:05:05 +07:00
|
|
|
cpumask_clear_cpu(rq->cpu, old_rd->span);
|
2008-01-26 03:08:26 +07:00
|
|
|
|
2009-02-12 17:35:40 +07:00
|
|
|
/*
|
2013-11-17 10:12:36 +07:00
|
|
|
 * If we don't want to free the old_rd yet then
|
2009-02-12 17:35:40 +07:00
|
|
|
* set old_rd to NULL to skip the freeing later
|
|
|
|
* in this function:
|
|
|
|
*/
|
|
|
|
if (!atomic_dec_and_test(&old_rd->refcount))
|
|
|
|
old_rd = NULL;
|
2008-01-26 03:08:18 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
atomic_inc(&rd->refcount);
|
|
|
|
rq->rd = rd;
|
|
|
|
|
2008-11-24 23:05:05 +07:00
|
|
|
cpumask_set_cpu(rq->cpu, rd->span);
|
2009-07-30 21:57:23 +07:00
|
|
|
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
|
2008-06-05 02:04:05 +07:00
|
|
|
set_rq_online(rq);
|
2008-01-26 03:08:18 +07:00
|
|
|
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_unlock_irqrestore(&rq->lock, flags);
|
2009-02-12 17:35:40 +07:00
|
|
|
|
|
|
|
if (old_rd)
|
2011-04-07 19:09:50 +07:00
|
|
|
call_rcu_sched(&old_rd->rcu, free_rootdomain);
|
2008-01-26 03:08:18 +07:00
|
|
|
}
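The reference-count handoff in rq_attach_root() above can be modelled in user space with C11 atomics. This is a sketch under stated assumptions: the `toy_*` types are hypothetical, locking is omitted, the new domain is assumed to differ from the old one, and the caller frees the returned domain directly instead of deferring the free through RCU as the original does.

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_domain {
	atomic_int refcount;
};

struct toy_rq {
	struct toy_domain *rd;
};

/*
 * Attach 'rq' to 'rd'. Returns the previous domain if this call dropped
 * its last reference (the caller is then responsible for freeing it),
 * otherwise NULL. Assumes rd != rq->rd.
 */
static struct toy_domain *toy_attach(struct toy_rq *rq, struct toy_domain *rd)
{
	struct toy_domain *old = rq->rd;

	/* atomic_fetch_sub() returns the prior value; 1 means we held the last ref. */
	if (old && atomic_fetch_sub(&old->refcount, 1) != 1)
		old = NULL;	/* still in use elsewhere: skip the free */

	atomic_fetch_add(&rd->refcount, 1);
	rq->rd = rd;
	return old;
}
```

Setting `old` to NULL when other references remain plays the same role as the `old_rd = NULL` assignment in the original: it turns the later free into a no-op.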
|
|
|
|
|
2010-07-16 03:18:22 +07:00
|
|
|
static int init_rootdomain(struct root_domain *rd)
|
2008-01-26 03:08:18 +07:00
|
|
|
{
|
|
|
|
memset(rd, 0, sizeof(*rd));
|
|
|
|
|
2015-12-02 18:52:59 +07:00
|
|
|
if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
|
2009-01-06 16:39:06 +07:00
|
|
|
goto out;
|
2015-12-02 18:52:59 +07:00
|
|
|
if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
|
2008-11-24 23:05:05 +07:00
|
|
|
goto free_span;
|
2015-12-02 18:52:59 +07:00
|
|
|
if (!zalloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
|
2008-11-24 23:05:05 +07:00
|
|
|
goto free_online;
|
2015-12-02 18:52:59 +07:00
|
|
|
if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
|
2013-11-07 20:43:38 +07:00
|
|
|
goto free_dlo_mask;
|
2008-05-13 02:21:01 +07:00
|
|
|
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not overcome per each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones don't!), and thus we don't need a higher-level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
be thus free of oversubscribing the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
|
|
|
init_dl_bw(&rd->dl_bw);
|
2013-11-07 20:43:47 +07:00
|
|
|
if (cpudl_init(&rd->cpudl) != 0)
|
|
|
|
goto free_rto_mask;
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order of deadline scheduling to be effective and useful, it is
important that some method of having the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since when RT-throttling has been introduced each task group have a
bandwidth associated to itself, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls are added, with similar names, equivalent meaning and
the same usage paradigm.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and -deadline tasks
stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
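The admission test described in the commit message above can be sketched as a small stand-alone C fragment. This is an illustrative sketch only, not the code added by the patch: the helper names (to_ratio_sketch, dl_admit_sketch), the fixed-point shift, and the convention that a negative runtime means "bandwidth management disabled" are all assumptions made for the example. The comparison also admits a task when the total lands exactly on the cap, which is a choice of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed fixed-point precision for utilisation values (hypothetical). */
#define BW_SHIFT_SKETCH 20

/*
 * runtime / period as a fixed-point fraction. Returns 0 for a zero period.
 * Overflow for very large runtimes is ignored in this sketch.
 */
static uint64_t to_ratio_sketch(uint64_t period_us, uint64_t runtime_us)
{
	if (period_us == 0)
		return 0;
	return (runtime_us << BW_SHIFT_SKETCH) / period_us;
}

/*
 * Admit a new -deadline task on a root_domain with nr_cpus CPUs only if
 * the already-admitted bandwidth plus the new task's bandwidth does not
 * exceed nr_cpus * (cap_runtime_us / cap_period_us).
 *
 * Assumption: a negative cap_runtime_us models "admission control off".
 */
static bool dl_admit_sketch(unsigned int nr_cpus, int64_t cap_runtime_us,
			    uint64_t cap_period_us, uint64_t total_bw,
			    uint64_t new_bw)
{
	uint64_t cap;

	if (cap_runtime_us < 0)
		return true;

	cap = to_ratio_sketch(cap_period_us, (uint64_t)cap_runtime_us);
	return total_bw + new_bw <= (uint64_t)nr_cpus * cap;
}
```

With a cap of 950000us over 1000000us on a 4-CPU root_domain, a task needing a full CPU is admitted on an empty system but rejected once three CPUs' worth of bandwidth is already allocated.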
2013-11-07 20:43:45 +07:00
|
|
|
|
2010-07-16 03:18:22 +07:00
|
|
|
if (cpupri_init(&rd->cpupri) != 0)
|
2008-11-24 23:05:13 +07:00
|
|
|
goto free_rto_mask;
|
2008-11-24 23:05:05 +07:00
|
|
|
return 0;
|
2008-05-13 02:21:01 +07:00
|
|
|
|
2008-11-24 23:05:13 +07:00
|
|
|
free_rto_mask:
|
|
|
|
free_cpumask_var(rd->rto_mask);
|
sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when that is the case.
Also adds dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations altogether.
The very same approach used in sched_rt is utilised:
- -deadline tasks are kept in CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
* affinity/cpusets settings of all the -deadline tasks are
always respected.
Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.
To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU on which to resume it is selected taking into
account its affinity mask, the system topology, and also its deadline.
E.g., from the scheduling point of view, the best CPU on which to wake
up (and also to which to push) a task is the one which is running the task
with the latest deadline among the M executing ones.
In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
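The CPU selection rule stated in the commit message above (wake up or push to the CPU running the task with the latest deadline, subject to affinity) can be sketched as follows. This is a simplified illustration, not the kernel's implementation: the function name, the flat array of per-CPU current deadlines, and the 64-bit affinity bitmask are assumptions, and system topology and deadline wrap-around are ignored.

```c
#include <stdint.h>

/*
 * Return the CPU whose currently running task has the latest deadline,
 * considering only CPUs set in the affinity bitmask. A CPU qualifies only
 * if its running deadline is later than task_deadline, i.e. the incoming
 * task would preempt there. Returns -1 if no CPU qualifies.
 *
 * curr_deadline[cpu] is assumed to hold the absolute deadline of the task
 * running on that CPU (hypothetical layout for this sketch).
 */
static int find_latest_deadline_cpu(const uint64_t *curr_deadline,
				    int nr_cpus, uint64_t affinity,
				    uint64_t task_deadline)
{
	uint64_t best_dl = task_deadline;
	int best = -1;
	int cpu;

	for (cpu = 0; cpu < nr_cpus && cpu < 64; cpu++) {
		if (!(affinity & (UINT64_C(1) << cpu)))
			continue;	/* not allowed by the affinity mask */
		if (curr_deadline[cpu] > best_dl) {
			best_dl = curr_deadline[cpu];
			best = cpu;
		}
	}
	return best;
}
```

Restricting the search with the affinity mask before comparing deadlines keeps the second goal from the list above (affinity is always respected) ahead of the first.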
2013-11-07 20:43:38 +07:00
|
|
|
free_dlo_mask:
|
|
|
|
free_cpumask_var(rd->dlo_mask);
|
2008-11-24 23:05:05 +07:00
|
|
|
free_online:
|
|
|
|
free_cpumask_var(rd->online);
|
|
|
|
free_span:
|
|
|
|
free_cpumask_var(rd->span);
|
2009-01-06 16:39:06 +07:00
|
|
|
out:
|
2008-11-24 23:05:05 +07:00
|
|
|
return -ENOMEM;
|
2008-01-26 03:08:18 +07:00
|
|
|
}
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
/*
|
|
|
|
* By default the system creates a single root-domain with all cpus as
|
|
|
|
* members (mimicking the global state we have today).
|
|
|
|
*/
|
|
|
|
struct root_domain def_root_domain;
|
|
|
|
|
2008-01-26 03:08:18 +07:00
|
|
|
static void init_defrootdomain(void)
|
|
|
|
{
|
2010-07-16 03:18:22 +07:00
|
|
|
init_rootdomain(&def_root_domain);
|
2008-11-24 23:05:05 +07:00
|
|
|
|
2008-01-26 03:08:18 +07:00
|
|
|
atomic_set(&def_root_domain.refcount, 1);
|
|
|
|
}
|
|
|
|
|
2008-01-26 03:08:26 +07:00
|
|
|
static struct root_domain *alloc_rootdomain(void)
|
2008-01-26 03:08:18 +07:00
|
|
|
{
|
|
|
|
struct root_domain *rd;
|
|
|
|
|
|
|
|
rd = kmalloc(sizeof(*rd), GFP_KERNEL);
|
|
|
|
if (!rd)
|
|
|
|
return NULL;
|
|
|
|
|
2010-07-16 03:18:22 +07:00
|
|
|
if (init_rootdomain(rd) != 0) {
|
2008-11-24 23:05:05 +07:00
|
|
|
kfree(rd);
|
|
|
|
return NULL;
|
|
|
|
}
|
2008-01-26 03:08:18 +07:00
|
|
|
|
|
|
|
return rd;
|
|
|
|
}
|
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
static void free_sched_groups(struct sched_group *sg, int free_sgc)
|
2011-07-15 15:35:52 +07:00
|
|
|
{
|
|
|
|
struct sched_group *tmp, *first;
|
|
|
|
|
|
|
|
if (!sg)
|
|
|
|
return;
|
|
|
|
|
|
|
|
first = sg;
|
|
|
|
do {
|
|
|
|
tmp = sg->next;
|
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))
|
|
|
|
kfree(sg->sgc);
|
2011-07-15 15:35:52 +07:00
|
|
|
|
|
|
|
kfree(sg);
|
|
|
|
sg = tmp;
|
|
|
|
} while (sg != first);
|
|
|
|
}
|
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
static void free_sched_domain(struct rcu_head *rcu)
|
|
|
|
{
|
|
|
|
struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
|
2011-07-15 15:35:52 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If it's an overlapping domain it has private groups, iterate and
|
|
|
|
* nuke them all.
|
|
|
|
*/
|
|
|
|
if (sd->flags & SD_OVERLAP) {
|
|
|
|
free_sched_groups(sd->groups, 1);
|
|
|
|
} else if (atomic_dec_and_test(&sd->groups->ref)) {
|
2014-05-27 05:19:37 +07:00
|
|
|
kfree(sd->groups->sgc);
|
2011-04-07 19:09:50 +07:00
|
|
|
kfree(sd->groups);
|
2011-07-14 18:00:06 +07:00
|
|
|
}
|
2011-04-07 19:09:50 +07:00
|
|
|
kfree(sd);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void destroy_sched_domain(struct sched_domain *sd, int cpu)
|
|
|
|
{
|
|
|
|
call_rcu(&sd->rcu, free_sched_domain);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void destroy_sched_domains(struct sched_domain *sd, int cpu)
|
|
|
|
{
|
|
|
|
for (; sd; sd = sd->parent)
|
|
|
|
destroy_sched_domain(sd, cpu);
|
|
|
|
}
|
|
|
|
|
2011-12-07 21:07:31 +07:00
|
|
|
/*
|
|
|
|
* Keep a special pointer to the highest sched_domain that has
|
|
|
|
* SD_SHARE_PKG_RESOURCES set (Last Level Cache Domain); this
|
|
|
|
* allows us to avoid some pointer chasing in select_idle_sibling().
|
|
|
|
*
|
|
|
|
* Also keep a unique ID per domain (we use the first cpu number in
|
|
|
|
* the cpumask of the domain), this allows us to quickly tell if
|
2012-01-26 18:44:34 +07:00
|
|
|
* two cpus are in the same cache domain, see cpus_share_cache().
|
2011-12-07 21:07:31 +07:00
|
|
|
*/
|
|
|
|
DEFINE_PER_CPU(struct sched_domain *, sd_llc);
|
2013-07-04 11:56:46 +07:00
|
|
|
DEFINE_PER_CPU(int, sd_llc_size);
|
2011-12-07 21:07:31 +07:00
|
|
|
DEFINE_PER_CPU(int, sd_llc_id);
|
2013-10-07 17:29:17 +07:00
|
|
|
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
|
2013-10-30 10:12:52 +07:00
|
|
|
DEFINE_PER_CPU(struct sched_domain *, sd_busy);
|
|
|
|
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
|
2011-12-07 21:07:31 +07:00
|
|
|
|
|
|
|
static void update_top_cache_domain(int cpu)
|
|
|
|
{
|
|
|
|
struct sched_domain *sd;
|
2013-12-17 16:21:25 +07:00
|
|
|
struct sched_domain *busy_sd = NULL;
|
2011-12-07 21:07:31 +07:00
|
|
|
int id = cpu;
|
2013-07-04 11:56:46 +07:00
|
|
|
int size = 1;
|
2011-12-07 21:07:31 +07:00
|
|
|
|
|
|
|
sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
|
2013-07-04 11:56:46 +07:00
|
|
|
if (sd) {
|
2011-12-07 21:07:31 +07:00
|
|
|
id = cpumask_first(sched_domain_span(sd));
|
2013-07-04 11:56:46 +07:00
|
|
|
size = cpumask_weight(sched_domain_span(sd));
|
2013-12-17 16:21:25 +07:00
|
|
|
busy_sd = sd->parent; /* sd_busy */
|
2013-07-04 11:56:46 +07:00
|
|
|
}
|
2013-12-17 16:21:25 +07:00
|
|
|
rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
|
2011-12-07 21:07:31 +07:00
|
|
|
|
|
|
|
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
|
2013-07-04 11:56:46 +07:00
|
|
|
per_cpu(sd_llc_size, cpu) = size;
|
2011-12-07 21:07:31 +07:00
|
|
|
per_cpu(sd_llc_id, cpu) = id;
|
2013-10-07 17:29:17 +07:00
|
|
|
|
|
|
|
sd = lowest_flag_domain(cpu, SD_NUMA);
|
|
|
|
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
|
2013-10-30 10:12:52 +07:00
|
|
|
|
|
|
|
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
|
|
|
|
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
|
2011-12-07 21:07:31 +07:00
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2008-01-26 03:08:19 +07:00
|
|
|
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
|
2005-04-17 05:20:36 +07:00
|
|
|
* hold the hotplug lock.
|
|
|
|
*/
|
2008-01-26 03:08:19 +07:00
|
|
|
static void
|
|
|
|
cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-07-03 14:25:42 +07:00
|
|
|
struct rq *rq = cpu_rq(cpu);
|
2005-06-26 04:57:25 +07:00
|
|
|
struct sched_domain *tmp;
|
|
|
|
|
|
|
|
/* Remove the sched domains which do not contribute to scheduling. */
|
2008-11-06 08:45:16 +07:00
|
|
|
for (tmp = sd; tmp; ) {
|
2005-06-26 04:57:25 +07:00
|
|
|
struct sched_domain *parent = tmp->parent;
|
|
|
|
if (!parent)
|
|
|
|
break;
|
2008-11-06 08:45:16 +07:00
|
|
|
|
2006-10-03 15:14:08 +07:00
|
|
|
if (sd_parent_degenerate(tmp, parent)) {
|
2005-06-26 04:57:25 +07:00
|
|
|
tmp->parent = parent->parent;
|
2006-10-03 15:14:08 +07:00
|
|
|
if (parent->parent)
|
|
|
|
parent->parent->child = tmp;
|
sched/fair: Fix the sd_parent_degenerate() code
I found that on my WSM box I had a redundant domain:
[ 0.949769] CPU0 attaching sched-domain:
[ 0.953765] domain 0: span 0,12 level SIBLING
[ 0.958335] groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[ 0.964548] domain 1: span 0-5,12-17 level MC
[ 0.969206] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[ 0.984993] domain 2: span 0-5,12-17 level CPU
[ 0.989822] groups: 0-5,12-17 (cpu_power = 7055)
[ 0.995049] domain 3: span 0-23 level NUMA
[ 0.999620] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
Note how domain 2 has only a single group and spans the same CPUs as
domain 1. We should not keep such domains and do in fact have code to
prune these.
It turns out that the 'new' SD_PREFER_SIBLING flag causes this: it
makes sd_parent_degenerate() fail on the CPU domain. We can easily
fix this by 'ignoring' the SD_PREFER_SIBLING bit and transferring it
to whatever domain ends up covering the span.
With this patch the domains now look like this:
[ 0.950419] CPU0 attaching sched-domain:
[ 0.954454] domain 0: span 0,12 level SIBLING
[ 0.959039] groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[ 0.965271] domain 1: span 0-5,12-17 level MC
[ 0.969936] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[ 0.985737] domain 2: span 0-23 level NUMA
[ 0.990231] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ys201g4jwukj0h8xcamakxq1@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
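The pruning behaviour described in the commit message above can be illustrated with a toy model. This sketch assumes a much simpler degeneracy test than the real sd_parent_degenerate() (same span and a single group); the struct, flag, and function names are hypothetical stand-ins chosen for the example. It shows only the pointer splice and the transfer of the prefer-sibling property to the surviving domain.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PREFER_SIBLING 0x1u	/* stand-in for SD_PREFER_SIBLING */

struct toy_domain {
	struct toy_domain *parent;
	struct toy_domain *child;
	uint64_t span;			/* bitmask of CPUs covered */
	unsigned int flags;
	unsigned int nr_groups;
};

/* Assumed, simplified test: same span as the child and only one group. */
static int toy_parent_degenerate(const struct toy_domain *sd,
				 const struct toy_domain *parent)
{
	return parent->span == sd->span && parent->nr_groups == 1;
}

/* Splice out redundant parents; returns how many were removed. */
static int toy_prune(struct toy_domain *sd)
{
	struct toy_domain *tmp = sd;
	int removed = 0;

	while (tmp) {
		struct toy_domain *parent = tmp->parent;

		if (!parent)
			break;

		if (toy_parent_degenerate(tmp, parent)) {
			tmp->parent = parent->parent;
			if (parent->parent)
				parent->parent->child = tmp;
			/* The spans match, so the property transfers down. */
			if (parent->flags & TOY_PREFER_SIBLING)
				tmp->flags |= TOY_PREFER_SIBLING;
			removed++;
		} else {
			tmp = tmp->parent;
		}
	}
	return removed;
}
```

Applied to the four-level hierarchy from the log above (SIBLING, MC, CPU, NUMA), the CPU level is spliced out and MC inherits the flag, leaving the three levels shown in the second log.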
2013-08-19 21:57:04 +07:00
|
|
|
/*
|
|
|
|
* Transfer SD_PREFER_SIBLING down in case of a
|
|
|
|
* degenerate parent; the spans match for this
|
|
|
|
* so the property transfers.
|
|
|
|
*/
|
|
|
|
if (parent->flags & SD_PREFER_SIBLING)
|
|
|
|
tmp->flags |= SD_PREFER_SIBLING;
|
2011-04-07 19:09:50 +07:00
|
|
|
destroy_sched_domain(parent, cpu);
|
2008-11-06 08:45:16 +07:00
|
|
|
} else
|
|
|
|
tmp = tmp->parent;
|
2005-06-26 04:57:25 +07:00
|
|
|
}
|
|
|
|
|
2006-10-03 15:14:08 +07:00
|
|
|
if (sd && sd_degenerate(sd)) {
|
2011-04-07 19:09:50 +07:00
|
|
|
tmp = sd;
|
2005-06-26 04:57:25 +07:00
|
|
|
sd = sd->parent;
|
2011-04-07 19:09:50 +07:00
|
|
|
destroy_sched_domain(tmp, cpu);
|
2006-10-03 15:14:08 +07:00
|
|
|
if (sd)
|
|
|
|
sd->child = NULL;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-04-07 19:09:58 +07:00
|
|
|
sched_domain_debug(sd, cpu);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-01-26 03:08:18 +07:00
|
|
|
rq_attach_root(rq, rd);
|
2011-04-07 19:09:50 +07:00
|
|
|
tmp = rq->sd;
|
2005-06-26 04:57:27 +07:00
|
|
|
rcu_assign_pointer(rq->sd, sd);
|
2011-04-07 19:09:50 +07:00
|
|
|
destroy_sched_domains(tmp, cpu);
|
2011-12-07 21:07:31 +07:00
|
|
|
|
|
|
|
update_top_cache_domain(cpu);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Set up the mask of cpus configured for isolated domains */
|
|
|
|
static int __init isolated_cpu_setup(char *str)
|
|
|
|
{
|
2016-02-04 21:38:00 +07:00
|
|
|
int ret;
|
|
|
|
|
2009-12-02 10:39:16 +07:00
|
|
|
alloc_bootmem_cpumask_var(&cpu_isolated_map);
|
2016-02-04 21:38:00 +07:00
|
|
|
ret = cpulist_parse(str, cpu_isolated_map);
|
|
|
|
if (ret) {
|
|
|
|
pr_err("sched: Error, all isolcpus= values must be between 0 and %d\n", nr_cpu_ids);
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
return 1;
|
|
|
|
}
|
2007-10-15 22:00:13 +07:00
|
|
|
__setup("isolcpus=", isolated_cpu_setup);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-08-18 17:51:52 +07:00
|
|
|
struct s_data {
|
2011-04-07 19:09:48 +07:00
|
|
|
struct sched_domain ** __percpu sd;
|
2009-08-18 17:51:52 +07:00
|
|
|
struct root_domain *rd;
|
|
|
|
};
|
|
|
|
|
2009-08-18 17:53:00 +07:00
|
|
|
enum s_alloc {
|
|
|
|
sa_rootdomain,
|
2011-04-07 19:09:48 +07:00
|
|
|
sa_sd,
|
2011-04-07 19:09:50 +07:00
|
|
|
sa_sd_storage,
|
2009-08-18 17:53:00 +07:00
|
|
|
sa_none,
|
|
|
|
};
|
|
|
|
|
2012-05-31 19:47:33 +07:00
|
|
|
/*
|
|
|
|
* Build an iteration mask that can exclude certain CPUs from the upwards
|
|
|
|
* domain traversal.
|
|
|
|
*
|
|
|
|
* Asymmetric node setups can result in situations where the domain tree is of
|
|
|
|
* unequal depth, make sure to skip domains that already cover the entire
|
|
|
|
* range.
|
|
|
|
*
|
|
|
|
* In that case build_sched_domains() will have terminated the iteration early
|
|
|
|
* and our sibling sd spans will be empty. Domains should always include the
|
|
|
|
* cpu they're built on, so check that.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
|
|
|
|
{
|
|
|
|
const struct cpumask *span = sched_domain_span(sd);
|
|
|
|
struct sd_data *sdd = sd->private;
|
|
|
|
struct sched_domain *sibling;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for_each_cpu(i, span) {
|
|
|
|
sibling = *per_cpu_ptr(sdd->sd, i);
|
|
|
|
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
cpumask_set_cpu(i, sched_group_mask(sg));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the canonical balance cpu for this group, this is the first cpu
|
|
|
|
* of this group that's also in the iteration mask.
|
|
|
|
*/
|
|
|
|
int group_balance_cpu(struct sched_group *sg)
|
|
|
|
{
|
|
|
|
return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
|
|
|
|
}
|
|
|
|
|
2011-07-15 15:35:52 +07:00
|
|
|
static int
|
|
|
|
build_overlap_sched_groups(struct sched_domain *sd, int cpu)
|
|
|
|
{
|
|
|
|
struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
|
|
|
|
const struct cpumask *span = sched_domain_span(sd);
|
|
|
|
struct cpumask *covered = sched_domains_tmpmask;
|
|
|
|
struct sd_data *sdd = sd->private;
|
2014-08-02 08:18:03 +07:00
|
|
|
struct sched_domain *sibling;
|
2011-07-15 15:35:52 +07:00
|
|
|
int i;
|
|
|
|
|
|
|
|
cpumask_clear(covered);
|
|
|
|
|
|
|
|
for_each_cpu(i, span) {
|
|
|
|
struct cpumask *sg_span;
|
|
|
|
|
|
|
|
if (cpumask_test_cpu(i, covered))
|
|
|
|
continue;
|
|
|
|
|
2014-08-02 08:18:03 +07:00
|
|
|
sibling = *per_cpu_ptr(sdd->sd, i);
|
2012-05-31 19:47:33 +07:00
|
|
|
|
|
|
|
/* See the comment near build_group_mask(). */
|
2014-08-02 08:18:03 +07:00
|
|
|
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
|
2012-05-31 19:47:33 +07:00
|
|
|
continue;
|
|
|
|
|
2011-07-15 15:35:52 +07:00
|
|
|
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
|
2011-11-19 06:03:29 +07:00
|
|
|
GFP_KERNEL, cpu_to_node(cpu));
|
2011-07-15 15:35:52 +07:00
|
|
|
|
|
|
|
if (!sg)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
sg_span = sched_group_cpus(sg);
|
2014-08-02 08:18:03 +07:00
|
|
|
if (sibling->child)
|
|
|
|
cpumask_copy(sg_span, sched_domain_span(sibling->child));
|
|
|
|
else
|
2011-07-15 15:35:52 +07:00
|
|
|
cpumask_set_cpu(i, sg_span);
|
|
|
|
|
|
|
|
cpumask_or(covered, covered, sg_span);
|
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
sg->sgc = *per_cpu_ptr(sdd->sgc, i);
|
|
|
|
if (atomic_inc_return(&sg->sgc->ref) == 1)
|
2012-05-31 19:47:33 +07:00
|
|
|
build_group_mask(sd, sg);
|
|
|
|
|
2012-05-31 17:05:32 +07:00
|
|
|
/*
|
2014-05-27 05:19:37 +07:00
|
|
|
* Initialize sgc->capacity such that even if we mess up the
|
2012-05-31 17:05:32 +07:00
|
|
|
* domains and no possible iteration will get us here, we won't
|
|
|
|
* die on a /0 trap.
|
|
|
|
*/
|
2014-05-27 05:19:39 +07:00
|
|
|
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
|
2011-07-15 15:35:52 +07:00
|
|
|
|
2012-05-31 19:47:33 +07:00
|
|
|
/*
|
|
|
|
* Make sure the first group of this domain contains the
|
|
|
|
* canonical balance cpu. Otherwise the sched_domain iteration
|
|
|
|
* breaks. See update_sg_lb_stats().
|
|
|
|
*/
|
2012-05-23 23:00:43 +07:00
|
|
|
if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
|
2012-05-31 19:47:33 +07:00
|
|
|
group_balance_cpu(sg) == cpu)
|
2011-07-15 15:35:52 +07:00
|
|
|
groups = sg;
|
|
|
|
|
|
|
|
if (!first)
|
|
|
|
first = sg;
|
|
|
|
if (last)
|
|
|
|
last->next = sg;
|
|
|
|
last = sg;
|
|
|
|
last->next = first;
|
|
|
|
}
|
|
|
|
sd->groups = groups;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
fail:
|
|
|
|
free_sched_groups(first, 0);
|
|
|
|
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-04-07 19:09:50 +07:00
|
|
|
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
|
|
|
|
struct sched_domain *child = sd->child;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
if (child)
|
|
|
|
cpu = cpumask_first(sched_domain_span(child));
|
2006-03-27 16:15:22 +07:00
|
|
|
|
2011-07-14 18:00:06 +07:00
|
|
|
if (sg) {
|
2011-04-07 19:09:50 +07:00
|
|
|
*sg = *per_cpu_ptr(sdd->sg, cpu);
|
2014-05-27 05:19:37 +07:00
|
|
|
(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
|
|
|
|
atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
|
2011-07-14 18:00:06 +07:00
|
|
|
}
|
2011-04-07 19:09:50 +07:00
|
|
|
|
|
|
|
return cpu;
|
2006-03-27 16:15:22 +07:00
|
|
|
}
|
|
|
|
|
2010-08-31 15:28:16 +07:00
|
|
|
/*
|
2011-04-07 19:09:50 +07:00
|
|
|
* build_sched_groups will build a circular linked list of the groups
|
|
|
|
* covered by the given span, and will set each group's ->cpumask correctly,
|
2014-05-27 05:19:38 +07:00
|
|
|
* and ->cpu_capacity to 0.
|
2011-07-15 15:35:52 +07:00
|
|
|
*
|
|
|
|
* Assumes the sched_domain tree is fully constructed
|
2010-08-31 15:28:16 +07:00
|
|
|
*/
|
2011-07-15 15:35:52 +07:00
|
|
|
static int
|
|
|
|
build_sched_groups(struct sched_domain *sd, int cpu)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-04-07 19:09:50 +07:00
|
|
|
struct sched_group *first = NULL, *last = NULL;
|
|
|
|
struct sd_data *sdd = sd->private;
|
|
|
|
const struct cpumask *span = sched_domain_span(sd);
|
2011-04-07 19:09:57 +07:00
|
|
|
struct cpumask *covered;
|
2011-04-07 19:09:50 +07:00
|
|
|
int i;
|
2005-09-07 05:18:14 +07:00
|
|
|
|
2011-07-15 15:35:52 +07:00
|
|
|
get_group(cpu, sdd, &sd->groups);
|
|
|
|
atomic_inc(&sd->groups->ref);
|
|
|
|
|
2013-06-11 18:02:43 +07:00
|
|
|
if (cpu != cpumask_first(span))
|
2011-07-15 15:35:52 +07:00
|
|
|
return 0;
|
|
|
|
|
2011-04-07 19:09:57 +07:00
|
|
|
lockdep_assert_held(&sched_domains_mutex);
|
|
|
|
covered = sched_domains_tmpmask;
|
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
cpumask_clear(covered);
|
2006-12-10 17:20:07 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
for_each_cpu(i, span) {
|
|
|
|
struct sched_group *sg;
|
2013-06-11 18:02:44 +07:00
|
|
|
int group, j;
|
2006-12-10 17:20:07 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
if (cpumask_test_cpu(i, covered))
|
|
|
|
continue;
|
2006-12-10 17:20:07 +07:00
|
|
|
|
2013-06-11 18:02:44 +07:00
|
|
|
group = get_group(i, sdd, &sg);
|
2012-05-31 19:47:33 +07:00
|
|
|
cpumask_setall(sched_group_mask(sg));
|
2009-08-18 18:01:11 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
for_each_cpu(j, span) {
|
|
|
|
if (get_group(j, sdd, NULL) != group)
|
|
|
|
continue;
|
2009-08-18 18:01:11 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
cpumask_set_cpu(j, covered);
|
|
|
|
cpumask_set_cpu(j, sched_group_cpus(sg));
|
|
|
|
}
|
2009-08-18 18:01:11 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
if (!first)
|
|
|
|
first = sg;
|
|
|
|
if (last)
|
|
|
|
last->next = sg;
|
|
|
|
last = sg;
|
|
|
|
}
|
|
|
|
last->next = first;
|
2011-07-15 15:35:52 +07:00
|
|
|
|
|
|
|
return 0;
|
2009-08-18 18:01:11 +07:00
|
|
|
}
|
2006-06-27 16:54:38 +07:00
|
|
|
|
2006-10-03 15:14:09 +07:00
|
|
|
/*
|
2014-05-27 05:19:37 +07:00
|
|
|
* Initialize sched groups cpu_capacity.
|
2006-10-03 15:14:09 +07:00
|
|
|
*
|
2014-05-27 05:19:37 +07:00
|
|
|
* cpu_capacity indicates the capacity of sched group, which is used while
|
2006-10-03 15:14:09 +07:00
|
|
|
* distributing the load between different sched groups in a sched domain.
|
2014-05-27 05:19:37 +07:00
|
|
|
* Typically cpu_capacity for all the groups in a sched domain will be the same
|
|
|
|
* unless there are asymmetries in the topology. If there are asymmetries,
|
|
|
|
* the group having more cpu_capacity will pick up more load compared to the
|
|
|
|
* group having less cpu_capacity.
|
2006-10-03 15:14:09 +07:00
|
|
|
*/
|
2014-05-27 05:19:37 +07:00
|
|
|
static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
|
2006-10-03 15:14:09 +07:00
|
|
|
{
|
2011-07-15 15:35:52 +07:00
|
|
|
struct sched_group *sg = sd->groups;
|
2006-10-03 15:14:09 +07:00
|
|
|
|
2013-06-11 18:02:45 +07:00
|
|
|
WARN_ON(!sg);
|
2011-07-15 15:35:52 +07:00
|
|
|
|
|
|
|
do {
|
|
|
|
sg->group_weight = cpumask_weight(sched_group_cpus(sg));
|
|
|
|
sg = sg->next;
|
|
|
|
} while (sg != sd->groups);
|
2006-10-03 15:14:09 +07:00
|
|
|
|
2012-05-31 19:47:33 +07:00
|
|
|
if (cpu != group_balance_cpu(sg))
|
2011-07-15 15:35:52 +07:00
|
|
|
return;
|
2010-09-18 05:02:32 +07:00
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
update_group_capacity(sd, cpu);
|
|
|
|
atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
|
2006-10-03 15:14:09 +07:00
|
|
|
}
|
|
|
|
|
2008-04-05 08:11:11 +07:00
|
|
|
/*
|
|
|
|
* Initializers for schedule domains
|
|
|
|
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
|
|
|
|
*/
|
|
|
|
|
2008-04-15 12:04:23 +07:00
|
|
|
static int default_relax_domain_level = -1;
|
2011-04-07 19:10:04 +07:00
|
|
|
int sched_domain_level_max;
|
2008-04-15 12:04:23 +07:00
|
|
|
|
|
|
|
static int __init setup_relax_domain_level(char *str)
|
|
|
|
{
|
2012-06-06 01:44:36 +07:00
|
|
|
if (kstrtoint(str, 0, &default_relax_domain_level))
|
|
|
|
pr_warn("Unable to set relax_domain_level\n");
|
2008-05-13 09:27:17 +07:00
|
|
|
|
2008-04-15 12:04:23 +07:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
__setup("relax_domain_level=", setup_relax_domain_level);
|
|
|
|
|
|
|
|
static void set_domain_attribute(struct sched_domain *sd,
|
|
|
|
struct sched_domain_attr *attr)
|
|
|
|
{
|
|
|
|
int request;
|
|
|
|
|
|
|
|
if (!attr || attr->relax_domain_level < 0) {
|
|
|
|
if (default_relax_domain_level < 0)
|
|
|
|
return;
|
|
|
|
else
|
|
|
|
request = default_relax_domain_level;
|
|
|
|
} else
|
|
|
|
request = attr->relax_domain_level;
|
|
|
|
if (request < sd->level) {
|
|
|
|
/* turn off idle balance on this domain */
|
sched: Merge select_task_rq_fair() and sched_balance_self()
The problem with wake_idle() is that it doesn't respect things like
cpu_power, which means it doesn't deal well with SMT nor the recent
RT interaction.
To cure this, it needs to do what sched_balance_self() does, which
leads to the possibility of merging select_task_rq_fair() and
sched_balance_self().
Modify sched_balance_self() to:
- update_shares() when walking up the domain tree,
(it only called it for the top domain, but it should
have done this anyway), which allows us to remove
this ugly bit from try_to_wake_up().
- do wake_affine() on the smallest domain that contains
both this (the waking) and the prev (the wakee) cpu for
WAKE invocations.
Then use the top-down balance steps it had to replace wake_idle().
This leads to the disappearance of SD_WAKE_BALANCE and
SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE.
SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.
Touch all topology bits to replace the old SD flags with the new ones;
platforms might need re-tuning. Enabling SD_BALANCE_WAKE
conditionally on NUMA distance seems like a good additional
feature: magny-core and small Nehalem systems would want this
enabled, systems with slow interconnects would not.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
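The commit message above mentions doing wake_affine() on the smallest domain that contains both the waking cpu and the prev cpu. A minimal sketch of that lookup, assuming a toy domain type with a 64-bit span bitmask (the struct and function names are hypothetical, not kernel API), walks up from the waking CPU's lowest domain until a span covers the previous CPU:

```c
#include <stddef.h>
#include <stdint.h>

struct toy_sd {
	struct toy_sd *parent;
	uint64_t span;		/* bitmask of CPUs covered by this level */
};

/*
 * Starting at the lowest domain of the waking CPU, return the first
 * (smallest) domain whose span also contains prev_cpu, or NULL if no
 * level covers it.
 */
static const struct toy_sd *
smallest_common_domain(const struct toy_sd *lowest, int prev_cpu)
{
	const struct toy_sd *sd;

	if (prev_cpu < 0 || prev_cpu >= 64)
		return NULL;

	for (sd = lowest; sd; sd = sd->parent) {
		if (sd->span & (UINT64_C(1) << prev_cpu))
			return sd;
	}
	return NULL;
}
```

Because each parent's span is expected to be a superset of its child's, the first match found on the way up is the smallest domain shared by both CPUs.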
2009-09-10 18:50:02 +07:00
|
|
|
sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
|
2008-04-15 12:04:23 +07:00
|
|
|
} else {
|
|
|
|
/* turn on idle balance on this domain */
|
2009-09-10 18:50:02 +07:00
|
|
|
sd->flags |= (SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
|
2008-04-15 12:04:23 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-04-07 19:10:03 +07:00
|
|
|
static void __sdt_free(const struct cpumask *cpu_map);
|
|
|
|
static int __sdt_alloc(const struct cpumask *cpu_map);
|
|
|
|
|
2009-08-18 17:53:00 +07:00
|
|
|
static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
|
|
|
|
const struct cpumask *cpu_map)
|
|
|
|
{
|
|
|
|
switch (what) {
|
|
|
|
case sa_rootdomain:
|
2011-04-07 19:09:51 +07:00
|
|
|
if (!atomic_read(&d->rd->refcount))
|
|
|
|
free_rootdomain(&d->rd->rcu); /* fall through */
|
2011-04-07 19:09:48 +07:00
|
|
|
case sa_sd:
|
|
|
|
free_percpu(d->sd); /* fall through */
|
2011-04-07 19:09:50 +07:00
|
|
|
case sa_sd_storage:
|
2011-04-07 19:10:03 +07:00
|
|
|
__sdt_free(cpu_map); /* fall through */
|
2009-08-18 17:53:00 +07:00
|
|
|
case sa_none:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2008-11-24 23:05:03 +07:00
|
|
|
|
2009-08-18 17:53:00 +07:00
|
|
|
static enum s_alloc __visit_domain_allocation_hell(struct s_data *d,
|
|
|
|
const struct cpumask *cpu_map)
|
|
|
|
{
|
2011-04-07 19:09:50 +07:00
|
|
|
memset(d, 0, sizeof(*d));
|
|
|
|
|
2011-04-07 19:10:03 +07:00
|
|
|
if (__sdt_alloc(cpu_map))
|
|
|
|
return sa_sd_storage;
|
2011-04-07 19:09:50 +07:00
|
|
|
d->sd = alloc_percpu(struct sched_domain *);
|
|
|
|
if (!d->sd)
|
|
|
|
return sa_sd_storage;
|
2009-08-18 17:53:00 +07:00
|
|
|
d->rd = alloc_rootdomain();
|
2011-04-07 19:09:50 +07:00
|
|
|
if (!d->rd)
|
2011-04-07 19:09:48 +07:00
|
|
|
return sa_sd;
|
2009-08-18 17:53:00 +07:00
|
|
|
return sa_rootdomain;
|
|
|
|
}
|
2008-01-26 03:08:18 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
/*
|
|
|
|
* NULL the sd_data elements we've used to build the sched_domain and
|
|
|
|
* sched_group structure so that the subsequent __free_domain_allocs()
|
|
|
|
* will not free the data we're using.
|
|
|
|
*/
|
|
|
|
static void claim_allocations(int cpu, struct sched_domain *sd)
|
|
|
|
{
|
|
|
|
struct sd_data *sdd = sd->private;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
|
|
|
|
*per_cpu_ptr(sdd->sd, cpu) = NULL;
|
|
|
|
|
2011-07-15 15:35:52 +07:00
|
|
|
if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
|
2011-04-07 19:09:50 +07:00
|
|
|
*per_cpu_ptr(sdd->sg, cpu) = NULL;
|
2011-07-15 15:35:52 +07:00
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
|
|
|
|
*per_cpu_ptr(sdd->sgc, cpu) = NULL;
|
2011-04-07 19:09:50 +07:00
|
|
|
}
|
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
static int sched_domains_numa_levels;
|
2014-10-17 14:29:50 +07:00
|
|
|
enum numa_topology_type sched_numa_topology_type;
|
2012-04-17 20:49:36 +07:00
|
|
|
static int *sched_domains_numa_distance;
|
2014-10-17 14:29:49 +07:00
|
|
|
int sched_max_numa_distance;
|
2012-04-17 20:49:36 +07:00
|
|
|
static struct cpumask ***sched_domains_numa_masks;
|
|
|
|
static int sched_domains_curr_level;
|
2014-04-11 16:44:37 +07:00
|
|
|
#endif
|
2012-04-17 20:49:36 +07:00
|
|
|
|
2014-04-11 16:44:37 +07:00
|
|
|
/*
|
|
|
|
* SD_flags allowed in topology descriptions.
|
|
|
|
*
|
2014-05-28 00:50:41 +07:00
|
|
|
* SD_SHARE_CPUCAPACITY - describes SMT topologies
|
2014-04-11 16:44:37 +07:00
|
|
|
* SD_SHARE_PKG_RESOURCES - describes shared caches
|
|
|
|
* SD_NUMA - describes NUMA topologies
|
2014-04-11 16:44:40 +07:00
|
|
|
* SD_SHARE_POWERDOMAIN - describes shared power domain
|
2014-04-11 16:44:37 +07:00
|
|
|
*
|
|
|
|
* Odd one out:
|
|
|
|
* SD_ASYM_PACKING - describes SMT quirks
|
|
|
|
*/
|
|
|
|
#define TOPOLOGY_SD_FLAGS \
|
2014-05-28 00:50:41 +07:00
|
|
|
(SD_SHARE_CPUCAPACITY | \
|
2014-04-11 16:44:37 +07:00
|
|
|
SD_SHARE_PKG_RESOURCES | \
|
|
|
|
SD_NUMA | \
|
2014-04-11 16:44:40 +07:00
|
|
|
SD_ASYM_PACKING | \
|
|
|
|
SD_SHARE_POWERDOMAIN)
|
2012-04-17 20:49:36 +07:00
|
|
|
|
|
|
|
static struct sched_domain *
|
2014-04-11 16:44:37 +07:00
|
|
|
sd_init(struct sched_domain_topology_level *tl, int cpu)
|
2012-04-17 20:49:36 +07:00
|
|
|
{
|
|
|
|
struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
|
2014-04-11 16:44:37 +07:00
|
|
|
int sd_weight, sd_flags = 0;
|
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
/*
|
|
|
|
* Ugly hack to pass state to sd_numa_mask()...
|
|
|
|
*/
|
|
|
|
sched_domains_curr_level = tl->numa_level;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
sd_weight = cpumask_weight(tl->mask(cpu));
|
|
|
|
|
|
|
|
if (tl->sd_flags)
|
|
|
|
sd_flags = (*tl->sd_flags)();
|
|
|
|
if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
|
|
|
|
"wrong sd_flags in topology description\n"))
|
|
|
|
sd_flags &= TOPOLOGY_SD_FLAGS;	/* keep only the allowed flags */
|
2012-04-17 20:49:36 +07:00
|
|
|
|
|
|
|
*sd = (struct sched_domain){
|
|
|
|
.min_interval = sd_weight,
|
|
|
|
.max_interval = 2*sd_weight,
|
|
|
|
.busy_factor = 32,
|
2012-05-11 05:26:27 +07:00
|
|
|
.imbalance_pct = 125,
|
2014-04-11 16:44:37 +07:00
|
|
|
|
|
|
|
.cache_nice_tries = 0,
|
|
|
|
.busy_idx = 0,
|
|
|
|
.idle_idx = 0,
|
2012-04-17 20:49:36 +07:00
|
|
|
.newidle_idx = 0,
|
|
|
|
.wake_idx = 0,
|
|
|
|
.forkexec_idx = 0,
|
|
|
|
|
|
|
|
.flags = 1*SD_LOAD_BALANCE
|
|
|
|
| 1*SD_BALANCE_NEWIDLE
|
2014-04-11 16:44:37 +07:00
|
|
|
| 1*SD_BALANCE_EXEC
|
|
|
|
| 1*SD_BALANCE_FORK
|
2012-04-17 20:49:36 +07:00
|
|
|
| 0*SD_BALANCE_WAKE
|
2014-04-11 16:44:37 +07:00
|
|
|
| 1*SD_WAKE_AFFINE
|
2014-05-28 00:50:41 +07:00
|
|
|
| 0*SD_SHARE_CPUCAPACITY
|
2012-04-17 20:49:36 +07:00
|
|
|
| 0*SD_SHARE_PKG_RESOURCES
|
2014-04-11 16:44:37 +07:00
|
|
|
| 0*SD_SERIALIZE
|
2012-04-17 20:49:36 +07:00
|
|
|
| 0*SD_PREFER_SIBLING
|
2014-04-11 16:44:37 +07:00
|
|
|
| 0*SD_NUMA
|
|
|
|
| sd_flags
|
2012-04-17 20:49:36 +07:00
|
|
|
,
|
2014-04-11 16:44:37 +07:00
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
.last_balance = jiffies,
|
|
|
|
.balance_interval = sd_weight,
|
2014-04-11 16:44:37 +07:00
|
|
|
.smt_gain = 0,
|
2014-04-24 08:30:34 +07:00
|
|
|
.max_newidle_lb_cost = 0,
|
|
|
|
.next_decay_max_lb_cost = jiffies,
|
2014-04-11 16:44:37 +07:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
|
|
|
.name = tl->name,
|
|
|
|
#endif
|
2012-04-17 20:49:36 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
2014-04-11 16:44:37 +07:00
|
|
|
* Convert topological properties into behaviour.
|
2012-04-17 20:49:36 +07:00
|
|
|
*/
|
2014-04-11 16:44:37 +07:00
|
|
|
|
2014-05-28 00:50:41 +07:00
|
|
|
if (sd->flags & SD_SHARE_CPUCAPACITY) {
|
2015-02-27 22:54:13 +07:00
|
|
|
sd->flags |= SD_PREFER_SIBLING;
|
2014-04-11 16:44:37 +07:00
|
|
|
sd->imbalance_pct = 110;
|
|
|
|
sd->smt_gain = 1178; /* ~15% */
|
|
|
|
|
|
|
|
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
|
|
|
|
sd->imbalance_pct = 117;
|
|
|
|
sd->cache_nice_tries = 1;
|
|
|
|
sd->busy_idx = 2;
|
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
} else if (sd->flags & SD_NUMA) {
|
|
|
|
sd->cache_nice_tries = 2;
|
|
|
|
sd->busy_idx = 3;
|
|
|
|
sd->idle_idx = 2;
|
|
|
|
|
|
|
|
sd->flags |= SD_SERIALIZE;
|
|
|
|
if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
|
|
|
|
sd->flags &= ~(SD_BALANCE_EXEC |
|
|
|
|
SD_BALANCE_FORK |
|
|
|
|
SD_WAKE_AFFINE);
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif
|
|
|
|
} else {
|
|
|
|
sd->flags |= SD_PREFER_SIBLING;
|
|
|
|
sd->cache_nice_tries = 1;
|
|
|
|
sd->busy_idx = 2;
|
|
|
|
sd->idle_idx = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
sd->private = &tl->data;
|
2012-04-17 20:49:36 +07:00
|
|
|
|
|
|
|
return sd;
|
|
|
|
}
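/*
 * Illustration only, not kernel code: the intent of the sd_flags check in
 * sd_init() is to warn about, and then drop, any flag outside the set that
 * topology descriptions may use. A minimal sketch of that mask-and-keep
 * pattern, assuming made-up flag values (the DEMO_* names and numbers are
 * hypothetical):
 */
```c
/* Hypothetical flag values for illustration; not the kernel's definitions. */
enum {
	DEMO_SHARE_CPUCAPACITY   = 0x01,
	DEMO_SHARE_PKG_RESOURCES = 0x02,
	DEMO_NUMA                = 0x04,
	DEMO_BALANCE_WAKE        = 0x10,	/* not allowed in a topology description */
};

#define DEMO_TOPOLOGY_FLAGS \
	(DEMO_SHARE_CPUCAPACITY | DEMO_SHARE_PKG_RESOURCES | DEMO_NUMA)

/* Return flags with every bit outside the allowed set cleared. */
static unsigned int demo_sanitize_flags(unsigned int flags)
{
	if (flags & ~DEMO_TOPOLOGY_FLAGS)	/* any disallowed bit set? */
		flags &= DEMO_TOPOLOGY_FLAGS;	/* keep only the allowed bits */
	return flags;
}
```
/*
 * Testing against ~ALLOWED detects stray bits; masking with ALLOWED (no
 * complement) is what removes them.
 */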
|
|
|
|
|
2014-04-11 16:44:37 +07:00
|
|
|
/*
|
|
|
|
* Topology list, bottom-up.
|
|
|
|
*/
|
|
|
|
static struct sched_domain_topology_level default_topology[] = {
|
|
|
|
#ifdef CONFIG_SCHED_SMT
|
|
|
|
{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_SCHED_MC
|
|
|
|
{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
|
|
|
|
#endif
|
|
|
|
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
|
|
|
|
{ NULL, },
|
|
|
|
};
|
|
|
|
|
2015-09-22 17:48:59 +07:00
|
|
|
static struct sched_domain_topology_level *sched_domain_topology =
|
|
|
|
default_topology;
|
2014-04-11 16:44:37 +07:00
|
|
|
|
|
|
|
#define for_each_sd_topology(tl) \
|
|
|
|
for (tl = sched_domain_topology; tl->mask; tl++)
|
|
|
|
|
|
|
|
void set_sched_topology(struct sched_domain_topology_level *tl)
|
|
|
|
{
|
|
|
|
sched_domain_topology = tl;
|
|
|
|
}
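/*
 * Illustration only, not kernel code: for_each_sd_topology() walks a table
 * whose last entry has a NULL mask callback. A minimal userspace sketch of
 * that sentinel-terminated iteration; struct demo_level and the demo_*
 * callbacks are hypothetical stand-ins.
 */
```c
#include <stddef.h>

/* Hypothetical stand-in for a topology level: only what the sketch needs. */
struct demo_level {
	int (*mask)(int cpu);	/* NULL marks the end of the table */
	const char *name;
};

static int demo_smt_mask(int cpu) { return cpu / 2; }
static int demo_die_mask(int cpu) { (void)cpu; return 0; }

static struct demo_level demo_topology[] = {
	{ demo_smt_mask, "SMT" },
	{ demo_die_mask, "DIE" },
	{ NULL, NULL },		/* sentinel */
};

#define demo_for_each_level(tl, table) \
	for ((tl) = (table); (tl)->mask; (tl)++)

/* Count the levels in a sentinel-terminated table. */
static int demo_count_levels(struct demo_level *table)
{
	struct demo_level *tl;
	int n = 0;

	demo_for_each_level(tl, table)
		n++;
	return n;
}
```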
|
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
static const struct cpumask *sd_numa_mask(int cpu)
|
|
|
|
{
|
|
|
|
return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
|
|
|
|
}
|
|
|
|
|
2012-06-01 02:20:16 +07:00
|
|
|
static void sched_numa_warn(const char *str)
|
|
|
|
{
|
|
|
|
static bool done = false;
|
|
|
|
int i, j;
|
|
|
|
|
|
|
|
if (done)
|
|
|
|
return;
|
|
|
|
|
|
|
|
done = true;
|
|
|
|
|
|
|
|
printk(KERN_WARNING "ERROR: %s\n\n", str);
|
|
|
|
|
|
|
|
for (i = 0; i < nr_node_ids; i++) {
|
|
|
|
printk(KERN_WARNING " ");
|
|
|
|
for (j = 0; j < nr_node_ids; j++)
|
|
|
|
printk(KERN_CONT "%02d ", node_distance(i, j));
|
|
|
|
printk(KERN_CONT "\n");
|
|
|
|
}
|
|
|
|
printk(KERN_WARNING "\n");
|
|
|
|
}
|
|
|
|
|
2014-10-17 14:29:49 +07:00
|
|
|
bool find_numa_distance(int distance)
|
2012-06-01 02:20:16 +07:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (distance == node_distance(0, 0))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
for (i = 0; i < sched_domains_numa_levels; i++) {
|
|
|
|
if (sched_domains_numa_distance[i] == distance)
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2014-10-17 14:29:50 +07:00
|
|
|
/*
|
|
|
|
* A system can have three types of NUMA topology:
|
|
|
|
* NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
|
|
|
|
* NUMA_GLUELESS_MESH: some nodes reachable through intermediary nodes
|
|
|
|
* NUMA_BACKPLANE: nodes can reach other nodes through a backplane
|
|
|
|
*
|
|
|
|
* The difference between a glueless mesh topology and a backplane
|
|
|
|
* topology lies in whether communication between not directly
|
|
|
|
* connected nodes goes through intermediary nodes (where programs
|
|
|
|
* could run), or through backplane controllers. This affects
|
|
|
|
* placement of programs.
|
|
|
|
*
|
|
|
|
* The type of topology can be discerned with the following tests:
|
|
|
|
* - If the maximum distance between any nodes is 1 hop, the system
|
|
|
|
* is directly connected.
|
|
|
|
* - If for two nodes A and B, located N > 1 hops away from each other,
|
|
|
|
* there is an intermediary node C, which is < N hops away from both
|
|
|
|
* nodes A and B, the system is a glueless mesh.
|
|
|
|
*/
|
|
|
|
static void init_numa_topology_type(void)
|
|
|
|
{
|
|
|
|
int a, b, c, n;
|
|
|
|
|
|
|
|
n = sched_max_numa_distance;
|
|
|
|
|
2015-08-11 08:20:48 +07:00
|
|
|
if (sched_domains_numa_levels <= 1) {
|
2014-10-17 14:29:50 +07:00
|
|
|
sched_numa_topology_type = NUMA_DIRECT;
|
2015-08-11 08:20:48 +07:00
|
|
|
return;
|
|
|
|
}
|
2014-10-17 14:29:50 +07:00
|
|
|
|
|
|
|
for_each_online_node(a) {
|
|
|
|
for_each_online_node(b) {
|
|
|
|
/* Find two nodes furthest removed from each other. */
|
|
|
|
if (node_distance(a, b) < n)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Is there an intermediary node between a and b? */
|
|
|
|
for_each_online_node(c) {
|
|
|
|
if (node_distance(a, c) < n &&
|
|
|
|
node_distance(b, c) < n) {
|
|
|
|
sched_numa_topology_type =
|
|
|
|
NUMA_GLUELESS_MESH;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
sched_numa_topology_type = NUMA_BACKPLANE;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
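/*
 * Illustration only, not kernel code: the classification test described in
 * the comment above, applied to small hand-written distance tables. The
 * matrices, the fixed node count and the demo_* names are assumptions made
 * for the sketch.
 */
```c
#define DEMO_NODES 4

enum demo_numa_type { DEMO_DIRECT, DEMO_GLUELESS_MESH, DEMO_BACKPLANE };

/* Ring 0-1-2-3-0: opposite nodes are reached through a neighbour. */
static const int demo_ring[DEMO_NODES][DEMO_NODES] = {
	{ 10, 20, 30, 20 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

/* Two boards {0,1} and {2,3}: every cross-board hop costs the maximum. */
static const int demo_boards[DEMO_NODES][DEMO_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

static enum demo_numa_type
demo_classify(const int dist[DEMO_NODES][DEMO_NODES], int levels, int max_dist)
{
	int a, b, c;

	if (levels <= 1)
		return DEMO_DIRECT;

	for (a = 0; a < DEMO_NODES; a++) {
		for (b = 0; b < DEMO_NODES; b++) {
			/* Only look at a pair that is maximally far apart. */
			if (dist[a][b] < max_dist)
				continue;

			/* Is some node c closer than max_dist to both a and b? */
			for (c = 0; c < DEMO_NODES; c++) {
				if (dist[a][c] < max_dist && dist[b][c] < max_dist)
					return DEMO_GLUELESS_MESH;
			}
			return DEMO_BACKPLANE;
		}
	}
	/* No pair at max_dist found; the sketch falls back to "direct". */
	return DEMO_DIRECT;
}
```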
|
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
static void sched_init_numa(void)
|
|
|
|
{
|
|
|
|
int next_distance, curr_distance = node_distance(0, 0);
|
|
|
|
struct sched_domain_topology_level *tl;
|
|
|
|
int level = 0;
|
|
|
|
int i, j, k;
|
|
|
|
|
|
|
|
sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL);
|
|
|
|
if (!sched_domains_numa_distance)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* O(nr_nodes^2) deduplicating selection sort -- in order to find the
|
|
|
|
* unique distances in the node_distance() table.
|
|
|
|
*
|
|
|
|
* Assumes node_distance(0,j) includes all distances in
|
|
|
|
* node_distance(i,j) in order to avoid cubic time.
|
|
|
|
*/
|
|
|
|
next_distance = curr_distance;
|
|
|
|
for (i = 0; i < nr_node_ids; i++) {
|
|
|
|
for (j = 0; j < nr_node_ids; j++) {
|
2012-06-01 02:20:16 +07:00
|
|
|
for (k = 0; k < nr_node_ids; k++) {
|
|
|
|
int distance = node_distance(i, k);
|
|
|
|
|
|
|
|
if (distance > curr_distance &&
|
|
|
|
(distance < next_distance ||
|
|
|
|
next_distance == curr_distance))
|
|
|
|
next_distance = distance;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* While not a strong assumption it would be nice to know
|
|
|
|
* about cases where if node A is connected to B, B is not
|
|
|
|
* equally connected to A.
|
|
|
|
*/
|
|
|
|
if (sched_debug() && node_distance(k, i) != distance)
|
|
|
|
sched_numa_warn("Node-distance not symmetric");
|
|
|
|
|
|
|
|
if (sched_debug() && i && !find_numa_distance(distance))
|
|
|
|
sched_numa_warn("Node-0 not representative");
|
|
|
|
}
|
|
|
|
if (next_distance != curr_distance) {
|
|
|
|
sched_domains_numa_distance[level++] = next_distance;
|
|
|
|
sched_domains_numa_levels = level;
|
|
|
|
curr_distance = next_distance;
|
|
|
|
} else break;
|
2012-04-17 20:49:36 +07:00
|
|
|
}
|
2012-06-01 02:20:16 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* In case of sched_debug() we verify the above assumption.
|
|
|
|
*/
|
|
|
|
if (!sched_debug())
|
|
|
|
break;
|
2012-04-17 20:49:36 +07:00
|
|
|
}
|
2014-11-07 21:53:40 +07:00
|
|
|
|
|
|
|
if (!level)
|
|
|
|
return;
|
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
/*
|
|
|
|
* 'level' contains the number of unique distances, excluding the
|
|
|
|
* identity distance node_distance(i,i).
|
|
|
|
*
|
2013-04-05 17:56:46 +07:00
|
|
|
* The sched_domains_numa_distance[] array includes the actual distance
|
2012-04-17 20:49:36 +07:00
|
|
|
* numbers.
|
|
|
|
*/
|
|
|
|
|
2012-09-25 20:12:30 +07:00
|
|
|
/*
|
|
|
|
* Here, we should temporarily reset sched_domains_numa_levels to 0.
|
|
|
|
* If it fails to allocate memory for array sched_domains_numa_masks[][],
|
|
|
|
* the array will contain fewer than 'level' members. This could be
|
|
|
|
* dangerous when we use it to iterate array sched_domains_numa_masks[][]
|
|
|
|
* in other functions.
|
|
|
|
*
|
|
|
|
* We reset it to 'level' at the end of this function.
|
|
|
|
*/
|
|
|
|
sched_domains_numa_levels = 0;
|
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
|
|
|
|
if (!sched_domains_numa_masks)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now for each level, construct a mask per node which contains all
|
|
|
|
* cpus of nodes that are that many hops away from us.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < level; i++) {
|
|
|
|
sched_domains_numa_masks[i] =
|
|
|
|
kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
|
|
|
|
if (!sched_domains_numa_masks[i])
|
|
|
|
return;
|
|
|
|
|
|
|
|
for (j = 0; j < nr_node_ids; j++) {
|
2012-05-25 14:26:43 +07:00
|
|
|
struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
|
2012-04-17 20:49:36 +07:00
|
|
|
if (!mask)
|
|
|
|
return;
|
|
|
|
|
|
|
|
sched_domains_numa_masks[i][j] = mask;
|
|
|
|
|
2016-01-16 02:01:23 +07:00
|
|
|
for_each_node(k) {
|
2012-05-11 05:56:20 +07:00
|
|
|
if (node_distance(j, k) > sched_domains_numa_distance[i])
|
2012-04-17 20:49:36 +07:00
|
|
|
continue;
|
|
|
|
|
|
|
|
cpumask_or(mask, mask, cpumask_of_node(k));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-04-11 16:44:37 +07:00
|
|
|
/* Compute default topology size */
|
|
|
|
for (i = 0; sched_domain_topology[i].mask; i++);
|
|
|
|
|
2014-05-13 16:11:01 +07:00
|
|
|
tl = kzalloc((i + level + 1) *
|
2012-04-17 20:49:36 +07:00
|
|
|
sizeof(struct sched_domain_topology_level), GFP_KERNEL);
|
|
|
|
if (!tl)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Copy the default topology bits..
|
|
|
|
*/
|
2014-04-11 16:44:37 +07:00
|
|
|
for (i = 0; sched_domain_topology[i].mask; i++)
|
|
|
|
tl[i] = sched_domain_topology[i];
|
2012-04-17 20:49:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* .. and append 'level' levels of NUMA goodness.
|
|
|
|
*/
|
|
|
|
for (j = 0; j < level; i++, j++) {
|
|
|
|
tl[i] = (struct sched_domain_topology_level){
|
|
|
|
.mask = sd_numa_mask,
|
2014-04-11 16:44:37 +07:00
|
|
|
.sd_flags = cpu_numa_flags,
|
2012-04-17 20:49:36 +07:00
|
|
|
.flags = SDTL_OVERLAP,
|
|
|
|
.numa_level = j,
|
2014-04-11 16:44:37 +07:00
|
|
|
SD_INIT_NAME(NUMA)
|
2012-04-17 20:49:36 +07:00
|
|
|
};
|
|
|
|
}
|
|
|
|
|
|
|
|
sched_domain_topology = tl;
|
2012-09-25 20:12:30 +07:00
|
|
|
|
|
|
|
sched_domains_numa_levels = level;
|
2014-10-17 14:29:49 +07:00
|
|
|
sched_max_numa_distance = sched_domains_numa_distance[level - 1];
|
2014-10-17 14:29:50 +07:00
|
|
|
|
|
|
|
init_numa_topology_type();
|
2012-04-17 20:49:36 +07:00
|
|
|
}
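/*
 * Illustration only, not kernel code: two steps of sched_init_numa() on a
 * toy distance table. First, the deduplicating selection sort over row 0
 * that yields the unique remote distances in ascending order; second, the
 * per-level mask of nodes that lie within a given distance. Nodes are
 * modelled as bits in an unsigned int instead of cpumasks, and the table
 * and demo_* names are assumptions made for the sketch.
 */
```c
#define DEMO_NODES 4

/* Ring 0-1-2-3-0 with local distance 10. */
static const int demo_ring_dist[DEMO_NODES][DEMO_NODES] = {
	{ 10, 20, 30, 20 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

/*
 * Collect the unique distances in row 0 that exceed the local distance,
 * in ascending order. Returns the number of levels written to out[].
 */
static int demo_unique_distances(const int dist[DEMO_NODES][DEMO_NODES],
				 int out[DEMO_NODES])
{
	int curr = dist[0][0];
	int levels = 0;
	int j, k;

	for (j = 0; j < DEMO_NODES; j++) {
		int next = curr;

		for (k = 0; k < DEMO_NODES; k++) {
			int d = dist[0][k];

			/* smallest distance strictly greater than curr */
			if (d > curr && (d < next || next == curr))
				next = d;
		}
		if (next == curr)
			break;		/* no larger distance left */
		out[levels++] = next;
		curr = next;
	}
	return levels;
}

/* Bit k is set when node k lies within 'limit' of node j. */
static unsigned int demo_level_mask(const int dist[DEMO_NODES][DEMO_NODES],
				    int j, int limit)
{
	unsigned int mask = 0;
	int k;

	for (k = 0; k < DEMO_NODES; k++)
		if (dist[j][k] <= limit)
			mask |= 1u << k;
	return mask;
}
```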
|
2012-09-25 20:12:31 +07:00
|
|
|
|
2016-03-10 18:54:11 +07:00
|
|
|
static void sched_domains_numa_masks_set(unsigned int cpu)
|
2012-09-25 20:12:31 +07:00
|
|
|
{
|
|
|
|
int node = cpu_to_node(cpu);
|
2016-03-10 18:54:11 +07:00
|
|
|
int i, j;
|
2012-09-25 20:12:31 +07:00
|
|
|
|
|
|
|
for (i = 0; i < sched_domains_numa_levels; i++) {
|
|
|
|
for (j = 0; j < nr_node_ids; j++) {
|
|
|
|
if (node_distance(j, node) <= sched_domains_numa_distance[i])
|
|
|
|
cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-03-10 18:54:11 +07:00
|
|
|
static void sched_domains_numa_masks_clear(unsigned int cpu)
|
2012-09-25 20:12:31 +07:00
|
|
|
{
|
|
|
|
int i, j;
|
2016-03-10 18:54:11 +07:00
|
|
|
|
2012-09-25 20:12:31 +07:00
|
|
|
for (i = 0; i < sched_domains_numa_levels; i++) {
|
|
|
|
for (j = 0; j < nr_node_ids; j++)
|
|
|
|
cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
|
|
|
|
}
|
|
|
|
}
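/*
 * Illustration only, not kernel code: how the set/clear helpers above keep
 * the per-level, per-node masks current as a CPU comes and goes. CPUs are
 * modelled as bits in an unsigned int; the three-node line topology and
 * all demo_* names are assumptions made for the sketch.
 */
```c
#define DEMO_NODES  3
#define DEMO_LEVELS 2

/* Hypothetical three-node line topology: 0 - 1 - 2. */
static const int demo_dist[DEMO_NODES][DEMO_NODES] = {
	{ 10, 20, 30 },
	{ 20, 10, 20 },
	{ 30, 20, 10 },
};

/* Unique remote distances, ascending: one entry per NUMA level. */
static const int demo_level_dist[DEMO_LEVELS] = { 20, 30 };

/* demo_masks[level][node]: bit c set means CPU c is within that level of node. */
static unsigned int demo_masks[DEMO_LEVELS][DEMO_NODES];

/* Add 'cpu', which lives on 'node', to every mask whose reach covers it. */
static void demo_masks_set(int cpu, int node)
{
	int i, j;

	for (i = 0; i < DEMO_LEVELS; i++)
		for (j = 0; j < DEMO_NODES; j++)
			if (demo_dist[j][node] <= demo_level_dist[i])
				demo_masks[i][j] |= 1u << cpu;
}

/* Remove 'cpu' from every mask, whichever node it was on. */
static void demo_masks_clear(int cpu)
{
	int i, j;

	for (i = 0; i < DEMO_LEVELS; i++)
		for (j = 0; j < DEMO_NODES; j++)
			demo_masks[i][j] &= ~(1u << cpu);
}
```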
|
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
#else
|
2016-03-10 18:54:11 +07:00
|
|
|
static inline void sched_init_numa(void) { }
|
|
|
|
static void sched_domains_numa_masks_set(unsigned int cpu) { }
|
|
|
|
static void sched_domains_numa_masks_clear(unsigned int cpu) { }
|
2012-04-17 20:49:36 +07:00
|
|
|
#endif /* CONFIG_NUMA */
|
|
|
|
|
2011-04-07 19:10:03 +07:00
|
|
|
static int __sdt_alloc(const struct cpumask *cpu_map)
|
|
|
|
{
|
|
|
|
struct sched_domain_topology_level *tl;
|
|
|
|
int j;
|
|
|
|
|
2013-06-10 17:57:20 +07:00
|
|
|
for_each_sd_topology(tl) {
|
2011-04-07 19:10:03 +07:00
|
|
|
struct sd_data *sdd = &tl->data;
|
|
|
|
|
|
|
|
sdd->sd = alloc_percpu(struct sched_domain *);
|
|
|
|
if (!sdd->sd)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
sdd->sg = alloc_percpu(struct sched_group *);
|
|
|
|
if (!sdd->sg)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
sdd->sgc = alloc_percpu(struct sched_group_capacity *);
|
|
|
|
if (!sdd->sgc)
|
2011-07-14 18:00:06 +07:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2011-04-07 19:10:03 +07:00
|
|
|
for_each_cpu(j, cpu_map) {
|
|
|
|
struct sched_domain *sd;
|
|
|
|
struct sched_group *sg;
|
2014-05-27 05:19:37 +07:00
|
|
|
struct sched_group_capacity *sgc;
|
2011-04-07 19:10:03 +07:00
|
|
|
|
2015-06-11 19:46:50 +07:00
|
|
|
sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
|
2011-04-07 19:10:03 +07:00
|
|
|
GFP_KERNEL, cpu_to_node(j));
|
|
|
|
if (!sd)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
*per_cpu_ptr(sdd->sd, j) = sd;
|
|
|
|
|
|
|
|
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
|
|
|
|
GFP_KERNEL, cpu_to_node(j));
|
|
|
|
if (!sg)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2012-05-09 17:38:28 +07:00
|
|
|
sg->next = sg;
|
|
|
|
|
2011-04-07 19:10:03 +07:00
|
|
|
*per_cpu_ptr(sdd->sg, j) = sg;
|
2011-07-14 18:00:06 +07:00
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),
|
2011-07-14 18:00:06 +07:00
|
|
|
GFP_KERNEL, cpu_to_node(j));
|
2014-05-27 05:19:37 +07:00
|
|
|
if (!sgc)
|
2011-07-14 18:00:06 +07:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2014-05-27 05:19:37 +07:00
|
|
|
*per_cpu_ptr(sdd->sgc, j) = sgc;
|
2011-04-07 19:10:03 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __sdt_free(const struct cpumask *cpu_map)
|
|
|
|
{
|
|
|
|
struct sched_domain_topology_level *tl;
|
|
|
|
int j;
|
|
|
|
|
2013-06-10 17:57:20 +07:00
|
|
|
for_each_sd_topology(tl) {
|
2011-04-07 19:10:03 +07:00
|
|
|
struct sd_data *sdd = &tl->data;
|
|
|
|
|
|
|
|
for_each_cpu(j, cpu_map) {
|
2012-04-25 18:59:21 +07:00
|
|
|
struct sched_domain *sd;
|
|
|
|
|
|
|
|
if (sdd->sd) {
|
|
|
|
sd = *per_cpu_ptr(sdd->sd, j);
|
|
|
|
if (sd && (sd->flags & SD_OVERLAP))
|
|
|
|
free_sched_groups(sd->groups, 0);
|
|
|
|
kfree(*per_cpu_ptr(sdd->sd, j));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sdd->sg)
|
|
|
|
kfree(*per_cpu_ptr(sdd->sg, j));
|
2014-05-27 05:19:37 +07:00
|
|
|
if (sdd->sgc)
|
|
|
|
kfree(*per_cpu_ptr(sdd->sgc, j));
|
2011-04-07 19:10:03 +07:00
|
|
|
}
|
|
|
|
free_percpu(sdd->sd);
|
2012-04-25 18:59:21 +07:00
|
|
|
sdd->sd = NULL;
|
2011-04-07 19:10:03 +07:00
|
|
|
free_percpu(sdd->sg);
|
2012-04-25 18:59:21 +07:00
|
|
|
sdd->sg = NULL;
|
2014-05-27 05:19:37 +07:00
|
|
|
free_percpu(sdd->sgc);
|
|
|
|
sdd->sgc = NULL;
|
2011-04-07 19:10:03 +07:00
|
|
|
}
|
|
|
|
}
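/*
 * Illustration only, not kernel code: __sdt_alloc() returns on the first
 * failed allocation without unwinding, and __sdt_free() is written so that
 * it can be run on fully built or partially built state. A minimal
 * userspace sketch of that pairing follows; the failure-injecting
 * allocator demo_zalloc() and the other demo_* names are hypothetical.
 */
```c
#include <stdlib.h>

#define DEMO_CPUS 4

struct demo_data {
	int **slot;	/* per-CPU array of pointers, NULL if never allocated */
};

/* Hypothetical failure injection: allocations fail once the budget is used up. */
static int demo_alloc_budget = -1;	/* -1 means unlimited */

static void *demo_zalloc(size_t n, size_t size)
{
	if (demo_alloc_budget == 0)
		return NULL;
	if (demo_alloc_budget > 0)
		demo_alloc_budget--;
	return calloc(n, size);
}

/* Returns 0 on success, -1 on failure; partial state is left for demo_free(). */
static int demo_alloc(struct demo_data *d)
{
	int cpu;

	d->slot = demo_zalloc(DEMO_CPUS, sizeof(*d->slot));
	if (!d->slot)
		return -1;

	for (cpu = 0; cpu < DEMO_CPUS; cpu++) {
		d->slot[cpu] = demo_zalloc(1, sizeof(int));
		if (!d->slot[cpu])
			return -1;
	}
	return 0;
}

/* Safe on fully built, partially built, and never-built state. */
static void demo_free(struct demo_data *d)
{
	int cpu;

	if (d->slot) {
		for (cpu = 0; cpu < DEMO_CPUS; cpu++)
			free(d->slot[cpu]);	/* free(NULL) is a no-op */
	}
	free(d->slot);
	d->slot = NULL;
}
```
/*
 * Zeroing the pointer array up front is what lets the free path treat
 * "never allocated" and "allocated" entries uniformly.
 */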
|
|
|
|
|
2011-04-07 19:10:01 +07:00
|
|
|
struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
|
2013-06-04 17:42:43 +07:00
|
|
|
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
|
|
|
|
struct sched_domain *child, int cpu)
|
2011-04-07 19:10:01 +07:00
|
|
|
{
|
2014-04-11 16:44:37 +07:00
|
|
|
struct sched_domain *sd = sd_init(tl, cpu);
|
2011-04-07 19:10:01 +07:00
|
|
|
if (!sd)
|
2011-04-07 19:10:02 +07:00
|
|
|
return child;
|
2011-04-07 19:10:01 +07:00
|
|
|
|
|
|
|
cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
|
2011-04-07 19:10:04 +07:00
|
|
|
if (child) {
|
|
|
|
sd->level = child->level + 1;
|
|
|
|
sched_domain_level_max = max(sched_domain_level_max, sd->level);
|
2011-04-07 19:10:02 +07:00
|
|
|
child->parent = sd;
|
2013-06-10 17:57:19 +07:00
|
|
|
sd->child = child;
|
2014-07-22 16:47:40 +07:00
|
|
|
|
|
|
|
if (!cpumask_subset(sched_domain_span(child),
|
|
|
|
sched_domain_span(sd))) {
|
|
|
|
pr_err("BUG: arch topology borken\n");
|
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
|
|
|
pr_err(" the %s domain not a subset of the %s domain\n",
|
|
|
|
child->name, sd->name);
|
|
|
|
#endif
|
|
|
|
/* Fixup, ensure @sd has at least @child cpus. */
|
|
|
|
cpumask_or(sched_domain_span(sd),
|
|
|
|
sched_domain_span(sd),
|
|
|
|
sched_domain_span(child));
|
|
|
|
}
|
|
|
|
|
2011-04-07 19:10:04 +07:00
|
|
|
}
|
2012-06-06 01:44:36 +07:00
|
|
|
set_domain_attribute(sd, attr);
|
2011-04-07 19:10:01 +07:00
|
|
|
|
|
|
|
return sd;
|
|
|
|
}
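/*
 * Illustration only, not kernel code: the span fixup in build_sched_domain()
 * widens the parent so that it covers the child whenever the child is not
 * already a subset. A minimal sketch with CPUs as bits in an unsigned long;
 * demo_fixup_span() is a hypothetical helper.
 */
```c
/*
 * Ensure *parent contains every CPU in child.
 * Returns 1 if the parent had to be widened, 0 if child was already a subset.
 */
static int demo_fixup_span(unsigned long *parent, unsigned long child)
{
	if ((child & ~*parent) == 0)
		return 0;		/* child is a subset of parent */

	*parent |= child;		/* parent = parent OR child */
	return 1;
}
```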
|
|
|
|
|
2009-08-18 17:53:00 +07:00
|
|
|
/*
|
|
|
|
* Build sched domains for a given set of cpus and attach the sched domains
|
|
|
|
* to the individual cpus
|
|
|
|
*/
|
2011-04-07 19:09:50 +07:00
|
|
|
static int build_sched_domains(const struct cpumask *cpu_map,
|
|
|
|
struct sched_domain_attr *attr)
|
2009-08-18 17:53:00 +07:00
|
|
|
{
|
2013-06-10 17:57:18 +07:00
|
|
|
enum s_alloc alloc_state;
|
2011-04-07 19:09:50 +07:00
|
|
|
struct sched_domain *sd;
|
2009-08-18 17:53:00 +07:00
|
|
|
struct s_data d;
|
2011-04-07 19:09:51 +07:00
|
|
|
int i, ret = -ENOMEM;
|
2005-09-07 05:18:14 +07:00
|
|
|
|
2009-08-18 17:53:00 +07:00
|
|
|
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
|
|
|
|
if (alloc_state != sa_rootdomain)
|
|
|
|
goto error;
|
2005-09-07 05:18:14 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
/* Set up domains for cpus specified by the cpu_map. */
|
2008-11-24 23:05:02 +07:00
|
|
|
for_each_cpu(i, cpu_map) {
|
2011-04-07 19:10:00 +07:00
|
|
|
struct sched_domain_topology_level *tl;
|
|
|
|
|
2011-04-07 19:09:54 +07:00
|
|
|
sd = NULL;
|
2013-06-10 17:57:20 +07:00
|
|
|
for_each_sd_topology(tl) {
|
2013-06-04 17:42:43 +07:00
|
|
|
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
|
2013-06-04 17:11:15 +07:00
|
|
|
if (tl == sched_domain_topology)
|
|
|
|
*per_cpu_ptr(d.sd, i) = sd;
|
2011-07-15 15:35:52 +07:00
|
|
|
if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
|
|
|
|
sd->flags |= SD_OVERLAP;
|
2011-07-20 23:42:57 +07:00
|
|
|
if (cpumask_equal(cpu_map, sched_domain_span(sd)))
|
|
|
|
break;
|
2011-07-15 15:35:52 +07:00
|
|
|
}
|
2011-04-07 19:09:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Build the groups for the domains */
|
|
|
|
for_each_cpu(i, cpu_map) {
|
|
|
|
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
|
|
|
|
sd->span_weight = cpumask_weight(sched_domain_span(sd));
|
2011-07-15 15:35:52 +07:00
|
|
|
if (sd->flags & SD_OVERLAP) {
|
|
|
|
if (build_overlap_sched_groups(sd, i))
|
|
|
|
goto error;
|
|
|
|
} else {
|
|
|
|
if (build_sched_groups(sd, i))
|
|
|
|
goto error;
|
|
|
|
}
|
2011-04-07 19:09:47 +07:00
|
|
|
}
|
2011-04-07 19:09:44 +07:00
|
|
|
}
|
2005-09-07 05:18:14 +07:00
|
|
|
|
2014-05-27 05:19:38 +07:00
|
|
|
/* Calculate CPU capacity for physical packages and nodes */
|
2011-04-07 19:09:49 +07:00
|
|
|
for (i = nr_cpumask_bits-1; i >= 0; i--) {
|
|
|
|
if (!cpumask_test_cpu(i, cpu_map))
|
|
|
|
continue;
|
2005-09-07 05:18:14 +07:00
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
|
|
|
|
claim_allocations(i, sd);
|
2014-05-27 05:19:37 +07:00
|
|
|
init_sched_groups_capacity(i, sd);
|
2011-04-07 19:09:50 +07:00
|
|
|
}
|
2006-07-30 17:02:59 +07:00
|
|
|
}
|
2005-09-07 05:18:14 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* Attach the domains */
|
2011-04-07 19:09:50 +07:00
|
|
|
rcu_read_lock();
|
2008-11-24 23:05:02 +07:00
|
|
|
for_each_cpu(i, cpu_map) {
|
2011-04-07 19:09:48 +07:00
|
|
|
sd = *per_cpu_ptr(d.sd, i);
|
2009-08-18 17:51:52 +07:00
|
|
|
cpu_attach_domain(sd, d.rd, i);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2011-04-07 19:09:50 +07:00
|
|
|
rcu_read_unlock();
|
2006-06-27 16:54:38 +07:00
|
|
|
|
2011-04-07 19:09:51 +07:00
|
|
|
ret = 0;
|
2006-06-27 16:54:38 +07:00
|
|
|
error:
|
2009-08-18 17:53:00 +07:00
|
|
|
__free_domain_allocs(&d, alloc_state, cpu_map);
|
2011-04-07 19:09:51 +07:00
|
|
|
return ret;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2007-10-19 13:40:20 +07:00
|
|
|
|
2009-11-03 11:23:40 +07:00
|
|
|
static cpumask_var_t *doms_cur; /* current sched domains */
|
2007-10-19 13:40:20 +07:00
|
|
|
static int ndoms_cur; /* number of sched domains in 'doms_cur' */
|
2008-05-16 22:47:14 +07:00
|
|
|
static struct sched_domain_attr *dattr_cur;
|
|
|
|
/* attributes of custom domains in 'doms_cur' */
|
2007-10-19 13:40:20 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Special case: If a kmalloc of a doms_cur partition (array of
|
2008-11-24 23:05:12 +07:00
|
|
|
* cpumask) fails, then fall back to a single sched domain,
|
|
|
|
* as determined by the single cpumask fallback_doms.
|
2007-10-19 13:40:20 +07:00
|
|
|
*/
|
2008-11-24 23:05:12 +07:00
|
|
|
static cpumask_var_t fallback_doms;
|
2007-10-19 13:40:20 +07:00
|
|
|
|
2008-12-10 00:49:50 +07:00
|
|
|
/*
|
|
|
|
* arch_update_cpu_topology lets virtualized architectures update the
|
|
|
|
* cpu core maps. It is supposed to return 1 if the topology changed
|
|
|
|
* or 0 if it stayed the same.
|
|
|
|
*/
|
2014-04-08 05:39:20 +07:00
|
|
|
int __weak arch_update_cpu_topology(void)
|
2008-03-13 00:31:59 +07:00
|
|
|
{
|
2008-12-10 00:49:50 +07:00
|
|
|
return 0;
|
2008-03-13 00:31:59 +07:00
|
|
|
}
|
|
|
|
|
2009-11-03 11:23:40 +07:00
|
|
|
cpumask_var_t *alloc_sched_domains(unsigned int ndoms)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
cpumask_var_t *doms;
|
|
|
|
|
|
|
|
doms = kmalloc(sizeof(*doms) * ndoms, GFP_KERNEL);
|
|
|
|
if (!doms)
|
|
|
|
return NULL;
|
|
|
|
for (i = 0; i < ndoms; i++) {
|
|
|
|
if (!alloc_cpumask_var(&doms[i], GFP_KERNEL)) {
|
|
|
|
free_sched_domains(doms, i);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return doms;
|
|
|
|
}
|
|
|
|
|
|
|
|
void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms)
|
|
|
|
{
|
|
|
|
unsigned int i;
|
|
|
|
for (i = 0; i < ndoms; i++)
|
|
|
|
free_cpumask_var(doms[i]);
|
|
|
|
kfree(doms);
|
|
|
|
}
|
|
|
|
|
2005-06-26 04:57:33 +07:00
|
|
|
/*
|
2007-12-05 21:46:09 +07:00
|
|
|
* Set up scheduler domains and groups. Callers must hold the hotplug lock.
|
2007-10-19 13:40:20 +07:00
|
|
|
* For now this just excludes isolated cpus, but could be used to
|
|
|
|
* exclude other special cases in the future.
|
2005-06-26 04:57:33 +07:00
|
|
|
*/
|
2011-04-07 19:09:42 +07:00
|
|
|
static int init_sched_domains(const struct cpumask *cpu_map)
|
2005-06-26 04:57:33 +07:00
|
|
|
{
|
2007-10-24 23:23:48 +07:00
|
|
|
int err;
|
|
|
|
|
2008-03-13 00:31:59 +07:00
|
|
|
arch_update_cpu_topology();
|
2007-10-19 13:40:20 +07:00
|
|
|
ndoms_cur = 1;
|
2009-11-03 11:23:40 +07:00
|
|
|
doms_cur = alloc_sched_domains(ndoms_cur);
|
2007-10-19 13:40:20 +07:00
|
|
|
if (!doms_cur)
|
2009-11-03 11:23:40 +07:00
|
|
|
doms_cur = &fallback_doms;
|
|
|
|
cpumask_andnot(doms_cur[0], cpu_map, cpu_isolated_map);
|
2011-04-07 19:09:50 +07:00
|
|
|
err = build_sched_domains(doms_cur[0], NULL);
|
2007-10-15 22:00:19 +07:00
|
|
|
register_sched_domain_sysctl();
|
2007-10-24 23:23:48 +07:00
|
|
|
|
|
|
|
return err;
|
2005-06-26 04:57:33 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Detach sched domains from a group of cpus specified in cpu_map.
|
|
|
|
* These cpus will now be attached to the NULL domain.
|
|
|
|
*/
|
2008-11-24 23:05:14 +07:00
|
|
|
static void detach_destroy_domains(const struct cpumask *cpu_map)
|
2005-06-26 04:57:33 +07:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2011-04-07 19:09:50 +07:00
|
|
|
rcu_read_lock();
|
2008-11-24 23:05:02 +07:00
|
|
|
for_each_cpu(i, cpu_map)
|
2008-01-26 03:08:18 +07:00
|
|
|
cpu_attach_domain(NULL, &def_root_domain, i);
|
2011-04-07 19:09:50 +07:00
|
|
|
rcu_read_unlock();
|
2005-06-26 04:57:33 +07:00
|
|
|
}
|
|
|
|
|
2008-04-15 12:04:23 +07:00
|
|
|
/* handle null as "default" */
|
|
|
|
static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
|
|
|
|
struct sched_domain_attr *new, int idx_new)
|
|
|
|
{
|
|
|
|
struct sched_domain_attr tmp;
|
|
|
|
|
|
|
|
/* fast path */
|
|
|
|
if (!new && !cur)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
tmp = SD_ATTR_INIT;
|
|
|
|
return !memcmp(cur ? (cur + idx_cur) : &tmp,
|
|
|
|
new ? (new + idx_new) : &tmp,
|
|
|
|
sizeof(struct sched_domain_attr));
|
|
|
|
}
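/*
 * Illustration only, not kernel code: the "NULL means default" comparison
 * in dattrs_equal(), reduced to a one-field attribute. struct demo_attr
 * and the default value of -1 are assumptions made for the sketch.
 */
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute with a single field, so there is no padding. */
struct demo_attr {
	int relax_domain_level;
};

#define DEMO_ATTR_INIT ((struct demo_attr){ .relax_domain_level = -1 })

/* Compare entry idx_cur of cur[] with entry idx_new of new_attrs[]. */
static int demo_attrs_equal(const struct demo_attr *cur, int idx_cur,
			    const struct demo_attr *new_attrs, int idx_new)
{
	struct demo_attr def = DEMO_ATTR_INIT;

	/* fast path: both sides use the default */
	if (!cur && !new_attrs)
		return 1;

	/* a missing array stands for "every entry is the default" */
	return !memcmp(cur ? cur + idx_cur : &def,
		       new_attrs ? new_attrs + idx_new : &def,
		       sizeof(struct demo_attr));
}
```
/*
 * A bytewise memcmp() is only a sound equality test when the structure has
 * no padding bytes, which holds for this single-int sketch.
 */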
|
|
|
|
|
2007-10-19 13:40:20 +07:00
|
|
|
/*
|
|
|
|
* Partition sched domains as specified by the 'ndoms_new'
|
2007-12-05 21:46:09 +07:00
|
|
|
* cpumasks in the array doms_new[] of cpumasks. This compares
|
2007-10-19 13:40:20 +07:00
|
|
|
* doms_new[] to the current sched domain partitioning, doms_cur[].
|
|
|
|
* It destroys each deleted domain and builds each new domain.
|
|
|
|
*
|
2009-11-03 11:23:40 +07:00
|
|
|
* 'doms_new' is an array of cpumask_var_t's of length 'ndoms_new'.
|
2007-12-05 21:46:09 +07:00
|
|
|
* The masks don't intersect (don't overlap). We should set up one
|
|
|
|
* sched domain for each mask. CPUs not in any of the cpumasks will
|
|
|
|
* not be load balanced. If the same cpumask appears both in the
|
2007-10-19 13:40:20 +07:00
|
|
|
* current 'doms_cur' domains and in the new 'doms_new', we can leave
|
|
|
|
* it as it is.
|
|
|
|
*
|
2009-11-03 11:23:40 +07:00
|
|
|
* The passed in 'doms_new' should be allocated using
|
|
|
|
* alloc_sched_domains. This routine takes ownership of it and will
|
|
|
|
* free_sched_domains it when done with it. If the caller failed the
|
|
|
|
* alloc call, then it can pass in doms_new == NULL && ndoms_new == 1,
|
|
|
|
* and partition_sched_domains() will fall back to the single partition
|
|
|
|
* 'fallback_doms'; it also forces the domains to be rebuilt.
|
2007-10-19 13:40:20 +07:00
|
|
|
*
|
2008-11-24 23:05:14 +07:00
|
|
|
* If doms_new == NULL it will be replaced with cpu_online_mask.
|
2008-11-18 13:02:03 +07:00
|
|
|
* ndoms_new == 0 is a special case for destroying existing domains,
|
|
|
|
* and it will not create the default domain.
|
2008-08-30 03:11:41 +07:00
|
|
|
*
|
2007-10-19 13:40:20 +07:00
|
|
|
* Call with hotplug lock held
|
|
|
|
*/
|
2009-11-03 11:23:40 +07:00
|
|
|
void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
|
2008-04-15 12:04:23 +07:00
|
|
|
struct sched_domain_attr *dattr_new)
|
2007-10-19 13:40:20 +07:00
|
|
|
{
|
2008-08-30 03:11:41 +07:00
|
|
|
int i, j, n;
|
2008-12-10 00:49:51 +07:00
|
|
|
int new_topology;
|
2007-10-19 13:40:20 +07:00
|
|
|
|
2008-04-28 16:33:07 +07:00
|
|
|
mutex_lock(&sched_domains_mutex);
|
2008-01-26 03:08:00 +07:00
|
|
|
|
2007-10-24 23:23:48 +07:00
|
|
|
/* always unregister in case we don't destroy any domains */
|
|
|
|
unregister_sched_domain_sysctl();
|
|
|
|
|
2008-12-10 00:49:51 +07:00
|
|
|
/* Let architecture update cpu core mappings. */
|
|
|
|
new_topology = arch_update_cpu_topology();
|
|
|
|
|
2008-08-30 03:11:41 +07:00
|
|
|
n = doms_new ? ndoms_new : 0;
|
2007-10-19 13:40:20 +07:00
|
|
|
|
|
|
|
/* Destroy deleted domains */
|
|
|
|
for (i = 0; i < ndoms_cur; i++) {
|
2008-12-10 00:49:51 +07:00
|
|
|
for (j = 0; j < n && !new_topology; j++) {
|
2009-11-03 11:23:40 +07:00
|
|
|
if (cpumask_equal(doms_cur[i], doms_new[j])
|
2008-04-15 12:04:23 +07:00
|
|
|
&& dattrs_equal(dattr_cur, i, dattr_new, j))
|
2007-10-19 13:40:20 +07:00
|
|
|
goto match1;
|
|
|
|
}
|
|
|
|
/* no match - a current sched domain not in new doms_new[] */
|
2009-11-03 11:23:40 +07:00
|
|
|
detach_destroy_domains(doms_cur[i]);
|
2007-10-19 13:40:20 +07:00
|
|
|
match1:
|
|
|
|
;
|
|
|
|
}
|
|
|
|
|
2013-08-06 19:06:42 +07:00
|
|
|
n = ndoms_cur;
|
2008-07-15 18:43:49 +07:00
|
|
|
if (doms_new == NULL) {
|
2013-08-06 19:06:42 +07:00
|
|
|
n = 0;
|
2009-11-03 11:23:40 +07:00
|
|
|
doms_new = &fallback_doms;
|
2009-11-25 19:31:39 +07:00
|
|
|
cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
|
2008-11-04 15:20:23 +07:00
|
|
|
WARN_ON_ONCE(dattr_new);
|
2008-07-15 18:43:49 +07:00
|
|
|
}
|
|
|
|
|
2007-10-19 13:40:20 +07:00
|
|
|
/* Build new domains */
|
|
|
|
for (i = 0; i < ndoms_new; i++) {
|
2013-08-06 19:06:42 +07:00
|
|
|
for (j = 0; j < n && !new_topology; j++) {
|
2009-11-03 11:23:40 +07:00
|
|
|
if (cpumask_equal(doms_new[i], doms_cur[j])
|
2008-04-15 12:04:23 +07:00
|
|
|
&& dattrs_equal(dattr_new, i, dattr_cur, j))
|
2007-10-19 13:40:20 +07:00
|
|
|
goto match2;
|
|
|
|
}
|
|
|
|
/* no match - add a new doms_new */
|
2011-04-07 19:09:50 +07:00
|
|
|
build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
|
2007-10-19 13:40:20 +07:00
|
|
|
match2:
|
|
|
|
;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Remember the new sched domains */
|
2009-11-03 11:23:40 +07:00
|
|
|
if (doms_cur != &fallback_doms)
|
|
|
|
free_sched_domains(doms_cur, ndoms_cur);
|
2008-04-15 12:04:23 +07:00
|
|
|
kfree(dattr_cur); /* kfree(NULL) is safe */
|
2007-10-19 13:40:20 +07:00
|
|
|
doms_cur = doms_new;
|
2008-04-15 12:04:23 +07:00
|
|
|
dattr_cur = dattr_new;
|
2007-10-19 13:40:20 +07:00
|
|
|
ndoms_cur = ndoms_new;
|
2007-10-24 23:23:48 +07:00
|
|
|
|
|
|
|
register_sched_domain_sysctl();
|
2008-01-26 03:08:00 +07:00
|
|
|
|
2008-04-28 16:33:07 +07:00
|
|
|
mutex_unlock(&sched_domains_mutex);
|
2007-10-19 13:40:20 +07:00
|
|
|
}
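/*
 * Illustration only, not kernel code: the two matching passes in
 * partition_sched_domains() amount to a set difference in each direction.
 * The sketch below counts the unmatched entries with CPUs as bits in an
 * unsigned long. It ignores the attribute comparison and the new_topology
 * override, and demo_count_unmatched() is a hypothetical helper.
 */
```c
/*
 * Count the masks in a[] that have no identical mask in b[].
 * With (a, b) = (current, new) this is the number of domains to destroy;
 * with (a, b) = (new, current) it is the number of domains to build.
 */
static int demo_count_unmatched(const unsigned long *a, int na,
				const unsigned long *b, int nb)
{
	int i, j, count = 0;

	for (i = 0; i < na; i++) {
		for (j = 0; j < nb; j++) {
			if (a[i] == b[j])
				break;		/* unchanged partition */
		}
		if (j == nb)
			count++;		/* no match found */
	}
	return count;
}
```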
|
|
|
|
|
CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.
However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.
In order to achieve that, do the following:
1. Don't modify cpusets during suspend/resume. At all.
In particular, don't move the tasks from one cpuset to another, and
don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
during the CPU hotplug operations that are carried out in the
suspend/resume path.
2. However, cpusets and sched domains are related. We just want to avoid
altering cpusets alone. So, to keep the sched domains updated, build
a single sched domain (containing all active cpus) during each of the
CPU hotplug operations carried out in s/r path, effectively ignoring
the cpusets' cpus_allowed masks.
(Since userspace is frozen while doing all this, it will go unnoticed.)
3. During the last CPU online operation during resume, build the sched
domains by looking up the (unaltered) cpusets' cpus_allowed masks.
That will bring back the system to the same original state as it was in
before suspend.
Ultimately, this will not only solve the cpuset problem related to
suspend/resume (i.e., restore the cpusets to exactly what they were before
suspend, by not touching them at all) but also speed up suspend/resume because
we avoid running cpuset update code for every CPU being offlined/onlined.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-24 21:16:26 +07:00
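The three steps in the commit message above can be sketched as a small state machine around a counter. This is an illustration under stated assumptions, not the kernel code: `struct hotplug_state`, `enum rebuild_kind`, `cpu_going_down()` and `cpu_coming_up()` are hypothetical names, and the `frozen` argument stands in for the "we are on the suspend/resume path" condition.

```c
#include <stdbool.h>

/* What kind of sched-domain rebuild a hotplug event should trigger. */
enum rebuild_kind {
	REBUILD_SINGLE_DOMAIN,	/* one domain spanning all active CPUs */
	REBUILD_FROM_CPUSETS,	/* consult the (untouched) cpuset masks */
};

struct hotplug_state {
	int num_cpus_frozen;	/* CPUs taken down by suspend and not yet back */
};

/* A CPU is going offline; frozen is true only on the suspend path. */
static enum rebuild_kind cpu_going_down(struct hotplug_state *st, bool frozen)
{
	if (!frozen)
		return REBUILD_FROM_CPUSETS;	/* regular hotplug updates cpusets */

	st->num_cpus_frozen++;
	return REBUILD_SINGLE_DOMAIN;		/* step 2: ignore cpusets */
}

/* A CPU is coming online; frozen is true only on the resume path. */
static enum rebuild_kind cpu_coming_up(struct hotplug_state *st, bool frozen)
{
	if (frozen) {
		st->num_cpus_frozen--;
		if (st->num_cpus_frozen > 0)
			return REBUILD_SINGLE_DOMAIN;
		/* Step 3: last online operation of resume, fall through. */
	}
	return REBUILD_FROM_CPUSETS;
}
```

Because the counter is incremented once per CPU taken down during suspend, it reaches zero exactly when the last of those CPUs comes back, which is the only point at which the cpuset configuration is consulted again.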
|
|
|
static int num_cpus_frozen; /* used to mark begin/end of suspend/resume */
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2010-06-09 02:40:36 +07:00
|
|
|
* Update cpusets according to cpu_active mask. If cpusets are
|
|
|
|
* disabled, cpuset_update_active_cpus() becomes a simple wrapper
|
|
|
|
* around partition_sched_domains().
|
2012-05-24 21:16:26 +07:00
|
|
|
*
|
|
|
|
* If we come here as part of a suspend/resume, don't touch cpusets because we
|
|
|
|
* want to restore them to their original state upon resume anyway.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2016-03-10 18:54:13 +07:00
|
|
|
static void cpuset_cpu_active(void)
|
2008-07-15 18:43:49 +07:00
|
|
|
{
|
2016-03-10 18:54:13 +07:00
|
|
|
if (cpuhp_tasks_frozen) {
|
2012-05-24 21:16:26 +07:00
|
|
|
/*
|
|
|
|
* num_cpus_frozen tracks how many CPUs are involved in suspend
|
|
|
|
* resume sequence. As long as this is not the last online
|
|
|
|
* operation in the resume sequence, just build a single sched
|
|
|
|
* domain, ignoring cpusets.
|
|
|
|
*/
|
|
|
|
num_cpus_frozen--;
|
|
|
|
if (likely(num_cpus_frozen)) {
|
|
|
|
partition_sched_domains(1, NULL, NULL);
|
2016-03-10 18:54:11 +07:00
|
|
|
return;
|
2012-05-24 21:16:26 +07:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* This is the last CPU online operation. So fall through and
|
|
|
|
* restore the original sched domains by considering the
|
|
|
|
* cpuset configurations.
|
|
|
|
*/
|
2010-06-09 02:40:36 +07:00
|
|
|
}
|
2016-03-10 18:54:11 +07:00
|
|
|
cpuset_update_active_cpus(true);
|
2010-06-09 02:40:36 +07:00
|
|
|
}
|
2008-07-15 18:43:49 +07:00
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
static int cpuset_cpu_inactive(unsigned int cpu)
|
2010-06-09 02:40:36 +07:00
|
|
|
{
|
2015-03-31 15:53:37 +07:00
|
|
|
unsigned long flags;
|
|
|
|
struct dl_bw *dl_b;
|
2015-05-04 17:09:36 +07:00
|
|
|
bool overflow;
|
|
|
|
int cpus;
|
2015-03-31 15:53:37 +07:00
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
if (!cpuhp_tasks_frozen) {
|
2015-05-04 17:09:36 +07:00
|
|
|
rcu_read_lock_sched();
|
|
|
|
dl_b = dl_bw_of(cpu);
|
2015-03-31 15:53:37 +07:00
|
|
|
|
2015-05-04 17:09:36 +07:00
|
|
|
raw_spin_lock_irqsave(&dl_b->lock, flags);
|
|
|
|
cpus = dl_bw_cpus(cpu);
|
|
|
|
overflow = __dl_overflow(dl_b, cpus, 0, 0);
|
|
|
|
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
|
2015-03-31 15:53:37 +07:00
|
|
|
|
2015-05-04 17:09:36 +07:00
|
|
|
rcu_read_unlock_sched();
|
2015-03-31 15:53:37 +07:00
|
|
|
|
2015-05-04 17:09:36 +07:00
|
|
|
if (overflow)
|
2016-03-10 18:54:11 +07:00
|
|
|
return -EBUSY;
|
2012-05-24 21:16:55 +07:00
|
|
|
cpuset_update_active_cpus(false);
|
2016-03-10 18:54:11 +07:00
|
|
|
} else {
|
2012-05-24 21:16:26 +07:00
|
|
|
num_cpus_frozen++;
|
|
|
|
partition_sched_domains(1, NULL, NULL);
|
2008-07-15 18:43:49 +07:00
|
|
|
}
|
2016-03-10 18:54:11 +07:00
|
|
|
return 0;
|
2008-07-15 18:43:49 +07:00
|
|
|
}
|
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
int sched_cpu_activate(unsigned int cpu)
|
2016-03-10 18:54:11 +07:00
|
|
|
{
|
2016-03-10 18:54:17 +07:00
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
unsigned long flags;
|
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
set_cpu_active(cpu, true);
|
2016-03-10 18:54:11 +07:00
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
if (sched_smp_initialized) {
|
2016-03-10 18:54:11 +07:00
|
|
|
sched_domains_numa_masks_set(cpu);
|
2016-03-10 18:54:13 +07:00
|
|
|
cpuset_cpu_active();
|
2008-07-15 18:43:49 +07:00
|
|
|
}
|
2016-03-10 18:54:17 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Put the rq online, if not already. This happens:
|
|
|
|
*
|
|
|
|
* 1) In the early boot process, because we build the real domains
|
|
|
|
* after all cpus have been brought up.
|
|
|
|
*
|
|
|
|
* 2) At runtime, if cpuset_cpu_active() fails to rebuild the
|
|
|
|
* domains.
|
|
|
|
*/
|
|
|
|
raw_spin_lock_irqsave(&rq->lock, flags);
|
|
|
|
if (rq->rd) {
|
|
|
|
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
|
|
|
|
set_rq_online(rq);
|
|
|
|
}
|
|
|
|
raw_spin_unlock_irqrestore(&rq->lock, flags);
|
|
|
|
|
|
|
|
update_max_interval();
|
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
return 0;
|
2016-03-10 18:54:11 +07:00
|
|
|
}
|
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
int sched_cpu_deactivate(unsigned int cpu)
|
2016-03-10 18:54:11 +07:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2016-03-10 18:54:13 +07:00
|
|
|
set_cpu_active(cpu, false);
|
2016-03-10 18:54:14 +07:00
|
|
|
/*
|
|
|
|
* We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
|
|
|
|
* users of this state to go away such that all new such users will
|
|
|
|
* observe it.
|
|
|
|
*
|
|
|
|
* For CONFIG_PREEMPT we have preemptible RCU and its sync_rcu() might
|
|
|
|
* not imply sync_sched(), so wait for both.
|
|
|
|
*
|
|
|
|
* Do the sync before parking the smpboot threads to take care of the RCU boost case.
|
|
|
|
*/
|
|
|
|
if (IS_ENABLED(CONFIG_PREEMPT))
|
|
|
|
synchronize_rcu_mult(call_rcu, call_rcu_sched);
|
|
|
|
else
|
|
|
|
synchronize_rcu();
|
2016-03-10 18:54:13 +07:00
|
|
|
|
|
|
|
if (!sched_smp_initialized)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = cpuset_cpu_inactive(cpu);
|
|
|
|
if (ret) {
|
|
|
|
set_cpu_active(cpu, true);
|
|
|
|
return ret;
|
2016-03-10 18:54:11 +07:00
|
|
|
}
|
2016-03-10 18:54:13 +07:00
|
|
|
sched_domains_numa_masks_clear(cpu);
|
|
|
|
return 0;
|
2016-03-10 18:54:11 +07:00
|
|
|
}
|
|
|
|
|
2016-03-10 18:54:15 +07:00
|
|
|
static void sched_rq_cpu_starting(unsigned int cpu)
|
|
|
|
{
|
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
|
|
|
|
rq->calc_load_update = calc_load_update;
|
|
|
|
update_max_interval();
|
|
|
|
}
|
|
|
|
|
2016-03-10 18:54:11 +07:00
|
|
|
int sched_cpu_starting(unsigned int cpu)
|
|
|
|
{
|
|
|
|
set_cpu_rq_start_time(cpu);
|
2016-03-10 18:54:15 +07:00
|
|
|
sched_rq_cpu_starting(cpu);
|
2016-03-10 18:54:11 +07:00
|
|
|
return 0;
|
2008-07-15 18:43:49 +07:00
|
|
|
}
|
|
|
|
|
2016-03-10 18:54:18 +07:00
|
|
|
#ifdef CONFIG_HOTPLUG_CPU
|
|
|
|
int sched_cpu_dying(unsigned int cpu)
|
|
|
|
{
|
|
|
|
struct rq *rq = cpu_rq(cpu);
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
/* Handle pending wakeups and then migrate everything off */
|
|
|
|
sched_ttwu_pending();
|
|
|
|
raw_spin_lock_irqsave(&rq->lock, flags);
|
|
|
|
if (rq->rd) {
|
|
|
|
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
|
|
|
|
set_rq_offline(rq);
|
|
|
|
}
|
|
|
|
migrate_tasks(rq);
|
|
|
|
BUG_ON(rq->nr_running != 1);
|
|
|
|
raw_spin_unlock_irqrestore(&rq->lock, flags);
|
|
|
|
calc_load_migrate(rq);
|
|
|
|
update_max_interval();
|
2016-03-10 18:54:20 +07:00
|
|
|
nohz_balance_exit_idle(cpu);
|
2016-03-10 18:54:21 +07:00
|
|
|
hrtick_clear(rq);
|
2016-03-10 18:54:18 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
void __init sched_init_smp(void)
|
|
|
|
{
|
2008-11-24 23:05:12 +07:00
|
|
|
cpumask_var_t non_isolated_cpus;
|
|
|
|
|
|
|
|
alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
|
2009-09-14 19:20:16 +07:00
|
|
|
alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
|
2006-10-03 15:14:04 +07:00
|
|
|
|
2012-04-17 20:49:36 +07:00
|
|
|
sched_init_numa();
|
|
|
|
|
2013-10-11 19:38:20 +07:00
|
|
|
/*
|
|
|
|
* There's no userspace yet to cause hotplug operations; hence all the
|
|
|
|
* cpu masks are stable and all blatant races in the below code cannot
|
|
|
|
* happen.
|
|
|
|
*/
|
2008-04-28 16:33:07 +07:00
|
|
|
mutex_lock(&sched_domains_mutex);
|
2011-04-07 19:09:42 +07:00
|
|
|
init_sched_domains(cpu_active_mask);
|
2008-11-24 23:05:12 +07:00
|
|
|
cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
|
|
|
|
if (cpumask_empty(non_isolated_cpus))
|
|
|
|
cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
|
2008-04-28 16:33:07 +07:00
|
|
|
mutex_unlock(&sched_domains_mutex);
|
2008-07-15 18:43:49 +07:00
|
|
|
|
2006-10-03 15:14:04 +07:00
|
|
|
/* Move init over to a non-isolated CPU */
|
2008-11-24 23:05:12 +07:00
|
|
|
if (set_cpus_allowed_ptr(current, non_isolated_cpus) < 0)
|
2006-10-03 15:14:04 +07:00
|
|
|
BUG();
|
2007-11-10 04:39:38 +07:00
|
|
|
sched_init_granularity();
|
2008-11-24 23:05:12 +07:00
|
|
|
free_cpumask_var(non_isolated_cpus);
|
2008-11-24 23:05:12 +07:00
|
|
|
|
2008-11-24 23:05:13 +07:00
|
|
|
init_sched_rt_class();
|
sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate when that is the case.
Also adds dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its ability to
migrate, or forbidding migrations altogether.
The very same approach used in sched_rt is utilised:
- -deadline tasks are kept in CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
* affinity/cpuset settings of all the -deadline tasks are
always respected.
Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.
To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU on which to resume it is selected taking into
account its affinity mask, the system topology, and its deadline.
E.g., from the scheduling point of view, the best CPU on which to wake
up a task (and to which to push it) is the one running the task
with the latest deadline among the M executing ones.
In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:38 +07:00
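The CPU selection rule described in the commit message above (pick the CPU whose running task has the latest deadline, subject to the affinity mask) can be sketched as a linear scan. This is a simplified illustration, not the kernel's implementation: `find_latest_deadline_cpu()` is a hypothetical helper, deadlines are plain absolute values with no wraparound handling, every CPU is assumed to be running a -deadline task, and system topology is ignored.

```c
#include <stdint.h>

#define NO_CPU (-1)

/*
 * curr_deadline[cpu] is the absolute deadline of the task currently running
 * on that CPU. affinity is a bitmask of CPUs the waking task may run on.
 * Returns the allowed CPU whose running task has the latest deadline,
 * provided that deadline is later than task_deadline; otherwise NO_CPU.
 */
static int find_latest_deadline_cpu(const uint64_t *curr_deadline, int ncpus,
				    unsigned long affinity,
				    uint64_t task_deadline)
{
	int best = NO_CPU;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (!(affinity & (1UL << cpu)))
			continue;	/* not allowed by the affinity mask */
		if (curr_deadline[cpu] <= task_deadline)
			continue;	/* running task is at least as urgent */
		if (best == NO_CPU || curr_deadline[cpu] > curr_deadline[best])
			best = cpu;
	}
	return best;
}
```

Preempting the latest-deadline runner is what keeps the M earliest-deadline ready tasks running on an M-CPU system; a linear scan like this one is only for illustration, and the commit message itself mentions per-runqueue caching to make the decision cheaper.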
|
|
|
init_sched_dl_class();
|
2016-03-10 18:54:10 +07:00
|
|
|
sched_smp_initialized = true;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2016-03-10 18:54:10 +07:00
|
|
|
|
|
|
|
static int __init migration_init(void)
|
|
|
|
{
|
2016-03-10 18:54:15 +07:00
|
|
|
sched_rq_cpu_starting(smp_processor_id());
|
2016-03-10 18:54:10 +07:00
|
|
|
return 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2016-03-10 18:54:10 +07:00
|
|
|
early_initcall(migration_init);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#else
|
|
|
|
void __init sched_init_smp(void)
|
|
|
|
{
|
2007-11-10 04:39:38 +07:00
|
|
|
sched_init_granularity();
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
#endif /* CONFIG_SMP */
|
|
|
|
|
|
|
|
int in_sched_functions(unsigned long addr)
|
|
|
|
{
|
|
|
|
return in_lock_functions(addr) ||
|
|
|
|
(addr >= (unsigned long)__sched_text_start
|
|
|
|
&& addr < (unsigned long)__sched_text_end);
|
|
|
|
}
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
#ifdef CONFIG_CGROUP_SCHED
|
2013-03-05 15:07:52 +07:00
|
|
|
/*
|
|
|
|
* Default task group.
|
|
|
|
* Every task in system belongs to this group at bootup.
|
|
|
|
*/
|
2011-10-25 15:00:11 +07:00
|
|
|
struct task_group root_task_group;
|
2012-08-07 10:00:13 +07:00
|
|
|
LIST_HEAD(task_groups);
|
2015-12-03 01:41:49 +07:00
|
|
|
|
|
|
|
/* Cacheline aligned slab cache for task_group */
|
|
|
|
static struct kmem_cache *task_group_cache __read_mostly;
|
2008-02-13 21:45:40 +07:00
|
|
|
#endif
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2013-04-23 15:27:41 +07:00
|
|
|
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
void __init sched_init(void)
|
|
|
|
{
|
2007-07-09 23:51:59 +07:00
|
|
|
int i, j;
|
2008-04-05 08:11:04 +07:00
|
|
|
unsigned long alloc_size = 0, ptr;
|
|
|
|
|
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
|
|
|
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
|
|
|
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
|
|
|
|
#endif
|
|
|
|
if (alloc_size) {
|
2009-06-11 03:42:36 +07:00
|
|
|
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
|
2008-04-05 08:11:04 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
2011-01-07 14:17:36 +07:00
|
|
|
root_task_group.se = (struct sched_entity **)ptr;
|
2008-04-05 08:11:04 +07:00
|
|
|
ptr += nr_cpu_ids * sizeof(void **);
|
|
|
|
|
2011-01-07 14:17:36 +07:00
|
|
|
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
|
2008-04-05 08:11:04 +07:00
|
|
|
ptr += nr_cpu_ids * sizeof(void **);
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_FAIR_GROUP_SCHED */
|
2008-04-05 08:11:04 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2011-01-07 14:17:36 +07:00
|
|
|
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
|
2008-04-05 08:11:04 +07:00
|
|
|
ptr += nr_cpu_ids * sizeof(void **);
|
|
|
|
|
2011-01-07 14:17:36 +07:00
|
|
|
root_task_group.rt_rq = (struct rt_rq **)ptr;
|
2008-04-20 00:45:00 +07:00
|
|
|
ptr += nr_cpu_ids * sizeof(void **);
|
|
|
|
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_RT_GROUP_SCHED */
|
2014-12-19 01:44:30 +07:00
|
|
|
}
|
2009-03-19 11:52:20 +07:00
|
|
|
#ifdef CONFIG_CPUMASK_OFFSTACK
|
2014-12-19 01:44:30 +07:00
|
|
|
for_each_possible_cpu(i) {
|
|
|
|
per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
|
|
|
|
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
|
2008-04-05 08:11:04 +07:00
|
|
|
}
|
2014-12-19 01:44:30 +07:00
|
|
|
#endif /* CONFIG_CPUMASK_OFFSTACK */
|
2007-07-09 23:51:59 +07:00
|
|
|
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
In order for deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not overcome per each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
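The admission bound in the commit message above, M * (sched_dl_runtime_us / sched_dl_period_us), can be checked with integer arithmetic by expressing each bandwidth as a fixed-point ratio. The sketch below is an assumption-laden illustration: the helper names, the 20-bit fixed-point shift, and the exact comparison are choices made for this example and are not taken from the lines shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20	/* fixed-point fraction bits; an assumption of this sketch */

/* Express runtime/period as a fixed-point fraction of one CPU. */
static uint64_t to_ratio(uint64_t period_us, uint64_t runtime_us)
{
	if (period_us == 0)
		return 0;
	return (runtime_us << BW_SHIFT) / period_us;
}

/*
 * Return true when admitting a task with bandwidth new_bw would push the
 * total beyond the cap of a root domain with `cpus` CPUs, where per_cpu_bw
 * is the fraction of each CPU available to -deadline tasks.
 */
static bool dl_would_overflow(uint64_t per_cpu_bw, int cpus,
			      uint64_t total_bw, uint64_t new_bw)
{
	return per_cpu_bw * (uint64_t)cpus < total_bw + new_bw;
}
```

For example, with 950000 us of runtime per 1000000 us period on a two-CPU root domain, tasks that each need 60 ms every 100 ms fit three times but not four, since 4 * 0.6 exceeds 2 * 0.95.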
|
|
|
init_rt_bandwidth(&def_rt_bandwidth,
|
|
|
|
global_rt_period(), global_rt_runtime());
|
|
|
|
init_dl_bandwidth(&def_dl_bandwidth,
|
2013-12-17 18:44:49 +07:00
|
|
|
global_rt_period(), global_rt_runtime());
|
2013-11-07 20:43:45 +07:00
|
|
|
|
2008-01-26 03:08:18 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
init_defrootdomain();
|
|
|
|
#endif
|
|
|
|
|
2008-04-20 00:44:57 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2011-01-07 14:17:36 +07:00
|
|
|
init_rt_bandwidth(&root_task_group.rt_bandwidth,
|
2008-04-20 00:44:57 +07:00
|
|
|
global_rt_period(), global_rt_runtime());
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_RT_GROUP_SCHED */
|
2008-04-20 00:44:57 +07:00
|
|
|
|
2010-01-20 19:26:18 +07:00
|
|
|
#ifdef CONFIG_CGROUP_SCHED
|
2015-12-03 01:41:49 +07:00
|
|
|
task_group_cache = KMEM_CACHE(task_group, 0);
|
|
|
|
|
2011-01-07 14:17:36 +07:00
|
|
|
list_add(&root_task_group.list, &task_groups);
|
|
|
|
INIT_LIST_HEAD(&root_task_group.children);
|
2011-11-02 04:19:07 +07:00
|
|
|
INIT_LIST_HEAD(&root_task_group.siblings);
|
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable by setting its nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of a nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
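A minimal sketch of the "weight of a nice <level> task" mapping that the message refers to. The table values are quoted from the mainline nice-to-weight table and should be checked against the kernel tree in use; the names toy_nice_to_weight and nice_to_weight() are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/*
 * Illustrative copy of the kernel's nice-to-weight table, nice -20..19.
 * Assumption: values match the kernel being studied; verify before relying
 * on them.
 */
static const unsigned int toy_nice_to_weight[40] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/* Hypothetical helper: weight that a task at the given nice level carries. */
static unsigned int nice_to_weight(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return toy_nice_to_weight[nice + 20];
}
```

Under this table, writing 10 to /proc/<pid>/autogroup would correspond to a group weight of 110, roughly a tenth of a nice 0 group.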
2010-11-30 20:18:03 +07:00
|
|
|
autogroup_init(&init_task);
|
2010-01-20 19:26:18 +07:00
|
|
|
#endif /* CONFIG_CGROUP_SCHED */
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2006-03-28 16:56:37 +07:00
|
|
|
for_each_possible_cpu(i) {
|
2006-07-03 14:25:42 +07:00
|
|
|
struct rq *rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
rq = cpu_rq(i);
|
2009-11-17 20:28:38 +07:00
|
|
|
raw_spin_lock_init(&rq->lock);
|
2005-06-26 04:57:13 +07:00
|
|
|
rq->nr_running = 0;
|
2009-04-11 15:43:41 +07:00
|
|
|
rq->calc_load_active = 0;
|
|
|
|
rq->calc_load_update = jiffies + LOAD_FREQ;
|
2011-07-14 23:32:43 +07:00
|
|
|
init_cfs_rq(&rq->cfs);
|
2015-03-03 18:50:27 +07:00
|
|
|
init_rt_rq(&rq->rt);
|
|
|
|
init_dl_rq(&rq->dl);
|
2007-07-09 23:51:59 +07:00
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
2011-10-25 15:00:11 +07:00
|
|
|
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
|
2008-01-26 03:08:30 +07:00
|
|
|
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
|
2008-04-20 00:44:59 +07:00
|
|
|
/*
|
2011-01-07 14:17:36 +07:00
|
|
|
* How much cpu bandwidth does root_task_group get?
|
2008-04-20 00:44:59 +07:00
|
|
|
*
|
|
|
|
* In case of task-groups formed through the cgroup filesystem, it
|
|
|
|
* gets 100% of the cpu resources in the system. This overall
|
|
|
|
* system cpu resource is divided among the tasks of
|
2011-01-07 14:17:36 +07:00
|
|
|
* root_task_group and its child task-groups in a fair manner,
|
2008-04-20 00:44:59 +07:00
|
|
|
* based on each entity's (task or task-group's) weight
|
|
|
|
* (se->load.weight).
|
|
|
|
*
|
2011-01-07 14:17:36 +07:00
|
|
|
* In other words, if root_task_group has 10 tasks (of weight
|
2008-04-20 00:44:59 +07:00
|
|
|
* 1024) and two child groups A0 and A1 (of weight 1024 each),
|
|
|
|
* then A0's share of the cpu resource is:
|
|
|
|
*
|
2009-05-05 00:13:30 +07:00
|
|
|
* A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
|
2008-04-20 00:44:59 +07:00
|
|
|
*
|
2011-01-07 14:17:36 +07:00
|
|
|
* We achieve this by letting root_task_group's tasks sit
|
|
|
|
* directly in rq->cfs (i.e. root_task_group->se[] = NULL).
|
2008-04-20 00:44:59 +07:00
|
|
|
*/
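The arithmetic in the comment above can be checked with a small userspace sketch. It assumes every entity at the root level has weight 1024, as in the comment; share_bp() and a0_share_bp() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Weight of a nice-0 entity, matching the 1024 used in the comment. */
#define NICE_0_WEIGHT 1024UL

/*
 * Share of one entity in hundredths of a percent: its weight divided by
 * the sum of all weights queued at the same level (integer division).
 */
static unsigned long share_bp(unsigned long weight,
			      const unsigned long *weights, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += weights[i];
	assert(total > 0);
	return weight * 10000UL / total;
}

/* The comment's scenario: 10 root tasks plus child groups A0 and A1. */
static unsigned long a0_share_bp(void)
{
	unsigned long w[12];
	size_t i;

	for (i = 0; i < 12; i++)
		w[i] = NICE_0_WEIGHT;
	return share_bp(w[10], w, 12);
}
```

a0_share_bp() yields 833, i.e. 8.33%, which agrees with the figure in the comment.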
|
2011-07-21 23:43:28 +07:00
|
|
|
init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
|
2011-01-07 14:17:36 +07:00
|
|
|
init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
|
2008-04-20 00:44:59 +07:00
|
|
|
#endif /* CONFIG_FAIR_GROUP_SCHED */
|
|
|
|
|
|
|
|
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
|
2008-02-13 21:45:40 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2011-01-07 14:17:36 +07:00
|
|
|
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
|
2007-07-09 23:51:59 +07:00
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-09 23:51:59 +07:00
|
|
|
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
|
|
|
|
rq->cpu_load[j] = 0;
|
2010-05-18 08:14:43 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2005-06-26 04:57:24 +07:00
|
|
|
rq->sd = NULL;
|
2008-01-26 03:08:18 +07:00
|
|
|
rq->rd = NULL;
|
2015-02-27 22:54:09 +07:00
|
|
|
rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
|
2015-06-11 19:46:37 +07:00
|
|
|
rq->balance_callback = NULL;
|
2005-04-17 05:20:36 +07:00
|
|
|
rq->active_balance = 0;
|
2007-07-09 23:51:59 +07:00
|
|
|
rq->next_balance = jiffies;
|
2005-04-17 05:20:36 +07:00
|
|
|
rq->push_cpu = 0;
|
2006-09-26 13:30:51 +07:00
|
|
|
rq->cpu = i;
|
2008-06-05 02:04:05 +07:00
|
|
|
rq->online = 0;
|
2009-11-10 09:50:02 +07:00
|
|
|
rq->idle_stamp = 0;
|
|
|
|
rq->avg_idle = 2*sysctl_sched_migration_cost;
|
2013-09-14 01:26:52 +07:00
|
|
|
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
|
2012-02-21 03:49:09 +07:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&rq->cfs_tasks);
|
|
|
|
|
2008-01-26 03:08:26 +07:00
|
|
|
rq_attach_root(rq, &def_root_domain);
|
2011-08-11 04:21:01 +07:00
|
|
|
#ifdef CONFIG_NO_HZ_COMMON
|
2016-04-19 22:36:51 +07:00
|
|
|
rq->last_load_update_tick = jiffies;
|
2011-12-02 08:07:32 +07:00
|
|
|
rq->nohz_flags = 0;
|
2010-05-22 07:09:41 +07:00
|
|
|
#endif
|
2013-05-03 08:39:05 +07:00
|
|
|
#ifdef CONFIG_NO_HZ_FULL
|
|
|
|
rq->last_sched_tick = 0;
|
|
|
|
#endif
|
2016-04-19 22:36:51 +07:00
|
|
|
#endif /* CONFIG_SMP */
|
2008-01-26 03:08:29 +07:00
|
|
|
init_rq_hrtick(rq);
|
2005-04-17 05:20:36 +07:00
|
|
|
atomic_set(&rq->nr_iowait, 0);
|
|
|
|
}
|
|
|
|
|
[PATCH] sched: implement smpnice
Problem:
The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.
For the sake of argument suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now
suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
2 nice==0 hard spinners running. The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each. However, whether this expectation is met is pretty
much down to luck as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced
with loads of 2.0 for each CPU. Two of these distributions involve one
nice==0 and one nice==19 task per CPU and in these circumstances the user's
expectations will be met. The other two distributions both involve both
nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
each task will get 50% of a CPU and the user's expectations will not be
met.
Solution:
The solution to this problem that is implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance. Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example. Suppose that (in a slight variation of the
above example) we have a two CPU system with 4 nice==0 and 4 nice==19
hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
the 4 nice==19 tasks are on the other CPU. The weighted loads for the two
CPUs would be 4.0 and 0.2 respectively and the load balancing code would
move 2 tasks resulting in one CPU with a load of 2.0 and the other with
load of 2.2. If this was considered to be a big enough imbalance to
justify moving a task and that task was moved using the current
move_tasks() then it would move the highest priority task that it found and
this would result in one CPU with a load of 3.0 and the other with a load
of 1.2 which would result in the movement of a task in the opposite
direction and so on -- infinite loop. If, on the other hand, an amount of
load to be moved is calculated from the imbalance (in this case 0.1) and
move_tasks() skips tasks until it finds ones whose contributions to the
weighted load are less than this amount, it would move two of the nice==19
tasks resulting in a system with 2 nice==0 and 2 nice==19 on each CPU with
loads of 2.1 for each CPU.
One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations would be mathematically
identical to the current load balancing code.
Notes:
struct task_struct:
has a new field load_weight which (in a trade off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.
struct runqueue:
has a new field raw_weighted_load which is the sum of the load_weight
values for the currently runnable tasks on this run queue. This field
always needs to be updated when nr_running is updated so two new inline
functions inc_nr_running() and dec_nr_running() have been created to make
sure that this happens. This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.
int try_to_wake_up():
in this function the value SCHED_LOAD_SCALE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on. While this would be valid
on a system where the nice values for the runnable tasks were distributed
evenly around zero it will lead to anomalous load balancing if the
distribution is skewed in either direction. To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).
int move_tasks():
The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists. This precluded the simple
overloading of max_nr_move with max_load_move and necessitated the addition
of the latter as an extra argument to the function. The internal
implementation is then modified to move up to max_nr_move tasks and
max_load_move of weighted load. This slightly complicates the code where
move_tasks() is called and if ever active_load_balance() is changed to not
use move_tasks() the implementation of move_tasks() should be simplified
accordingly.
struct sched_group *find_busiest_group():
Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues are created. A similar solution is adopted except
that it is now the average per task contribution to a group's load (as
opposed to a run queue) that is required. As this value is not directly
available from the group it is calculated on the fly as the queues in the
groups are visited when determining the busiest group.
A key change to this function is that it no longer needs to scale down
*imbalance on exit as move_tasks() uses the load in its scaled form.
void set_user_nice():
has been modified to update the task's load_weight field when its nice
value changes and also to ensure that its run queue's raw_weighted_load field is
updated if it was runnable.
From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
With smpnice, sched groups with highest priority tasks can mask the imbalance
between the other sched groups within the same domain. This patch fixes some
of the scenarios listed below by not considering the sched groups which are
lightly loaded.
a) on a simple 4-way MP system, if we have one high priority and 4 normal
priority tasks, with smpnice we would like to see the high priority task
scheduled on one cpu, two other cpus getting one normal task each and the
fourth cpu getting the remaining two normal tasks. but with current
smpnice extra normal priority task keeps jumping from one cpu to another
cpu having the normal priority task. This is because of the
busiest_has_loaded_cpus, nr_loaded_cpus logic.. We are not including the
cpu with high priority task in max_load calculations but including that in
total and avg_load calculations.. leading to max_load < avg_load and load
balance between cpus running normal priority tasks(2 Vs 1) will always show
imbalance as one normal priority and the extra normal priority task will
keep moving from one cpu to another cpu having normal priority task..
b) 4-way system with HT (8 logical processors). Package-P0 T0 has a
highest priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal
priority task each.. P2 and P3 are idle. With this patch, one of the
normal priority tasks on P1 will be moved to P2 or P3..
c) With the current weighted smp nice calculations, it doesn't always make
sense to look at the highest weighted runqueue in the busy group..
Consider a load balance scenario on a DP with HT system, with Package-0
containing one high priority and one low priority, Package-1 containing one
low priority(with other thread being idle).. Package-1 thinks that it needs
to take the low priority thread from Package-0. And find_busiest_queue()
returns the cpu thread with highest priority task.. And ultimately(with
help of active load balance) we move high priority task to Package-1. And
same continues with Package-0 now, moving high priority task from package-1
to package-0.. Even without the presence of active load balance, load
balance will fail to balance the above scenario.. Fix find_busiest_queue
to use "imbalance" when it is lightly loaded.
[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
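The second example in the smpnice description (loads of 2.0 and 2.2 settling at 2.1 each) can be reproduced with a toy model. This is a sketch under stated assumptions, not the kernel's move_tasks(): weights are scaled by 100 (nice==0 is 100, nice==19 is 5, following the 95%/5% split in the message), and struct toy_rq, weighted_load() and balance() are hypothetical names.

```c
#include <assert.h>

/* Illustrative weights scaled by 100: nice==0 -> 1.00, nice==19 -> 0.05. */
enum { W_NICE0 = 100, W_NICE19 = 5 };

struct toy_rq {
	unsigned int nr_nice0;
	unsigned int nr_nice19;
};

static unsigned int weighted_load(const struct toy_rq *rq)
{
	return rq->nr_nice0 * W_NICE0 + rq->nr_nice19 * W_NICE19;
}

/*
 * Move up to half the load difference from busiest to idlest, heaviest
 * tasks first, skipping any task whose weight exceeds what is left to move.
 */
static void balance(struct toy_rq *busiest, struct toy_rq *idlest)
{
	unsigned int max_load_move;

	assert(weighted_load(busiest) >= weighted_load(idlest));
	max_load_move = (weighted_load(busiest) - weighted_load(idlest)) / 2;

	while (busiest->nr_nice0 && max_load_move >= W_NICE0) {
		busiest->nr_nice0--;
		idlest->nr_nice0++;
		max_load_move -= W_NICE0;
	}
	while (busiest->nr_nice19 && max_load_move >= W_NICE19) {
		busiest->nr_nice19--;
		idlest->nr_nice19++;
		max_load_move -= W_NICE19;
	}
}

/* Scenario from the message: loads 2.0 and 2.2 before balancing. */
static int smpnice_example_is_balanced(void)
{
	struct toy_rq a = { .nr_nice0 = 2, .nr_nice19 = 0 };
	struct toy_rq b = { .nr_nice0 = 2, .nr_nice19 = 4 };

	balance(&b, &a);
	return weighted_load(&a) == 210 && weighted_load(&b) == 210;
}
```

With an imbalance of 0.1 (10 in scaled units) no nice==0 task qualifies, so two nice==19 tasks move and both run queues end at 2.1.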
2006-06-27 16:54:34 +07:00
|
|
|
set_load_weight(&init_task);
|
2006-07-30 17:03:52 +07:00
|
|
|
|
2007-07-26 18:40:43 +07:00
|
|
|
#ifdef CONFIG_PREEMPT_NOTIFIERS
|
|
|
|
INIT_HLIST_HEAD(&init_task.preempt_notifiers);
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* The boot idle thread does lazy MMU switching as well:
|
|
|
|
*/
|
|
|
|
atomic_inc(&init_mm.mm_count);
|
|
|
|
enter_lazy_tlb(&init_mm, current);
|
|
|
|
|
2014-12-29 13:41:43 +07:00
|
|
|
/*
|
|
|
|
* During early bootup we pretend to be a normal task:
|
|
|
|
*/
|
|
|
|
current->sched_class = &fair_sched_class;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Make us the idle thread. Technically, schedule() should not be
|
|
|
|
* called from this thread, however somewhere below it might be,
|
|
|
|
* but because we are the idle thread, we just pick up running again
|
|
|
|
* when this runqueue becomes "idle".
|
|
|
|
*/
|
|
|
|
init_idle(current, smp_processor_id());
|
2009-04-11 15:43:41 +07:00
|
|
|
|
|
|
|
calc_load_update = jiffies + LOAD_FREQ;
|
|
|
|
|
2008-11-25 06:27:51 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2011-04-07 19:09:58 +07:00
|
|
|
zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
|
2009-12-02 10:39:16 +07:00
|
|
|
/* May be allocated at isolcpus cmdline parse time */
|
|
|
|
if (cpu_isolated_map == NULL)
|
|
|
|
zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
|
2012-04-20 20:05:45 +07:00
|
|
|
idle_thread_set_boot_cpu();
|
2016-03-10 18:54:09 +07:00
|
|
|
set_cpu_rq_start_time(smp_processor_id());
|
2011-10-25 15:00:11 +07:00
|
|
|
#endif
|
|
|
|
init_sched_fair_class();
|
2008-11-24 23:05:04 +07:00
|
|
|
|
2016-06-08 02:43:16 +07:00
|
|
|
init_schedstats();
|
|
|
|
|
2008-02-13 20:02:36 +07:00
|
|
|
scheduler_running = 1;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-06-09 00:31:56 +07:00
|
|
|
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
|
2009-07-16 20:44:29 +07:00
|
|
|
static inline int preempt_count_equals(int preempt_offset)
|
|
|
|
{
|
2015-09-28 23:11:45 +07:00
|
|
|
int nested = preempt_count() + rcu_preempt_depth();
|
2009-07-16 20:44:29 +07:00
|
|
|
|
2011-01-26 04:52:22 +07:00
|
|
|
return (nested == preempt_offset);
|
2009-07-16 20:44:29 +07:00
|
|
|
}
|
|
|
|
|
2009-12-23 17:08:18 +07:00
|
|
|
void __might_sleep(const char *file, int line, int preempt_offset)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-09-24 15:18:55 +07:00
|
|
|
/*
|
|
|
|
* Blocking primitives will set (and therefore destroy) current->state,
|
|
|
|
* since we will exit with TASK_RUNNING make sure we enter with it,
|
|
|
|
* otherwise we will destroy state.
|
|
|
|
*/
|
sched: don't cause task state changes in nested sleep debugging
Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop to not actually sleep when it
calls schedule.
However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).
And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.
In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont. But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.
This fixes both cases:
- don't set TASK_RUNNING when the nested condition happens (note: even
if WARN_ONCE() only _warns_ once, the return value isn't whether the
warning happened, but whether the condition for the warning was true.
So despite the warning only happening once, the "if (WARN_ON(..))"
would trigger for every nested sleep).
- in the cases where we knowingly disable the warning by using
"sched_annotate_sleep()", don't change the task state (that is used
for all core scheduling decisions), instead use '->task_state_change'
that is used for the debugging decision itself.
(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)
Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>,
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
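The point about the WARN_ONCE() return value can be illustrated in userspace. This is a simplified stand-in: the real WARN_ONCE() is a macro with one flag per call site, whereas the hypothetical toy_warn_once() below is a function with a single shared flag.

```c
#include <stdbool.h>
#include <stdio.h>

static unsigned int warnings_printed;

/* Returns the condition itself; prints at most once across all calls. */
static bool toy_warn_once(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Count how often an "if (warn(...))" branch runs over n nested sleeps. */
static unsigned int branch_taken_count(unsigned int n)
{
	unsigned int taken = 0, i;

	for (i = 0; i < n; i++)
		if (toy_warn_once(true, "nested sleep"))
			taken++;
	return taken;
}
```

The branch is taken on every iteration even though the message is printed only once, which is why code guarded by the warning's return value ran for every nested sleep.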
2015-02-02 03:23:32 +07:00
|
|
|
WARN_ONCE(current->state != TASK_RUNNING && current->task_state_change,
|
2014-09-24 15:18:55 +07:00
|
|
|
"do not call blocking ops when !TASK_RUNNING; "
|
|
|
|
"state=%lx set at [<%p>] %pS\n",
|
|
|
|
current->state,
|
|
|
|
(void *)current->task_state_change,
|
2015-02-02 03:23:32 +07:00
|
|
|
(void *)current->task_state_change);
|
2014-09-24 15:18:55 +07:00
|
|
|
|
2014-09-24 15:18:56 +07:00
|
|
|
___might_sleep(file, line, preempt_offset);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__might_sleep);
|
|
|
|
|
|
|
|
void ___might_sleep(const char *file, int line, int preempt_offset)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
static unsigned long prev_jiffy; /* ratelimiting */
|
|
|
|
|
2011-05-24 22:31:09 +07:00
|
|
|
rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
|
2014-02-08 02:58:38 +07:00
|
|
|
if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
|
|
|
|
!is_idle_task(current)) ||
|
2009-07-16 20:44:29 +07:00
|
|
|
system_state != SYSTEM_RUNNING || oops_in_progress)
|
2008-08-28 16:34:43 +07:00
|
|
|
return;
|
|
|
|
if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
|
|
|
|
return;
|
|
|
|
prev_jiffy = jiffies;
|
|
|
|
|
2009-12-20 20:23:57 +07:00
|
|
|
printk(KERN_ERR
|
|
|
|
"BUG: sleeping function called from invalid context at %s:%d\n",
|
|
|
|
file, line);
|
|
|
|
printk(KERN_ERR
|
|
|
|
"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
|
|
|
|
in_atomic(), irqs_disabled(),
|
|
|
|
current->pid, current->comm);
|
2008-08-28 16:34:43 +07:00
|
|
|
|
2014-12-17 05:25:28 +07:00
|
|
|
if (task_stack_end_corrupted(current))
|
|
|
|
printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
|
|
|
|
|
2008-08-28 16:34:43 +07:00
|
|
|
debug_show_held_locks(current);
|
|
|
|
if (irqs_disabled())
|
|
|
|
print_irqtrace_events(current);
|
2014-02-08 02:58:39 +07:00
|
|
|
#ifdef CONFIG_DEBUG_PREEMPT
|
|
|
|
if (!preempt_count_equals(preempt_offset)) {
|
|
|
|
pr_err("Preemption disabled at:");
|
|
|
|
print_ip_sym(current->preempt_disable_ip);
|
|
|
|
pr_cont("\n");
|
|
|
|
}
|
|
|
|
#endif
|
2008-08-28 16:34:43 +07:00
|
|
|
dump_stack();
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2014-09-24 15:18:56 +07:00
|
|
|
EXPORT_SYMBOL(___might_sleep);
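The rate limiting in ___might_sleep() relies on a wraparound-safe tick comparison. A userspace sketch of that check follows, assuming a tick rate of 100; TOY_HZ, toy_time_before() and should_report() are hypothetical names that mirror, but are not, the kernel's HZ and time_before().

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_HZ 100UL

/* Wraparound-safe "a is before b", the same idea as time_before(). */
static bool toy_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Allow at most one report per TOY_HZ ticks; *prev == 0 means "never". */
static bool should_report(unsigned long now, unsigned long *prev)
{
	if (toy_time_before(now, *prev + TOY_HZ) && *prev)
		return false;
	*prev = now;
	return true;
}

/* Feed a fixed tick sequence through the limiter and count the reports. */
static unsigned int example_reports(void)
{
	static const unsigned long ticks[] = { 1000, 1050, 1100, 1199, 1200 };
	unsigned long prev = 0;
	unsigned int reports = 0;
	size_t i;

	for (i = 0; i < sizeof(ticks) / sizeof(ticks[0]); i++)
		if (should_report(ticks[i], &prev))
			reports++;
	return reports;
}
```

Of the five calls, only those at ticks 1000, 1100 and 1200 produce a report.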
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef CONFIG_MAGIC_SYSRQ
|
2015-06-11 19:46:38 +07:00
|
|
|
void normalize_rt_tasks(void)
|
2007-10-15 22:00:15 +07:00
|
|
|
{
|
2015-06-11 19:46:38 +07:00
|
|
|
struct task_struct *g, *p;
|
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls according to their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:36 +07:00
|
|
|
struct sched_attr attr = {
|
|
|
|
.sched_policy = SCHED_NORMAL,
|
|
|
|
};
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-09-22 02:33:38 +07:00
|
|
|
read_lock(&tasklist_lock);
|
2014-08-14 02:19:53 +07:00
|
|
|
for_each_process_thread(g, p) {
|
2007-10-15 22:00:18 +07:00
|
|
|
/*
|
|
|
|
* Only normalize user tasks:
|
|
|
|
*/
|
2014-09-22 02:33:38 +07:00
|
|
|
if (p->flags & PF_KTHREAD)
|
2007-10-15 22:00:18 +07:00
|
|
|
continue;
|
|
|
|
|
2007-08-02 22:41:40 +07:00
|
|
|
p->se.exec_start = 0;
|
|
|
|
#ifdef CONFIG_SCHEDSTATS
|
2010-03-11 09:37:45 +07:00
|
|
|
p->se.statistics.wait_start = 0;
|
|
|
|
p->se.statistics.sleep_start = 0;
|
|
|
|
p->se.statistics.block_start = 0;
|
2007-08-02 22:41:40 +07:00
|
|
|
#endif
|
2007-07-09 23:51:59 +07:00
|
|
|
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
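The EDF selection rule described above (absolute deadline = activation time + relative deadline, smallest absolute deadline runs first) can be sketched as follows. This is an illustration only: struct toy_dl_task, absolute_deadline() and edf_pick() are hypothetical, the scan is linear rather than tree based, and clock wraparound and the CBS runtime accounting are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_dl_task {
	uint64_t activation;        /* time the current instance activated */
	uint64_t relative_deadline; /* must complete this long after that */
};

static uint64_t absolute_deadline(const struct toy_dl_task *t)
{
	return t->activation + t->relative_deadline;
}

/* EDF pick: index of the task with the smallest absolute deadline. */
static size_t edf_pick(const struct toy_dl_task *tasks, size_t n)
{
	size_t best = 0, i;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (absolute_deadline(&tasks[i]) <
		    absolute_deadline(&tasks[best]))
			best = i;
	return best;
}

/* Three tasks with absolute deadlines 100, 60 and 80. */
static size_t edf_example(void)
{
	const struct toy_dl_task tasks[] = {
		{ .activation = 0,  .relative_deadline = 100 },
		{ .activation = 10, .relative_deadline = 50 },
		{ .activation = 20, .relative_deadline = 60 },
	};

	return edf_pick(tasks, 3);
}
```

The second task is chosen although it activated later than the first, because its absolute deadline (60) is the earliest.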
2013-11-28 17:14:43 +07:00
|
|
|
if (!dl_task(p) && !rt_task(p)) {
|
2007-07-09 23:51:59 +07:00
|
|
|
/*
|
|
|
|
* Renice negative nice level userspace
|
|
|
|
* tasks back to 0:
|
|
|
|
*/
|
2014-09-22 02:33:38 +07:00
|
|
|
if (task_nice(p) < 0)
|
2007-07-09 23:51:59 +07:00
|
|
|
set_user_nice(p, 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
continue;
|
2007-07-09 23:51:59 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-06-11 19:46:38 +07:00
|
|
|
__sched_setscheduler(p, &attr, false, false);
|
2014-08-14 02:19:53 +07:00
|
|
|
}
|
2014-09-22 02:33:38 +07:00
|
|
|
read_unlock(&tasklist_lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_MAGIC_SYSRQ */
|
2005-09-12 21:59:21 +07:00
|
|
|
|
2010-05-21 09:04:21 +07:00
|
|
|
#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB)
|
2005-09-12 21:59:21 +07:00
|
|
|
/*
|
2010-05-21 09:04:21 +07:00
|
|
|
* These functions are only useful for the IA64 MCA handling, or kdb.
|
2005-09-12 21:59:21 +07:00
|
|
|
*
|
|
|
|
* They can only be called when the whole system has been
|
|
|
|
* stopped - every CPU needs to be quiescent, and no scheduling
|
|
|
|
* activity can take place. Using them for anything else would
|
|
|
|
* be a serious bug, and as a result, they aren't even visible
|
|
|
|
* under any other configuration.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/**
|
|
|
|
* curr_task - return the current task for a given cpu.
|
|
|
|
* @cpu: the processor in question.
|
|
|
|
*
|
|
|
|
* ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
|
2013-07-13 01:45:47 +07:00
|
|
|
*
|
|
|
|
* Return: The current task for @cpu.
|
2005-09-12 21:59:21 +07:00
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
struct task_struct *curr_task(int cpu)
|
2005-09-12 21:59:21 +07:00
|
|
|
{
|
|
|
|
return cpu_curr(cpu);
|
|
|
|
}
|
|
|
|
|
2010-05-21 09:04:21 +07:00
|
|
|
#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */
|
|
|
|
|
|
|
|
#ifdef CONFIG_IA64
|
2005-09-12 21:59:21 +07:00
|
|
|
/**
|
|
|
|
* set_curr_task - set the current task for a given cpu.
|
|
|
|
* @cpu: the processor in question.
|
|
|
|
* @p: the task pointer to set.
|
|
|
|
*
|
|
|
|
* Description: This function must only be used when non-maskable interrupts
|
2007-12-05 21:46:09 +07:00
|
|
|
* are serviced on a separate stack. It allows the architecture to switch the
|
|
|
|
* notion of the current task on a cpu in a non-blocking manner. This function
|
2005-09-12 21:59:21 +07:00
|
|
|
* must be called with all CPUs synchronized and interrupts disabled, and the
|
|
|
|
* caller must save the original value of the current task (see
|
|
|
|
* curr_task() above) and restore that value before reenabling interrupts and
|
|
|
|
* re-starting the system.
|
|
|
|
*
|
|
|
|
* ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
|
|
|
|
*/
|
2006-07-03 14:25:41 +07:00
|
|
|
void set_curr_task(int cpu, struct task_struct *p)
|
2005-09-12 21:59:21 +07:00
|
|
|
{
|
|
|
|
cpu_curr(cpu) = p;
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif
|
2007-10-15 22:00:07 +07:00
|
|
|
|
2010-01-20 19:26:18 +07:00
|
|
|
#ifdef CONFIG_CGROUP_SCHED
|
2011-10-25 15:00:11 +07:00
|
|
|
/* task_group_lock serializes the addition/removal of task groups */
|
|
|
|
static DEFINE_SPINLOCK(task_group_lock);
|
|
|
|
|
2016-03-16 22:22:45 +07:00
|
|
|
static void sched_free_group(struct task_group *tg)
|
2008-02-13 21:45:40 +07:00
|
|
|
{
|
|
|
|
free_fair_sched_group(tg);
|
|
|
|
free_rt_sched_group(tg);
|
2011-01-05 17:11:25 +07:00
|
|
|
autogroup_free(tg);
|
2015-12-03 01:41:49 +07:00
|
|
|
kmem_cache_free(task_group_cache, tg);
|
2008-02-13 21:45:40 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* allocate runqueue etc for a new task group */
|
2008-04-20 00:44:59 +07:00
|
|
|
struct task_group *sched_create_group(struct task_group *parent)
|
2008-02-13 21:45:40 +07:00
|
|
|
{
|
|
|
|
struct task_group *tg;
|
|
|
|
|
2015-12-03 01:41:49 +07:00
|
|
|
tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
|
2008-02-13 21:45:40 +07:00
|
|
|
if (!tg)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
2008-04-20 00:44:59 +07:00
|
|
|
if (!alloc_fair_sched_group(tg, parent))
|
2008-02-13 21:45:40 +07:00
|
|
|
goto err;
|
|
|
|
|
2008-04-20 00:44:59 +07:00
|
|
|
if (!alloc_rt_sched_group(tg, parent))
|
2008-02-13 21:45:40 +07:00
|
|
|
goto err;
|
|
|
|
|
sched: split out css_online/css_offline from tg creation/destruction
This is a preparation for later patches.
- What do we gain from cpu_cgroup_css_online():
After ss->css_alloc() and before ss->css_online(), there's a small
window in which tg->css.cgroup is NULL. With this change, tg won't be seen
before ss->css_online(), where it's added to the global list, so we're
guaranteed we'll never see a NULL tg->css.cgroup.
- What do we gain from cpu_cgroup_css_offline():
tg is freed via RCU, and so is cgroup. Without this change, this is how
synchronization works:
cgroup_rmdir()
  no ss->css_offline()
diput()
  synchronize_rcu()
  ss->css_free() <-- unregister tg, and free it via call_rcu()
  kfree_rcu(cgroup) <-- wait for possible refs to cgroup, and free cgroup
We can't just kfree(cgroup), because tg might access tg->css.cgroup.
With this change:
cgroup_rmdir()
  ss->css_offline() <-- unregister tg
diput()
  synchronize_rcu() <-- wait for possible refs to tg and cgroup
  ss->css_free() <-- free tg
  kfree_rcu(cgroup) <-- free cgroup
As you can see, kfree_rcu() is redundant now.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 13:30:48 +07:00
|
|
|
return tg;
|
|
|
|
|
|
|
|
err:
|
2016-03-16 22:22:45 +07:00
|
|
|
sched_free_group(tg);
|
2013-01-24 13:30:48 +07:00
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
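The error path in sched_create_group() calls sched_free_group() on a partially built group, which is only safe because the group was allocated zeroed. A minimal userspace sketch of that pattern follows; every name in it (sketch_group, sketch_create_group, the fail_rt switch) is hypothetical and stands in for the kernel helpers, and calloc()/free() replace the slab allocator.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct task_group with two sub-allocations. */
struct sketch_group {
	int *fair_state;	/* plays the role of the fair-class state */
	int *rt_state;		/* plays the role of the RT-class state */
};

/* Safe on a partially built group: free(NULL) is a no-op. */
static void sketch_free_group(struct sketch_group *g)
{
	if (!g)
		return;
	free(g->fair_state);
	free(g->rt_state);
	free(g);
}

/* fail_rt != 0 simulates a failure of the second allocation step. */
static struct sketch_group *sketch_create_group(int fail_rt)
{
	/* calloc() zeroes the group, like GFP_KERNEL | __GFP_ZERO above. */
	struct sketch_group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;

	g->fair_state = malloc(sizeof(*g->fair_state));
	if (!g->fair_state)
		goto err;

	if (fail_rt)
		goto err;
	g->rt_state = malloc(sizeof(*g->rt_state));
	if (!g->rt_state)
		goto err;

	return g;

err:
	/* One cleanup routine handles every partial state. */
	sketch_free_group(g);
	return NULL;
}
```

Because unset members are still NULL when the error label is reached, a single free routine can serve both the normal teardown and the failure path.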
|
|
|
|
|
|
|
|
void sched_online_group(struct task_group *tg, struct task_group *parent)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
2008-02-13 21:45:39 +07:00
|
|
|
spin_lock_irqsave(&task_group_lock, flags);
|
2008-01-26 03:08:30 +07:00
|
|
|
list_add_rcu(&tg->list, &task_groups);
|
2008-04-20 00:45:00 +07:00
|
|
|
|
|
|
|
WARN_ON(!parent); /* root should already exist */
|
|
|
|
|
|
|
|
tg->parent = parent;
|
|
|
|
INIT_LIST_HEAD(&tg->children);
|
2008-08-14 14:56:40 +07:00
|
|
|
list_add_rcu(&tg->siblings, &parent->children);
|
2008-02-13 21:45:39 +07:00
|
|
|
spin_unlock_irqrestore(&task_group_lock, flags);
|
2016-06-22 19:58:02 +07:00
|
|
|
|
|
|
|
online_fair_sched_group(tg);
|
2007-10-15 22:00:07 +07:00
|
|
|
}
|
|
|
|
|
2007-10-15 22:00:09 +07:00
|
|
|
/* rcu callback to free various structures associated with a task group */
|
2016-03-16 22:22:45 +07:00
|
|
|
static void sched_free_group_rcu(struct rcu_head *rhp)
|
2007-10-15 22:00:07 +07:00
|
|
|
{
|
|
|
|
/* now it should be safe to free those cfs_rqs */
|
2016-03-16 22:22:45 +07:00
|
|
|
sched_free_group(container_of(rhp, struct task_group, rcu));
|
2007-10-15 22:00:07 +07:00
|
|
|
}
|
|
|
|
|
2007-10-15 22:00:14 +07:00
|
|
|
void sched_destroy_group(struct task_group *tg)
|
2013-01-24 13:30:48 +07:00
|
|
|
{
|
|
|
|
/* wait for possible concurrent references to cfs_rqs to complete */
|
2016-03-16 22:22:45 +07:00
|
|
|
call_rcu(&tg->rcu, sched_free_group_rcu);
|
2013-01-24 13:30:48 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
void sched_offline_group(struct task_group *tg)
|
2007-10-15 22:00:07 +07:00
|
|
|
{
|
2008-02-13 21:45:39 +07:00
|
|
|
unsigned long flags;
|
2007-10-15 22:00:07 +07:00
|
|
|
|
2010-11-16 06:47:01 +07:00
|
|
|
/* end participation in shares distribution */
|
2016-01-22 04:24:16 +07:00
|
|
|
unregister_fair_sched_group(tg);
|
2010-11-16 06:47:01 +07:00
|
|
|
|
|
|
|
spin_lock_irqsave(&task_group_lock, flags);
|
2008-01-26 03:08:30 +07:00
|
|
|
list_del_rcu(&tg->list);
|
2008-04-20 00:45:00 +07:00
|
|
|
list_del_rcu(&tg->siblings);
|
2008-02-13 21:45:39 +07:00
|
|
|
spin_unlock_irqrestore(&task_group_lock, flags);
|
2007-10-15 22:00:07 +07:00
|
|
|
}
|
|
|
|
|
2016-06-17 18:38:55 +07:00
|
|
|
static void sched_change_group(struct task_struct *tsk, int type)
|
2007-10-15 22:00:07 +07:00
|
|
|
{
|
2012-06-22 18:36:05 +07:00
|
|
|
struct task_group *tg;
|
2007-10-15 22:00:07 +07:00
|
|
|
|
2014-10-28 12:24:34 +07:00
|
|
|
/*
|
|
|
|
* All callers are synchronized by task_rq_lock(); RCU would be pointless
|
|
|
|
* here, so we do not use it. Thus, we pass "true" to task_css_check()
|
|
|
|
* to prevent lockdep warnings.
|
|
|
|
*/
|
|
|
|
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
|
2012-06-22 18:36:05 +07:00
|
|
|
struct task_group, css);
|
|
|
|
tg = autogroup_task_group(tsk, tg);
|
|
|
|
tsk->sched_task_group = tg;
|
|
|
|
|
2008-03-01 03:21:01 +07:00
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
2016-06-17 18:38:55 +07:00
|
|
|
if (tsk->sched_class->task_change_group)
|
|
|
|
tsk->sched_class->task_change_group(tsk, type);
|
2010-10-15 20:24:15 +07:00
|
|
|
else
|
2008-03-01 03:21:01 +07:00
|
|
|
#endif
|
2010-10-15 20:24:15 +07:00
|
|
|
set_task_rq(tsk, task_cpu(tsk));
|
2016-06-17 18:38:55 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Change task's runqueue when it moves between groups.
|
|
|
|
*
|
|
|
|
* The caller of this function should have put the task in its new group by
|
|
|
|
* now. This function just updates tsk->se.cfs_rq and tsk->se.parent to reflect
|
|
|
|
* its new group.
|
|
|
|
*/
|
|
|
|
void sched_move_task(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
int queued, running;
|
|
|
|
struct rq_flags rf;
|
|
|
|
struct rq *rq;
|
|
|
|
|
|
|
|
rq = task_rq_lock(tsk, &rf);
|
|
|
|
|
|
|
|
running = task_current(rq, tsk);
|
|
|
|
queued = task_on_rq_queued(tsk);
|
|
|
|
|
|
|
|
if (queued)
|
|
|
|
dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE);
|
|
|
|
if (unlikely(running))
|
|
|
|
put_prev_task(rq, tsk);
|
|
|
|
|
|
|
|
sched_change_group(tsk, TASK_MOVE_GROUP);
|
2008-03-01 03:21:01 +07:00
|
|
|
|
2008-03-11 01:01:20 +07:00
|
|
|
if (unlikely(running))
|
|
|
|
tsk->sched_class->set_curr_task(rq);
|
2014-08-20 16:47:32 +07:00
|
|
|
if (queued)
|
2016-01-18 21:27:07 +07:00
|
|
|
enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE);
|
2007-10-15 22:00:07 +07:00
|
|
|
|
2015-08-01 02:28:18 +07:00
|
|
|
task_rq_unlock(rq, tsk, &rf);
|
2007-10-15 22:00:07 +07:00
|
|
|
}
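The dequeue, change group, re-enqueue sequence in sched_move_task() keeps per-group accounting consistent: the task leaves its old queue before its group pointer changes and joins the new one afterwards. A toy sketch of that ordering, assuming a hypothetical per-group counter in place of the real runqueues (sketch_grp, sketch_task and sketch_move_task are illustrative names, not kernel API):

```c
/* Hypothetical group with a count of queued tasks. */
struct sketch_grp {
	int nr_queued;
};

/* Hypothetical task that may or may not be queued in its group. */
struct sketch_task {
	struct sketch_grp *grp;
	int queued;
};

static void sketch_move_task(struct sketch_task *t, struct sketch_grp *to)
{
	int queued = t->queued;		/* remember state, like "queued" above */

	if (queued) {			/* leave the old group's queue first */
		t->grp->nr_queued--;
		t->queued = 0;
	}

	t->grp = to;			/* the actual group change */

	if (queued) {			/* restore the state in the new group */
		t->grp->nr_queued++;
		t->queued = 1;
	}
}
```

Changing the group pointer while the task is still counted in the old group would leave the old counter permanently too high, which is the kind of inconsistency the save/restore bracket avoids.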
|
2010-01-20 19:26:18 +07:00
|
|
|
#endif /* CONFIG_CGROUP_SCHED */
|
2007-10-15 22:00:07 +07:00
|
|
|
|
2011-07-21 23:43:29 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
|
|
|
/*
|
|
|
|
* Ensure that the real time constraints are schedulable.
|
|
|
|
*/
|
|
|
|
static DEFINE_MUTEX(rt_constraints_mutex);
|
2008-02-13 21:45:39 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
/* Must be called with tasklist_lock held */
|
|
|
|
static inline int tg_has_rt_tasks(struct task_group *tg)
|
2008-04-20 00:45:00 +07:00
|
|
|
{
|
2008-08-19 17:33:06 +07:00
|
|
|
struct task_struct *g, *p;
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2015-02-09 17:53:18 +07:00
|
|
|
/*
|
|
|
|
* Autogroups do not have RT tasks; see autogroup_create().
|
|
|
|
*/
|
|
|
|
if (task_group_is_autogroup(tg))
|
|
|
|
return 0;
|
|
|
|
|
2014-08-14 02:19:53 +07:00
|
|
|
for_each_process_thread(g, p) {
|
2014-09-22 02:33:36 +07:00
|
|
|
if (rt_task(p) && task_group(p) == tg)
|
2008-08-19 17:33:06 +07:00
|
|
|
return 1;
|
2014-08-14 02:19:53 +07:00
|
|
|
}
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
struct rt_schedulable_data {
|
|
|
|
struct task_group *tg;
|
|
|
|
u64 rt_period;
|
|
|
|
u64 rt_runtime;
|
|
|
|
};
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2011-07-21 23:43:29 +07:00
|
|
|
static int tg_rt_schedulable(struct task_group *tg, void *data)
|
2008-08-19 17:33:06 +07:00
|
|
|
{
|
|
|
|
struct rt_schedulable_data *d = data;
|
|
|
|
struct task_group *child;
|
|
|
|
unsigned long total, sum = 0;
|
|
|
|
u64 period, runtime;
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
period = ktime_to_ns(tg->rt_bandwidth.rt_period);
|
|
|
|
runtime = tg->rt_bandwidth.rt_runtime;
|
2008-04-20 00:45:00 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
if (tg == d->tg) {
|
|
|
|
period = d->rt_period;
|
|
|
|
runtime = d->rt_runtime;
|
2008-04-20 00:45:00 +07:00
|
|
|
}
|
|
|
|
|
2008-09-23 20:33:44 +07:00
|
|
|
/*
|
|
|
|
* Cannot have more runtime than the period.
|
|
|
|
*/
|
|
|
|
if (runtime > period && runtime != RUNTIME_INF)
|
|
|
|
return -EINVAL;
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2008-09-23 20:33:44 +07:00
|
|
|
/*
|
|
|
|
* Ensure we don't starve existing RT tasks.
|
|
|
|
*/
|
2008-08-19 17:33:06 +07:00
|
|
|
if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
|
|
|
|
return -EBUSY;
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
total = to_ratio(period, runtime);
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2008-09-23 20:33:44 +07:00
|
|
|
/*
|
|
|
|
* Nobody can have more than the global setting allows.
|
|
|
|
*/
|
|
|
|
if (total > to_ratio(global_rt_period(), global_rt_runtime()))
|
|
|
|
return -EINVAL;
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2008-09-23 20:33:44 +07:00
|
|
|
/*
|
|
|
|
* The sum of our children's runtime should not exceed our own.
|
|
|
|
*/
|
2008-08-19 17:33:06 +07:00
|
|
|
list_for_each_entry_rcu(child, &tg->children, siblings) {
|
|
|
|
period = ktime_to_ns(child->rt_bandwidth.rt_period);
|
|
|
|
runtime = child->rt_bandwidth.rt_runtime;
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
if (child == d->tg) {
|
|
|
|
period = d->rt_period;
|
|
|
|
runtime = d->rt_runtime;
|
|
|
|
}
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
sum += to_ratio(period, runtime);
|
2008-02-13 21:45:39 +07:00
|
|
|
}
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
if (sum > total)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return 0;
|
2008-01-26 03:08:30 +07:00
|
|
|
}
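The last check in tg_rt_schedulable() compares the summed bandwidth ratios of the children against the parent's ratio. The sketch below reproduces that comparison in userspace. It assumes that to_ratio() is a fixed-point division with a 20-bit shift and that RUNTIME_INF is an all-ones u64; both are assumptions about definitions not shown in this section, and all sketch_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BW_SHIFT		20			/* assumed fixed-point shift */
#define SKETCH_RUNTIME_INF	((uint64_t)~0ULL)	/* assumed "unlimited" marker */

struct sketch_bw {
	uint64_t period_ns;
	uint64_t runtime_ns;
};

/* runtime / period as a fixed-point fraction of 1 << SKETCH_BW_SHIFT. */
static uint64_t sketch_to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
	if (runtime_ns == SKETCH_RUNTIME_INF)
		return 1ULL << SKETCH_BW_SHIFT;
	if (period_ns == 0)
		return 0;
	return (runtime_ns << SKETCH_BW_SHIFT) / period_ns;
}

/* Returns 0 if the children fit inside the parent, -1 otherwise. */
static int sketch_children_fit(struct sketch_bw parent,
			       const struct sketch_bw *children, size_t n)
{
	uint64_t total = sketch_to_ratio(parent.period_ns, parent.runtime_ns);
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += sketch_to_ratio(children[i].period_ns,
				       children[i].runtime_ns);

	return sum > total ? -1 : 0;
}
```

With a parent at 950 ms per 1 s, two children at 500 ms and 400 ms per 1 s fit, while two children at 500 ms each do not.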
|
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
|
2008-02-28 16:51:56 +07:00
|
|
|
{
|
2011-07-21 23:43:35 +07:00
|
|
|
int ret;
|
|
|
|
|
2008-08-19 17:33:06 +07:00
|
|
|
struct rt_schedulable_data data = {
|
|
|
|
.tg = tg,
|
|
|
|
.rt_period = period,
|
|
|
|
.rt_runtime = runtime,
|
|
|
|
};
|
|
|
|
|
2011-07-21 23:43:35 +07:00
|
|
|
rcu_read_lock();
|
|
|
|
ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
2008-02-28 16:51:56 +07:00
|
|
|
}
|
|
|
|
|
2011-07-21 23:43:28 +07:00
|
|
|
static int tg_set_rt_bandwidth(struct task_group *tg,
|
2008-04-20 00:44:57 +07:00
|
|
|
u64 rt_period, u64 rt_runtime)
|
2008-01-26 03:08:30 +07:00
|
|
|
{
|
2008-04-20 00:44:58 +07:00
|
|
|
int i, err = 0;
|
2008-02-13 21:45:39 +07:00
|
|
|
|
2015-02-09 18:23:20 +07:00
|
|
|
/*
|
|
|
|
* Disallowing RT runtime for the root group is BAD: it would prevent the
|
|
|
|
* kernel from creating (and/or operating) RT threads.
|
|
|
|
*/
|
|
|
|
if (tg == &root_task_group && rt_runtime == 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* No period doesn't make any sense. */
|
|
|
|
if (rt_period == 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2008-02-13 21:45:39 +07:00
|
|
|
mutex_lock(&rt_constraints_mutex);
|
2008-02-28 16:51:56 +07:00
|
|
|
read_lock(&tasklist_lock);
|
2008-08-19 17:33:06 +07:00
|
|
|
err = __rt_schedulable(tg, rt_period, rt_runtime);
|
|
|
|
if (err)
|
2008-02-13 21:45:39 +07:00
|
|
|
goto unlock;
|
2008-04-20 00:44:58 +07:00
|
|
|
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
|
2008-04-20 00:44:57 +07:00
|
|
|
tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
|
|
|
|
tg->rt_bandwidth.rt_runtime = rt_runtime;
|
2008-04-20 00:44:58 +07:00
|
|
|
|
|
|
|
for_each_possible_cpu(i) {
|
|
|
|
struct rt_rq *rt_rq = tg->rt_rq[i];
|
|
|
|
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_lock(&rt_rq->rt_runtime_lock);
|
2008-04-20 00:44:58 +07:00
|
|
|
rt_rq->rt_runtime = rt_runtime;
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_unlock(&rt_rq->rt_runtime_lock);
|
2008-04-20 00:44:58 +07:00
|
|
|
}
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
|
2010-10-18 02:46:10 +07:00
|
|
|
unlock:
|
2008-02-28 16:51:56 +07:00
|
|
|
read_unlock(&tasklist_lock);
|
2008-02-13 21:45:39 +07:00
|
|
|
mutex_unlock(&rt_constraints_mutex);
|
|
|
|
|
|
|
|
return err;
|
2008-01-26 03:08:30 +07:00
|
|
|
}
|
|
|
|
|
2013-03-05 15:07:33 +07:00
|
|
|
static int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
|
2008-04-20 00:44:57 +07:00
|
|
|
{
|
|
|
|
u64 rt_runtime, rt_period;
|
|
|
|
|
|
|
|
rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
|
|
|
|
rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
|
|
|
|
if (rt_runtime_us < 0)
|
|
|
|
rt_runtime = RUNTIME_INF;
|
|
|
|
|
2011-07-21 23:43:28 +07:00
|
|
|
return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
|
2008-04-20 00:44:57 +07:00
|
|
|
}
|
|
|
|
|
2013-03-05 15:07:33 +07:00
|
|
|
static long sched_group_rt_runtime(struct task_group *tg)
|
2008-02-13 21:45:39 +07:00
|
|
|
{
|
|
|
|
u64 rt_runtime_us;
|
|
|
|
|
2008-04-20 00:44:57 +07:00
|
|
|
if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
|
2008-02-13 21:45:39 +07:00
|
|
|
return -1;
|
|
|
|
|
2008-04-20 00:44:57 +07:00
|
|
|
rt_runtime_us = tg->rt_bandwidth.rt_runtime;
|
2008-02-13 21:45:39 +07:00
|
|
|
do_div(rt_runtime_us, NSEC_PER_USEC);
|
|
|
|
return rt_runtime_us;
|
|
|
|
}
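sched_group_set_rt_runtime() and sched_group_rt_runtime() convert between the microsecond value seen by userspace and the nanosecond value stored in the group, with a negative input standing for "unlimited". A small round-trip sketch follows, assuming RUNTIME_INF is an all-ones u64 and NSEC_PER_USEC is 1000; the sketch_* names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_USEC	1000ULL
#define SKETCH_RUNTIME_INF	((uint64_t)~0ULL)	/* assumed "unlimited" marker */

/* Userspace value (microseconds, negative = unlimited) to stored value. */
static uint64_t sketch_runtime_us_to_ns(long runtime_us)
{
	if (runtime_us < 0)
		return SKETCH_RUNTIME_INF;
	return (uint64_t)runtime_us * SKETCH_NSEC_PER_USEC;
}

/* Stored value back to the userspace representation. */
static long sketch_runtime_ns_to_us(uint64_t runtime_ns)
{
	if (runtime_ns == SKETCH_RUNTIME_INF)
		return -1;
	return (long)(runtime_ns / SKETCH_NSEC_PER_USEC);
}
```

Any negative input is stored as the unlimited marker and reads back as -1, so the round trip only preserves the exact value for non-negative inputs.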
|
2008-04-20 00:44:57 +07:00
|
|
|
|
2015-05-03 15:51:56 +07:00
|
|
|
static int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us)
|
2008-04-20 00:44:57 +07:00
|
|
|
{
|
|
|
|
u64 rt_runtime, rt_period;
|
|
|
|
|
2015-05-03 15:51:56 +07:00
|
|
|
rt_period = rt_period_us * NSEC_PER_USEC;
|
2008-04-20 00:44:57 +07:00
|
|
|
rt_runtime = tg->rt_bandwidth.rt_runtime;
|
|
|
|
|
2011-07-21 23:43:28 +07:00
|
|
|
return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
|
2008-04-20 00:44:57 +07:00
|
|
|
}
|
|
|
|
|
2013-03-05 15:07:33 +07:00
|
|
|
static long sched_group_rt_period(struct task_group *tg)
|
2008-04-20 00:44:57 +07:00
|
|
|
{
|
|
|
|
u64 rt_period_us;
|
|
|
|
|
|
|
|
rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
|
|
|
|
do_div(rt_period_us, NSEC_PER_USEC);
|
|
|
|
return rt_period_us;
|
|
|
|
}
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
For deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Ever since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher-level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the total bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
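The admission rule in the commit message above, M * (sched_dl_runtime_us / sched_dl_period_us), can be sketched as a fixed-point comparison. The 20-bit shift, the function names, and the choice to admit a task that lands exactly on the cap are assumptions made for illustration; the commit message itself only says the sum must stay below the bound.

```c
#include <stdint.h>

#define SKETCH_BW_SHIFT	20	/* assumed fixed-point shift */

/* runtime / period as a fixed-point fraction of 1 << SKETCH_BW_SHIFT. */
static uint64_t sketch_bw(uint64_t period_us, uint64_t runtime_us)
{
	if (period_us == 0)
		return 0;
	return (runtime_us << SKETCH_BW_SHIFT) / period_us;
}

/*
 * Returns 1 if a task with bandwidth new_bw may be admitted to a
 * root domain of "cpus" CPUs that already has used_bw allocated,
 * given a per-CPU cap of cap_bw; returns 0 otherwise.
 */
static int sketch_dl_admit(unsigned int cpus, uint64_t cap_bw,
			   uint64_t used_bw, uint64_t new_bw)
{
	return used_bw + new_bw <= (uint64_t)cpus * cap_bw;
}
```

For example, with a 950000/1000000 cap on 4 CPUs, a task asking for 10 ms every 100 ms is admitted into an empty domain and rejected once the domain is already at its cap.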
|
|
|
#endif /* CONFIG_RT_GROUP_SCHED */
|
2008-04-20 00:44:57 +07:00
|
|
|
|
2013-11-07 20:43:45 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2008-04-20 00:44:57 +07:00
|
|
|
static int sched_rt_global_constraints(void)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
mutex_lock(&rt_constraints_mutex);
|
2008-08-19 17:33:06 +07:00
|
|
|
read_lock(&tasklist_lock);
|
2008-09-23 20:33:44 +07:00
|
|
|
ret = __rt_schedulable(NULL, 0, 0);
|
2008-08-19 17:33:06 +07:00
|
|
|
read_unlock(&tasklist_lock);
|
2008-04-20 00:44:57 +07:00
|
|
|
mutex_unlock(&rt_constraints_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2009-02-27 16:43:54 +07:00
|
|
|
|
2013-03-05 15:07:33 +07:00
|
|
|
static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
|
2009-02-27 16:43:54 +07:00
|
|
|
{
|
|
|
|
/* Don't accept realtime tasks when there is no way for them to run */
|
|
|
|
if (rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2008-05-30 19:23:45 +07:00
|
|
|
#else /* !CONFIG_RT_GROUP_SCHED */
|
2008-04-20 00:44:57 +07:00
|
|
|
static int sched_rt_global_constraints(void)
|
|
|
|
{
|
2008-04-20 00:44:58 +07:00
|
|
|
unsigned long flags;
|
2016-05-05 16:51:19 +07:00
|
|
|
int i;
|
2008-09-11 07:00:19 +07:00
|
|
|
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
|
2008-04-20 00:44:58 +07:00
|
|
|
for_each_possible_cpu(i) {
|
|
|
|
struct rt_rq *rt_rq = &cpu_rq(i)->rt;
|
|
|
|
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_lock(&rt_rq->rt_runtime_lock);
|
2008-04-20 00:44:58 +07:00
|
|
|
rt_rq->rt_runtime = global_rt_runtime();
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_unlock(&rt_rq->rt_runtime_lock);
|
2008-04-20 00:44:58 +07:00
|
|
|
}
|
2009-11-17 21:32:06 +07:00
|
|
|
raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
|
2008-04-20 00:44:58 +07:00
|
|
|
|
2016-05-05 16:51:19 +07:00
|
|
|
return 0;
|
2008-04-20 00:44:57 +07:00
|
|
|
}
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_RT_GROUP_SCHED */
|
2008-04-20 00:44:57 +07:00
|
|
|
|
2015-03-17 18:15:31 +07:00
|
|
|
static int sched_dl_global_validate(void)
|
2013-11-07 20:43:45 +07:00
|
|
|
{
|
2013-12-17 18:44:49 +07:00
|
|
|
u64 runtime = global_rt_runtime();
|
|
|
|
u64 period = global_rt_period();
|
2013-11-07 20:43:45 +07:00
|
|
|
u64 new_bw = to_ratio(period, runtime);
|
2014-09-30 15:23:37 +07:00
|
|
|
struct dl_bw *dl_b;
|
2013-12-17 18:44:49 +07:00
|
|
|
int cpu, ret = 0;
|
2014-02-11 15:24:27 +07:00
|
|
|
unsigned long flags;
|
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
For deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control", and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system-wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls with similar names, equivalent meaning, and
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded in each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher-level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system-wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-07 20:43:45 +07:00
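The admission rule stated in the commit message above (total -deadline bandwidth in a root_domain of M CPUs bounded by M * runtime / period) can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel's code: the names `bw_ratio`, `dl_admit`, and `BW_SHIFT` are hypothetical, the 20-bit fixed-point scale is assumed, and treating a total exactly equal to the cap as admissible is a choice made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed number of fraction bits for the fixed-point bandwidth value. */
#define BW_SHIFT 20

/* runtime / period as a fixed-point fraction; 0 when period is 0. */
static uint64_t bw_ratio(uint64_t period_us, uint64_t runtime_us)
{
	if (period_us == 0)
		return 0;
	return (runtime_us << BW_SHIFT) / period_us;
}

/*
 * Hypothetical admission test: a new task with bandwidth new_bw is
 * accepted if the bandwidth already allocated in the domain plus
 * new_bw does not exceed the per-CPU cap times the number of CPUs.
 */
static bool dl_admit(uint64_t cap_bw, unsigned int cpus,
		     uint64_t total_bw, uint64_t new_bw)
{
	return total_bw + new_bw <= cap_bw * cpus;
}
```

With a cap of 950000 us over 1000000 us and tasks that each need 10000 us every 100000 us, nine such tasks fit on one CPU under this sketch and a tenth is refused; on two CPUs the tenth fits.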
|
|
|
|
|
|
|
/*
|
|
|
|
* Here we want to check the bandwidth not being set to some
|
|
|
|
* value smaller than the currently allocated bandwidth in
|
|
|
|
* any of the root_domains.
|
|
|
|
*
|
|
|
|
* FIXME: Cycling on all the CPUs is overdoing, but simpler than
|
|
|
|
* cycling on root_domains... Discussion on different/better
|
|
|
|
* solutions is welcome!
|
|
|
|
*/
|
2013-12-17 18:44:49 +07:00
|
|
|
for_each_possible_cpu(cpu) {
|
2014-09-30 15:23:37 +07:00
|
|
|
rcu_read_lock_sched();
|
|
|
|
dl_b = dl_bw_of(cpu);
|
2013-11-07 20:43:45 +07:00
|
|
|
|
2014-02-11 15:24:27 +07:00
|
|
|
raw_spin_lock_irqsave(&dl_b->lock, flags);
|
2013-12-17 18:44:49 +07:00
|
|
|
if (new_bw < dl_b->total_bw)
|
|
|
|
ret = -EBUSY;
|
2014-02-11 15:24:27 +07:00
|
|
|
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
|
2013-12-17 18:44:49 +07:00
|
|
|
|
2014-09-30 15:23:37 +07:00
|
|
|
rcu_read_unlock_sched();
|
|
|
|
|
2013-12-17 18:44:49 +07:00
|
|
|
if (ret)
|
|
|
|
break;
|
2013-11-07 20:43:45 +07:00
|
|
|
}
|
|
|
|
|
2013-12-17 18:44:49 +07:00
|
|
|
return ret;
|
2013-11-07 20:43:45 +07:00
|
|
|
}
|
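The check performed by the validation loop above can be restated as a small stand-alone function. This is a sketch only: `struct dl_bw_sketch` and `dl_validate_sketch` are hypothetical names, the domains are passed as a plain array instead of being looked up per CPU, and the locking and RCU protection of the original are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-root-domain bookkeeping. */
struct dl_bw_sketch {
	uint64_t bw;        /* configured cap, as a fixed-point fraction */
	uint64_t total_bw;  /* bandwidth already handed out to tasks */
};

/*
 * Return 0 if new_bw may become the cap in every domain, or -EBUSY if
 * some domain has already allocated more than new_bw.
 */
static int dl_validate_sketch(const struct dl_bw_sketch *doms, size_t n,
			      uint64_t new_bw)
{
	for (size_t i = 0; i < n; i++) {
		if (new_bw < doms[i].total_bw)
			return -EBUSY;
	}
	return 0;
}
```

Lowering the cap is therefore refused as soon as one domain is already using more than the proposed value; a cap equal to the largest allocation is still accepted.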
|
|
|
|
2013-12-17 18:44:49 +07:00
|
|
|
static void sched_dl_do_global(void)
|
2013-02-07 22:47:04 +07:00
|
|
|
{
|
2013-12-17 18:44:49 +07:00
|
|
|
u64 new_bw = -1;
|
2014-09-30 15:23:37 +07:00
|
|
|
struct dl_bw *dl_b;
|
2013-12-17 18:44:49 +07:00
|
|
|
int cpu;
|
2014-02-11 15:24:27 +07:00
|
|
|
unsigned long flags;
|
2013-02-07 22:47:04 +07:00
|
|
|
|
2013-12-17 18:44:49 +07:00
|
|
|
def_dl_bandwidth.dl_period = global_rt_period();
|
|
|
|
def_dl_bandwidth.dl_runtime = global_rt_runtime();
|
|
|
|
|
|
|
|
if (global_rt_runtime() != RUNTIME_INF)
|
|
|
|
new_bw = to_ratio(global_rt_period(), global_rt_runtime());
|
|
|
|
|
|
|
|
/*
|
|
|
|
* FIXME: As above...
|
|
|
|
*/
|
|
|
|
for_each_possible_cpu(cpu) {
|
2014-09-30 15:23:37 +07:00
|
|
|
rcu_read_lock_sched();
|
|
|
|
dl_b = dl_bw_of(cpu);
|
2013-12-17 18:44:49 +07:00
|
|
|
|
2014-02-11 15:24:27 +07:00
|
|
|
raw_spin_lock_irqsave(&dl_b->lock, flags);
|
2013-12-17 18:44:49 +07:00
|
|
|
dl_b->bw = new_bw;
|
2014-02-11 15:24:27 +07:00
|
|
|
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
|
2014-09-30 15:23:37 +07:00
|
|
|
|
|
|
|
rcu_read_unlock_sched();
|
2013-02-07 22:47:04 +07:00
|
|
|
}
|
2013-12-17 18:44:49 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static int sched_rt_global_validate(void)
|
|
|
|
{
|
|
|
|
if (sysctl_sched_rt_period <= 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-02-11 15:24:26 +07:00
|
|
|
if ((sysctl_sched_rt_runtime != RUNTIME_INF) &&
|
|
|
|
(sysctl_sched_rt_runtime > sysctl_sched_rt_period))
|
2013-12-17 18:44:49 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void sched_rt_do_global(void)
|
|
|
|
{
|
|
|
|
def_rt_bandwidth.rt_runtime = global_rt_runtime();
|
|
|
|
def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period());
|
2013-02-07 22:47:04 +07:00
|
|
|
}
|
|
|
|
|
2008-04-20 00:44:57 +07:00
|
|
|
int sched_rt_handler(struct ctl_table *table, int write,
|
2009-09-24 05:57:19 +07:00
|
|
|
void __user *buffer, size_t *lenp,
|
2008-04-20 00:44:57 +07:00
|
|
|
loff_t *ppos)
|
|
|
|
{
|
|
|
|
int old_period, old_runtime;
|
|
|
|
static DEFINE_MUTEX(mutex);
|
2013-12-17 18:44:49 +07:00
|
|
|
int ret;
|
2008-04-20 00:44:57 +07:00
|
|
|
|
|
|
|
mutex_lock(&mutex);
|
|
|
|
old_period = sysctl_sched_rt_period;
|
|
|
|
old_runtime = sysctl_sched_rt_runtime;
|
|
|
|
|
2009-09-24 05:57:19 +07:00
|
|
|
ret = proc_dointvec(table, write, buffer, lenp, ppos);
|
2008-04-20 00:44:57 +07:00
|
|
|
|
|
|
|
if (!ret && write) {
|
2013-12-17 18:44:49 +07:00
|
|
|
ret = sched_rt_global_validate();
|
|
|
|
if (ret)
|
|
|
|
goto undo;
|
|
|
|
|
2015-03-17 18:15:31 +07:00
|
|
|
ret = sched_dl_global_validate();
|
2013-12-17 18:44:49 +07:00
|
|
|
if (ret)
|
|
|
|
goto undo;
|
|
|
|
|
2015-03-17 18:15:31 +07:00
|
|
|
ret = sched_rt_global_constraints();
|
2013-12-17 18:44:49 +07:00
|
|
|
if (ret)
|
|
|
|
goto undo;
|
|
|
|
|
|
|
|
sched_rt_do_global();
|
|
|
|
sched_dl_do_global();
|
|
|
|
}
|
|
|
|
if (0) {
|
|
|
|
undo:
|
|
|
|
sysctl_sched_rt_period = old_period;
|
|
|
|
sysctl_sched_rt_runtime = old_runtime;
|
2008-04-20 00:44:57 +07:00
|
|
|
}
|
|
|
|
mutex_unlock(&mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
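The handler above follows a snapshot, validate, and roll back pattern: it saves the old values, lets the write land, and restores the snapshot if any check fails. A minimal user-space sketch of that pattern follows. The names `struct rt_tunables`, `tunables_check`, and `tunables_update` are hypothetical, the use of -1 for "no limit" is an assumption of this sketch, and the mutex and the deadline-side checks are omitted.

```c
#include <errno.h>

/* Hypothetical pair of tunables, standing in for the two sysctl values. */
struct rt_tunables {
	int period_us;
	int runtime_us;  /* -1 means "no limit" in this sketch */
};

/* Reject a non-positive period or a runtime larger than the period. */
static int tunables_check(const struct rt_tunables *t)
{
	if (t->period_us <= 0)
		return -EINVAL;
	if (t->runtime_us != -1 && t->runtime_us > t->period_us)
		return -EINVAL;
	return 0;
}

/* Apply new values; on validation failure restore the snapshot. */
static int tunables_update(struct rt_tunables *live,
			   int period_us, int runtime_us)
{
	const struct rt_tunables old = *live;  /* snapshot before the write */
	int ret;

	live->period_us = period_us;
	live->runtime_us = runtime_us;

	ret = tunables_check(live);
	if (ret)
		*live = old;  /* undo: the caller never observes bad values */
	return ret;
}
```

The original reaches its restore code through a label placed inside an `if (0)` block, which keeps the undo path out of the normal flow; the sketch gets the same effect with a plain conditional.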
2007-10-19 13:41:03 +07:00
|
|
|
|
2013-12-17 18:44:49 +07:00
|
|
|
int sched_rr_handler(struct ctl_table *table, int write,
|
2013-11-07 20:43:45 +07:00
|
|
|
void __user *buffer, size_t *lenp,
|
|
|
|
loff_t *ppos)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
static DEFINE_MUTEX(mutex);
|
|
|
|
|
|
|
|
mutex_lock(&mutex);
|
|
|
|
ret = proc_dointvec(table, write, buffer, lenp, ppos);
|
2013-12-17 18:44:49 +07:00
|
|
|
/* make sure that internally we keep jiffies */
|
|
|
|
/* also, writing zero resets timeslice to default */
|
2013-11-07 20:43:45 +07:00
|
|
|
if (!ret && write) {
|
2013-12-17 18:44:49 +07:00
|
|
|
sched_rr_timeslice = sched_rr_timeslice <= 0 ?
|
|
|
|
RR_TIMESLICE : msecs_to_jiffies(sched_rr_timeslice);
|
2013-11-07 20:43:45 +07:00
|
|
|
}
|
|
|
|
mutex_unlock(&mutex);
|
|
|
|
return ret;
|
|
|
|
}
|
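The two comments in the handler above (the value is kept internally in jiffies, and writing zero resets the timeslice to its default) can be illustrated with a small conversion helper. This is a simplified sketch: `ms_to_ticks`, `rr_timeslice_ticks`, and `DEFAULT_SLICE_MS` are hypothetical names, the tick rate is passed as a parameter rather than taken from a build-time constant, and the 100 ms default and the round-up conversion are assumptions of this sketch.

```c
/* Assumed default timeslice for this sketch, in milliseconds. */
#define DEFAULT_SLICE_MS 100

/* Convert milliseconds to ticks at hz ticks per second, rounding up. */
static long ms_to_ticks(long ms, long hz)
{
	return (ms * hz + 999) / 1000;
}

/*
 * Value stored after a write: zero or a negative number selects the
 * default, anything else is converted from milliseconds to ticks.
 */
static long rr_timeslice_ticks(long written_ms, long hz)
{
	if (written_ms <= 0)
		return ms_to_ticks(DEFAULT_SLICE_MS, hz);
	return ms_to_ticks(written_ms, hz);
}
```

At 250 ticks per second, writing 0 yields 25 ticks under these assumptions, and writing 10 yields 3 ticks because 2.5 is rounded up.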
|
|
|
|
2008-02-13 21:45:40 +07:00
|
|
|
#ifdef CONFIG_CGROUP_SCHED
|
2007-10-19 13:41:03 +07:00
|
|
|
|
2013-08-09 07:11:23 +07:00
|
|
|
static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
|
2007-10-19 13:41:03 +07:00
|
|
|
{
|
2013-08-09 07:11:23 +07:00
|
|
|
return css ? container_of(css, struct task_group, css) : NULL;
|
2007-10-19 13:41:03 +07:00
|
|
|
}
|
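The helper above recovers the enclosing task_group from a pointer to its embedded css member. The same idea can be shown in portable user-space C with `offsetof`. The macro `container_of_sketch` and the two structs are hypothetical stand-ins written for this example; they are not the kernel's definitions.

```c
#include <stddef.h>

/* Step back from a member pointer to the start of the enclosing struct. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical embedded state object. */
struct css_sketch {
	int id;
};

/* Hypothetical outer object that embeds the state object. */
struct group_sketch {
	long shares;
	struct css_sketch css;
};

/* NULL-safe conversion, mirroring the shape of the helper above. */
static struct group_sketch *css_to_group(struct css_sketch *css)
{
	return css ? container_of_sketch(css, struct group_sketch, css) : NULL;
}
```

The explicit NULL test matters because subtracting a member offset from a NULL pointer would otherwise produce a bogus non-NULL address.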
|
|
|
|
2013-08-09 07:11:23 +07:00
|
|
|
static struct cgroup_subsys_state *
|
|
|
|
cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
|
2007-10-19 13:41:03 +07:00
|
|
|
{
|
2013-08-09 07:11:23 +07:00
|
|
|
struct task_group *parent = css_tg(parent_css);
|
|
|
|
struct task_group *tg;
|
2007-10-19 13:41:03 +07:00
|
|
|
|
2013-08-09 07:11:23 +07:00
|
|
|
if (!parent) {
|
2007-10-19 13:41:03 +07:00
|
|
|
/* This is early initialization for the top cgroup */
|
2011-01-07 14:17:36 +07:00
|
|
|
return &root_task_group.css;
|
2007-10-19 13:41:03 +07:00
|
|
|
}
|
|
|
|
|
2008-04-20 00:44:59 +07:00
|
|
|
tg = sched_create_group(parent);
|
2007-10-19 13:41:03 +07:00
|
|
|
if (IS_ERR(tg))
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
2016-03-16 22:22:45 +07:00
|
|
|
sched_online_group(tg, parent);
|
|
|
|
|
2007-10-19 13:41:03 +07:00
|
|
|
return &tg->css;
|
|
|
|
}
|
|
|
|
|
2016-03-16 22:22:45 +07:00
|
|
|
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
|
sched: split out css_online/css_offline from tg creation/destruction
This is a preparation for later patches.
- What do we gain from cpu_cgroup_css_online():
After ss->css_alloc() and before ss->css_online(), there's a small
window in which tg->css.cgroup is NULL. With this change, tg won't be seen
before ss->css_online(), where it's added to the global list, so we're
guaranteed we'll never see NULL tg->css.cgroup.
- What do we gain from cpu_cgroup_css_offline():
tg is freed via RCU, and so is cgroup. Without this change, this is how
synchronization works:
cgroup_rmdir()
no ss->css_offline()
diput()
synchronize_rcu()
ss->css_free() <-- unregister tg, and free it via call_rcu()
kfree_rcu(cgroup) <-- wait for possible refs to cgroup, and free cgroup
We can't just kfree(cgroup), because tg might access tg->css.cgroup.
With this change:
cgroup_rmdir()
ss->css_offline() <-- unregister tg
diput()
synchronize_rcu() <-- wait for possible refs to tg and cgroup
ss->css_free() <-- free tg
kfree_rcu(cgroup) <-- free cgroup
As you see, kfree_rcu() is redundant now.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 13:30:48 +07:00
|
|
|
{
|
2013-08-09 07:11:23 +07:00
|
|
|
struct task_group *tg = css_tg(css);
|
2013-01-24 13:30:48 +07:00
|
|
|
|
2016-03-16 22:22:45 +07:00
|
|
|
sched_offline_group(tg);
|
2013-01-24 13:30:48 +07:00
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:23 +07:00
|
|
|
static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
|
2007-10-19 13:41:03 +07:00
|
|
|
{
|
2013-08-09 07:11:23 +07:00
|
|
|
struct task_group *tg = css_tg(css);
|
2007-10-19 13:41:03 +07:00
|
|
|
|
2016-03-16 22:22:45 +07:00
|
|
|
/*
|
|
|
|
* Relies on the RCU grace period between css_released() and this.
|
|
|
|
*/
|
|
|
|
sched_free_group(tg);
|
2013-01-24 13:30:48 +07:00
|
|
|
}
|
|
|
|
|
2016-06-17 18:38:55 +07:00
|
|
|
/*
|
|
|
|
* This is called before wake_up_new_task(), therefore we really only
|
|
|
|
* have to set its group bits, all the other stuff does not apply.
|
|
|
|
*/
|
2015-12-03 22:24:08 +07:00
|
|
|
static void cpu_cgroup_fork(struct task_struct *task)
|
2014-10-27 17:18:25 +07:00
|
|
|
{
|
2016-06-17 18:38:55 +07:00
|
|
|
struct rq_flags rf;
|
|
|
|
struct rq *rq;
|
|
|
|
|
|
|
|
rq = task_rq_lock(task, &rf);
|
|
|
|
|
|
|
|
sched_change_group(task, TASK_SET_GROUP);
|
|
|
|
|
|
|
|
task_rq_unlock(rq, task, &rf);
|
2014-10-27 17:18:25 +07:00
|
|
|
}
|
|
|
|
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
updating the cgroup_taskset iteration helpers to also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's a single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accommodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 22:18:21 +07:00
|
|
|
static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
|
2007-10-19 13:41:03 +07:00
|
|
|
{
|
2011-12-13 09:12:21 +07:00
|
|
|
struct task_struct *task;
|
2015-12-03 22:18:21 +07:00
|
|
|
struct cgroup_subsys_state *css;
|
2016-06-16 18:29:28 +07:00
|
|
|
int ret = 0;
|
2011-12-13 09:12:21 +07:00
|
|
|
|
2015-12-03 22:18:21 +07:00
|
|
|
cgroup_taskset_for_each(task, css, tset) {
|
2008-02-13 21:45:40 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2013-08-09 07:11:23 +07:00
|
|
|
if (!sched_rt_can_attach(css_tg(css), task))
|
2011-12-13 09:12:21 +07:00
|
|
|
return -EINVAL;
|
2008-02-13 21:45:40 +07:00
|
|
|
#else
|
2011-12-13 09:12:21 +07:00
|
|
|
/* We don't support RT-tasks being in separate groups */
|
|
|
|
if (task->sched_class != &fair_sched_class)
|
|
|
|
return -EINVAL;
|
2008-02-13 21:45:40 +07:00
|
|
|
#endif
|
2016-06-16 18:29:28 +07:00
|
|
|
/*
|
|
|
|
* Serialize against wake_up_new_task() such that if it's
|
|
|
|
* running, we're sure to observe its full state.
|
|
|
|
*/
|
|
|
|
raw_spin_lock_irq(&task->pi_lock);
|
|
|
|
/*
|
|
|
|
* Avoid calling sched_move_task() before wake_up_new_task()
|
|
|
|
* has happened. This would lead to problems with PELT, due to
|
|
|
|
* move wanting to detach+attach while we're not attached yet.
|
|
|
|
*/
|
|
|
|
if (task->state == TASK_NEW)
|
|
|
|
ret = -EINVAL;
|
|
|
|
raw_spin_unlock_irq(&task->pi_lock);
|
|
|
|
|
|
|
|
if (ret)
|
|
|
|
break;
|
2011-12-13 09:12:21 +07:00
|
|
|
}
|
2016-06-16 18:29:28 +07:00
|
|
|
return ret;
|
2009-09-24 05:56:31 +07:00
|
|
|
}
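The per-task destination check that cpu_cgroup_can_attach() performs can be modelled outside the kernel. What follows is a minimal sketch under stated assumptions, not the kernel implementation: the `*_sketch` types and their fields are hypothetical stand-ins for task_struct, cgroup_subsys_state and the taskset, and the RT admission rule is reduced to a single flag. The point it illustrates is that every task is validated against its own destination rather than against one shared target.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a destination css: does it accept RT tasks? */
struct css_sketch {
	int allows_rt;
};

/* Hypothetical stand-in for a task being migrated. */
struct task_sketch {
	int is_rt;	/* task runs under an RT scheduling class */
	int is_new;	/* task has not been through wake_up_new_task() yet */
};

/* One entry of the (hypothetical) taskset: a task and its own destination. */
struct migration_sketch {
	const struct task_sketch *task;
	const struct css_sketch *dst;
};

/*
 * Validate each task against its own destination. Returns 0 when every
 * task may move, or -EINVAL on the first task that may not.
 */
static int can_attach_sketch(const struct migration_sketch *set, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (set[i].task->is_rt && !set[i].dst->allows_rt)
			return -EINVAL;
		/* Mirrors the TASK_NEW rejection in the source. */
		if (set[i].task->is_new)
			return -EINVAL;
	}
	return 0;
}
```

A set in which two tasks head for two different destinations is accepted as long as each pairing is valid on its own; a single invalid pairing rejects the whole migration.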
|
2007-10-19 13:41:03 +07:00
|
|
|
|
2015-12-03 22:18:21 +07:00
|
|
|
static void cpu_cgroup_attach(struct cgroup_taskset *tset)
|
2007-10-19 13:41:03 +07:00
|
|
|
{
|
2011-12-13 09:12:21 +07:00
|
|
|
struct task_struct *task;
|
2015-12-03 22:18:21 +07:00
|
|
|
struct cgroup_subsys_state *css;
|
2011-12-13 09:12:21 +07:00
|
|
|
|
2015-12-03 22:18:21 +07:00
|
|
|
cgroup_taskset_for_each(task, css, tset)
|
2011-12-13 09:12:21 +07:00
|
|
|
sched_move_task(task);
|
2007-10-19 13:41:03 +07:00
|
|
|
}
|
|
|
|
|
2008-02-13 21:45:40 +07:00
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
2013-08-09 07:11:24 +07:00
|
|
|
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cftype, u64 shareval)
|
2007-10-19 13:41:03 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return sched_group_set_shares(css_tg(css), scale_load(shareval));
|
2007-10-19 13:41:03 +07:00
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cft)
|
2007-10-19 13:41:03 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
struct task_group *tg = css_tg(css);
|
2007-10-19 13:41:03 +07:00
|
|
|
|
sched: Increase SCHED_LOAD_SCALE resolution
Introduce SCHED_LOAD_RESOLUTION, which is added to
SCHED_LOAD_SHIFT and increases the resolution of
SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all
sched entities by a factor of 1024. With this extra resolution,
we can handle deeper cgroup hierarchies and the scheduler can do
better shares distribution and load balancing on larger
systems (especially for low weight task groups).
This does not change the existing user interface, the scaled
weights are only used internally. We do not modify
prio_to_weight values or inverses, but use the original weights
when calculating the inverse which is used to scale execution
time delta in calc_delta_mine(). This ensures we do not lose
accuracy when accounting time to the sched entities. Thanks to
Nikunj Dadhania for fixing a bug in c_d_m() that broke fairness.
Below is some analysis of the performance costs/improvements of
this patch.
1. Micro-arch performance costs:
Experiment was to run Ingo's pipe_test_100k 200 times with the
task pinned to one cpu. I measured instruction, cycles and
stalled-cycles for the runs. See:
http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389
for more info.
-tip (baseline):
Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):
964,991,769 instructions # 0.82 insns per cycle
# 0.33 stalled cycles per insn
# ( +- 0.05% )
1,171,186,635 cycles # 0.000 GHz ( +- 0.08% )
306,373,664 stalled-cycles-backend # 26.16% backend cycles idle ( +- 0.28% )
314,933,621 stalled-cycles-frontend # 26.89% frontend cycles idle ( +- 0.34% )
1.122405684 seconds time elapsed ( +- 0.05% )
-tip+patches:
Performance counter stats for './load-scale/pipe-test-100k' (200 runs):
963,624,821 instructions # 0.82 insns per cycle
# 0.33 stalled cycles per insn
# ( +- 0.04% )
1,175,215,649 cycles # 0.000 GHz ( +- 0.08% )
315,321,126 stalled-cycles-backend # 26.83% backend cycles idle ( +- 0.28% )
316,835,873 stalled-cycles-frontend # 26.96% frontend cycles idle ( +- 0.29% )
1.122238659 seconds time elapsed ( +- 0.06% )
With this patch, instructions decrease by ~0.10% and cycles
increase by 0.27%. This doesn't look statistically significant.
The number of stalled cycles in the backend increased from
26.16% to 26.83%. This can be attributed to the shifts we do in
c_d_m() and other places. The fraction of stalled cycles in the
frontend remains about the same, at 26.96% compared to 26.89% in -tip.
2. Balancing low-weight task groups
Test setup: run 50 tasks with random sleep/busy times (biased
around 100ms) in a low weight container (with cpu.shares = 2).
Measure %idle as reported by mpstat over a 10s window.
-tip (baseline):
06:47:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:49 PM  all   94.32    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.62  15888.00
06:47:50 PM  all   94.57    0.00    0.62    0.00    0.00    0.00    0.00    0.00    4.81  16180.00
06:47:51 PM  all   94.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.25  15966.00
06:47:52 PM  all   95.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.19  16053.00
06:47:53 PM  all   94.88    0.06    0.00    0.00    0.00    0.00    0.00    0.00    5.06  15984.00
06:47:54 PM  all   93.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.69  15806.00
06:47:55 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15896.00
06:47:56 PM  all   92.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.13  15716.00
06:47:57 PM  all   94.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.12  15982.00
06:47:58 PM  all   95.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.56  16075.00
Average:     all   94.49    0.01    0.08    0.00    0.00    0.00    0.00    0.00    5.42  15954.60
-tip+patches:
06:47:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:04 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16630.00
06:47:05 PM  all   99.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.31  16580.20
06:47:06 PM  all   99.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.25  16596.00
06:47:07 PM  all   99.20    0.00    0.74    0.00    0.00    0.06    0.00    0.00    0.00  17838.61
06:47:08 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16540.00
06:47:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16575.00
06:47:10 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16614.00
06:47:11 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16588.00
06:47:12 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16593.00
06:47:13 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16551.00
Average:     all   99.84    0.00    0.09    0.00    0.00    0.01    0.00    0.00    0.06  16711.58
We see an improvement in idle% on the system (drops from 5.42% on -tip to 0.06%
with the patches).
Signed-off-by: Nikhil Rao <ncrao@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1305754668-18792-1-git-send-email-ncrao@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-19 04:37:48 +07:00
|
|
|
return (u64) scale_load_down(tg->shares);
|
2007-10-19 13:41:03 +07:00
|
|
|
}
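The pairing of scale_load() on write and scale_load_down() on read is what keeps the user-visible cpu.shares value unchanged while the scheduler works with higher-resolution weights internally. As a rough illustration only, assuming the 10-bit resolution the commit message above describes: the helper names and the macro below are hypothetical, and the kernel's own definitions depend on its configuration.

```c
#include <stdint.h>

/* Assumption: a 10-bit shift, i.e. internal weights are 1024x the user value. */
#define LOAD_RESOLUTION_SKETCH 10

/* User-visible shares -> internal high-resolution weight. */
static inline uint64_t scale_load_sketch(uint64_t shares)
{
	return shares << LOAD_RESOLUTION_SKETCH;
}

/* Internal weight -> user-visible shares. */
static inline uint64_t scale_load_down_sketch(uint64_t weight)
{
	return weight >> LOAD_RESOLUTION_SKETCH;
}
```

With these definitions a low-weight group written as cpu.shares = 2 is stored as 2048, and reading it back yields 2 again, so the extra resolution never leaks into the interface.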
|
2011-07-21 23:43:28 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_CFS_BANDWIDTH
|
2011-07-21 23:43:29 +07:00
|
|
|
static DEFINE_MUTEX(cfs_constraints_mutex);
|
|
|
|
|
2011-07-21 23:43:28 +07:00
|
|
|
const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
|
|
|
|
const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
|
|
|
|
|
2011-07-21 23:43:29 +07:00
|
|
|
static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
|
|
|
|
|
2011-07-21 23:43:28 +07:00
|
|
|
static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
|
|
|
|
{
|
2011-11-08 11:26:33 +07:00
|
|
|
int i, ret = 0, runtime_enabled, runtime_was_enabled;
|
2011-10-25 15:00:11 +07:00
|
|
|
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
|
2011-07-21 23:43:28 +07:00
|
|
|
|
|
|
|
if (tg == &root_task_group)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure we have at least some amount of bandwidth every period. This is
|
|
|
|
* to prevent reaching a state of large arrears when throttled via
|
|
|
|
* entity_tick() resulting in prolonged exit starvation.
|
|
|
|
*/
|
|
|
|
if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Likewise, bound things on the other side by preventing insane quota
|
|
|
|
* periods. This also allows us to normalize in computing quota
|
|
|
|
* feasibility.
|
|
|
|
*/
|
|
|
|
if (period > max_cfs_quota_period)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-06-25 15:19:42 +07:00
|
|
|
/*
|
|
|
|
* Prevent race between setting of cfs_rq->runtime_enabled and
|
|
|
|
* unthrottle_offline_cfs_rqs().
|
|
|
|
*/
|
|
|
|
get_online_cpus();
|
2011-07-21 23:43:29 +07:00
|
|
|
mutex_lock(&cfs_constraints_mutex);
|
|
|
|
ret = __cfs_schedulable(tg, period, quota);
|
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2011-07-21 23:43:31 +07:00
|
|
|
runtime_enabled = quota != RUNTIME_INF;
|
2011-11-08 11:26:33 +07:00
|
|
|
runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
|
2013-10-17 01:16:12 +07:00
|
|
|
/*
|
|
|
|
* If we need to toggle cfs_bandwidth_used, off->on must occur
|
|
|
|
* before making related changes, and on->off must occur afterwards
|
|
|
|
*/
|
|
|
|
if (runtime_enabled && !runtime_was_enabled)
|
|
|
|
cfs_bandwidth_usage_inc();
|
2011-07-21 23:43:28 +07:00
|
|
|
raw_spin_lock_irq(&cfs_b->lock);
|
|
|
|
cfs_b->period = ns_to_ktime(period);
|
|
|
|
cfs_b->quota = quota;
|
2011-07-21 23:43:31 +07:00
|
|
|
|
2011-07-21 23:43:32 +07:00
|
|
|
__refill_cfs_bandwidth_runtime(cfs_b);
|
2011-07-21 23:43:31 +07:00
|
|
|
/* restart the period timer (if active) to handle new period expiry */
|
sched: Cleanup bandwidth timers
Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().
The more I look at that code the more I'm convinced its crack, that
entire __start_cfs_bandwidth() thing is brain melting, we don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if its still enqueued.
Removing that, removes a big part of the problem, no more ugly cancel
loop to get stuck in.
So now, if I understand things right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.
It appears to me that it should be possible to guarantee that same by
unconditionally (re)starting the timer when !queued. Because regardless
what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.
Now, because hrtimers don't come with any serialization guarantees we
must ensure both handler and (re)start loop serialize their access to
the hrtimer to avoid both trying to forward the timer at the same
time.
Update the rt bandwidth timer to match.
This effectively reverts: 09dc4ab03936 ("sched/fair: Fix
tg_set_cfs_bandwidth() deadlock on rq->lock").
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-15 16:41:57 +07:00
|
|
|
if (runtime_enabled)
|
|
|
|
start_cfs_bandwidth(cfs_b);
|
2011-07-21 23:43:28 +07:00
|
|
|
raw_spin_unlock_irq(&cfs_b->lock);
|
|
|
|
|
2014-06-25 15:19:42 +07:00
|
|
|
for_each_online_cpu(i) {
|
2011-07-21 23:43:28 +07:00
|
|
|
struct cfs_rq *cfs_rq = tg->cfs_rq[i];
|
2011-10-25 15:00:11 +07:00
|
|
|
struct rq *rq = cfs_rq->rq;
|
2011-07-21 23:43:28 +07:00
|
|
|
|
|
|
|
raw_spin_lock_irq(&rq->lock);
|
2011-07-21 23:43:31 +07:00
|
|
|
cfs_rq->runtime_enabled = runtime_enabled;
|
2011-07-21 23:43:28 +07:00
|
|
|
cfs_rq->runtime_remaining = 0;
|
2011-07-21 23:43:34 +07:00
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
if (cfs_rq->throttled)
|
2011-07-21 23:43:34 +07:00
|
|
|
unthrottle_cfs_rq(cfs_rq);
|
2011-07-21 23:43:28 +07:00
|
|
|
raw_spin_unlock_irq(&rq->lock);
|
|
|
|
}
|
2013-10-17 01:16:12 +07:00
|
|
|
if (runtime_was_enabled && !runtime_enabled)
|
|
|
|
cfs_bandwidth_usage_dec();
|
2011-07-21 23:43:29 +07:00
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&cfs_constraints_mutex);
|
2014-06-25 15:19:42 +07:00
|
|
|
put_online_cpus();
|
2011-07-21 23:43:28 +07:00
|
|
|
|
2011-07-21 23:43:29 +07:00
|
|
|
return ret;
|
2011-07-21 23:43:28 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
|
|
|
|
{
|
|
|
|
u64 quota, period;
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
period = ktime_to_ns(tg->cfs_bandwidth.period);
|
2011-07-21 23:43:28 +07:00
|
|
|
if (cfs_quota_us < 0)
|
|
|
|
quota = RUNTIME_INF;
|
|
|
|
else
|
|
|
|
quota = (u64)cfs_quota_us * NSEC_PER_USEC;
|
|
|
|
|
|
|
|
return tg_set_cfs_bandwidth(tg, period, quota);
|
|
|
|
}
|
|
|
|
|
|
|
|
long tg_get_cfs_quota(struct task_group *tg)
|
|
|
|
{
|
|
|
|
u64 quota_us;
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
if (tg->cfs_bandwidth.quota == RUNTIME_INF)
|
2011-07-21 23:43:28 +07:00
|
|
|
return -1;
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
quota_us = tg->cfs_bandwidth.quota;
|
2011-07-21 23:43:28 +07:00
|
|
|
do_div(quota_us, NSEC_PER_USEC);
|
|
|
|
|
|
|
|
return quota_us;
|
|
|
|
}
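
The quota setter and getter above are inverses of each other: a negative value in microseconds stands for "no limit", and everything else is scaled between microseconds and nanoseconds. A minimal user-space sketch of that round trip follows. The names `SKETCH_RUNTIME_INF`, `SKETCH_NSEC_PER_USEC`, `quota_us_to_ns()` and `quota_ns_to_us()` are made up for illustration; the sketch assumes the "no limit" sentinel is the all-ones 64-bit value and it is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel constants; both names are hypothetical. */
#define SKETCH_RUNTIME_INF   UINT64_MAX   /* assumed "no limit" sentinel */
#define SKETCH_NSEC_PER_USEC 1000ULL

/* Mirrors the setter: a negative quota means unlimited, else usec -> nsec. */
static uint64_t quota_us_to_ns(long quota_us)
{
	if (quota_us < 0)
		return SKETCH_RUNTIME_INF;
	return (uint64_t)quota_us * SKETCH_NSEC_PER_USEC;
}

/* Mirrors the getter: the sentinel reads back as -1, else nsec -> usec. */
static long quota_ns_to_us(uint64_t quota_ns)
{
	if (quota_ns == SKETCH_RUNTIME_INF)
		return -1;
	/* Sanity check for the sketch: the result must fit the return type. */
	assert(quota_ns / SKETCH_NSEC_PER_USEC <= (uint64_t)INT32_MAX);
	return (long)(quota_ns / SKETCH_NSEC_PER_USEC);
}
```

For example, writing 50000 (50 ms) stores 50000000 ns and reads back as 50000, while writing -1 stores the sentinel and reads back as -1.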
|
|
|
|
|
|
|
|
int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
|
|
|
|
{
|
|
|
|
u64 quota, period;
|
|
|
|
|
|
|
|
period = (u64)cfs_period_us * NSEC_PER_USEC;
|
2011-10-25 15:00:11 +07:00
|
|
|
quota = tg->cfs_bandwidth.quota;
|
2011-07-21 23:43:28 +07:00
|
|
|
|
|
|
|
return tg_set_cfs_bandwidth(tg, period, quota);
|
|
|
|
}
|
|
|
|
|
|
|
|
long tg_get_cfs_period(struct task_group *tg)
|
|
|
|
{
|
|
|
|
u64 cfs_period_us;
|
|
|
|
|
2011-10-25 15:00:11 +07:00
|
|
|
cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
|
2011-07-21 23:43:28 +07:00
|
|
|
do_div(cfs_period_us, NSEC_PER_USEC);
|
|
|
|
|
|
|
|
return cfs_period_us;
|
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cft)
|
2011-07-21 23:43:28 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return tg_get_cfs_quota(css_tg(css));
|
2011-07-21 23:43:28 +07:00
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static int cpu_cfs_quota_write_s64(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cftype, s64 cfs_quota_us)
|
2011-07-21 23:43:28 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return tg_set_cfs_quota(css_tg(css), cfs_quota_us);
|
2011-07-21 23:43:28 +07:00
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static u64 cpu_cfs_period_read_u64(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cft)
|
2011-07-21 23:43:28 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return tg_get_cfs_period(css_tg(css));
|
2011-07-21 23:43:28 +07:00
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cftype, u64 cfs_period_us)
|
2011-07-21 23:43:28 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return tg_set_cfs_period(css_tg(css), cfs_period_us);
|
2011-07-21 23:43:28 +07:00
|
|
|
}
|
|
|
|
|
2011-07-21 23:43:29 +07:00
|
|
|
struct cfs_schedulable_data {
|
|
|
|
struct task_group *tg;
|
|
|
|
u64 period, quota;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* normalize group quota/period to be quota/max_period
|
|
|
|
* note: units are usecs
|
|
|
|
*/
|
|
|
|
static u64 normalize_cfs_quota(struct task_group *tg,
|
|
|
|
struct cfs_schedulable_data *d)
|
|
|
|
{
|
|
|
|
u64 quota, period;
|
|
|
|
|
|
|
|
if (tg == d->tg) {
|
|
|
|
period = d->period;
|
|
|
|
quota = d->quota;
|
|
|
|
} else {
|
|
|
|
period = tg_get_cfs_period(tg);
|
|
|
|
quota = tg_get_cfs_quota(tg);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* note: these should typically be equivalent */
|
|
|
|
if (quota == RUNTIME_INF || quota == -1)
|
|
|
|
return RUNTIME_INF;
|
|
|
|
|
|
|
|
return to_ratio(period, quota);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
|
|
|
|
{
|
|
|
|
struct cfs_schedulable_data *d = data;
|
2011-10-25 15:00:11 +07:00
|
|
|
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
|
2011-07-21 23:43:29 +07:00
|
|
|
s64 quota = 0, parent_quota = -1;
|
|
|
|
|
|
|
|
if (!tg->parent) {
|
|
|
|
quota = RUNTIME_INF;
|
|
|
|
} else {
|
2011-10-25 15:00:11 +07:00
|
|
|
struct cfs_bandwidth *parent_b = &tg->parent->cfs_bandwidth;
|
2011-07-21 23:43:29 +07:00
|
|
|
|
|
|
|
quota = normalize_cfs_quota(tg, d);
|
2014-09-21 08:24:36 +07:00
|
|
|
parent_quota = parent_b->hierarchical_quota;
|
2011-07-21 23:43:29 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* ensure max(child_quota) <= parent_quota, inherit when no
|
|
|
|
* limit is set
|
|
|
|
*/
|
|
|
|
if (quota == RUNTIME_INF)
|
|
|
|
quota = parent_quota;
|
|
|
|
else if (parent_quota != RUNTIME_INF && quota > parent_quota)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2014-09-21 08:24:36 +07:00
|
|
|
cfs_b->hierarchical_quota = quota;
|
2011-07-21 23:43:29 +07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
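
The per-group check in tg_cfs_schedulable_down() boils down to one rule: a group without a limit inherits its parent's effective quota, and a group with a limit must not exceed a limited parent. The sketch below restates that rule on plain integers so it can be tried in user space. `SKETCH_INF` and `effective_quota()` are hypothetical names, and the quotas are assumed to be already normalized ratios; this is an illustration, not the kernel routine.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INF UINT64_MAX   /* hypothetical "no limit" marker */

/*
 * Compute the effective quota of a child group given its own normalized
 * quota and the parent's effective quota. Returns 0 and stores the result
 * in *effective, or -EINVAL if the child asks for more than its parent.
 */
static int effective_quota(uint64_t quota, uint64_t parent_quota,
			   uint64_t *effective)
{
	assert(effective != NULL);

	if (quota == SKETCH_INF)
		quota = parent_quota;          /* inherit when no limit is set */
	else if (parent_quota != SKETCH_INF && quota > parent_quota)
		return -EINVAL;                /* child exceeds its parent */

	*effective = quota;
	return 0;
}
```

Because the tree is walked top-down, each parent's effective quota is already known when its children are checked, which is why a single comparison per group is enough.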
|
|
|
|
|
|
|
|
static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
|
|
|
|
{
|
2011-07-21 23:43:35 +07:00
|
|
|
int ret;
|
2011-07-21 23:43:29 +07:00
|
|
|
struct cfs_schedulable_data data = {
|
|
|
|
.tg = tg,
|
|
|
|
.period = period,
|
|
|
|
.quota = quota,
|
|
|
|
};
|
|
|
|
|
|
|
|
if (quota != RUNTIME_INF) {
|
|
|
|
do_div(data.period, NSEC_PER_USEC);
|
|
|
|
do_div(data.quota, NSEC_PER_USEC);
|
|
|
|
}
|
|
|
|
|
2011-07-21 23:43:35 +07:00
|
|
|
rcu_read_lock();
|
|
|
|
ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
2011-07-21 23:43:29 +07:00
|
|
|
}
|
2011-07-21 23:43:40 +07:00
|
|
|
|
2013-12-06 00:28:04 +07:00
|
|
|
static int cpu_stats_show(struct seq_file *sf, void *v)
|
2011-07-21 23:43:40 +07:00
|
|
|
{
|
2013-12-06 00:28:04 +07:00
|
|
|
struct task_group *tg = css_tg(seq_css(sf));
|
2011-10-25 15:00:11 +07:00
|
|
|
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
|
2011-07-21 23:43:40 +07:00
|
|
|
|
2013-12-06 00:28:01 +07:00
|
|
|
seq_printf(sf, "nr_periods %d\n", cfs_b->nr_periods);
|
|
|
|
seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
|
|
|
|
seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
|
2011-07-21 23:43:40 +07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2011-07-21 23:43:28 +07:00
|
|
|
#endif /* CONFIG_CFS_BANDWIDTH */
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_FAIR_GROUP_SCHED */
|
2007-10-19 13:41:03 +07:00
|
|
|
|
2008-02-13 21:45:40 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2013-08-09 07:11:24 +07:00
|
|
|
static int cpu_rt_runtime_write(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cft, s64 val)
|
2008-01-26 03:08:30 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return sched_group_set_rt_runtime(css_tg(css), val);
|
2008-01-26 03:08:30 +07:00
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static s64 cpu_rt_runtime_read(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cft)
|
2008-01-26 03:08:30 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return sched_group_rt_runtime(css_tg(css));
|
2008-01-26 03:08:30 +07:00
|
|
|
}
|
2008-04-20 00:44:57 +07:00
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static int cpu_rt_period_write_uint(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cftype, u64 rt_period_us)
|
2008-04-20 00:44:57 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return sched_group_set_rt_period(css_tg(css), rt_period_us);
|
2008-04-20 00:44:57 +07:00
|
|
|
}
|
|
|
|
|
2013-08-09 07:11:24 +07:00
|
|
|
static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
|
|
|
|
struct cftype *cft)
|
2008-04-20 00:44:57 +07:00
|
|
|
{
|
2013-08-09 07:11:24 +07:00
|
|
|
return sched_group_rt_period(css_tg(css));
|
2008-04-20 00:44:57 +07:00
|
|
|
}
|
2008-05-30 19:23:45 +07:00
|
|
|
#endif /* CONFIG_RT_GROUP_SCHED */
|
2008-01-26 03:08:30 +07:00
|
|
|
|
2007-10-30 03:18:11 +07:00
|
|
|
static struct cftype cpu_files[] = {
|
2008-02-13 21:45:40 +07:00
|
|
|
#ifdef CONFIG_FAIR_GROUP_SCHED
|
2007-10-30 03:18:11 +07:00
|
|
|
{
|
|
|
|
.name = "shares",
|
2008-04-29 14:59:56 +07:00
|
|
|
.read_u64 = cpu_shares_read_u64,
|
|
|
|
.write_u64 = cpu_shares_write_u64,
|
2007-10-30 03:18:11 +07:00
|
|
|
},
|
2008-02-13 21:45:40 +07:00
|
|
|
#endif
|
2011-07-21 23:43:28 +07:00
|
|
|
#ifdef CONFIG_CFS_BANDWIDTH
|
|
|
|
{
|
|
|
|
.name = "cfs_quota_us",
|
|
|
|
.read_s64 = cpu_cfs_quota_read_s64,
|
|
|
|
.write_s64 = cpu_cfs_quota_write_s64,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = "cfs_period_us",
|
|
|
|
.read_u64 = cpu_cfs_period_read_u64,
|
|
|
|
.write_u64 = cpu_cfs_period_write_u64,
|
|
|
|
},
|
2011-07-21 23:43:40 +07:00
|
|
|
{
|
|
|
|
.name = "stat",
|
2013-12-06 00:28:04 +07:00
|
|
|
.seq_show = cpu_stats_show,
|
2011-07-21 23:43:40 +07:00
|
|
|
},
|
2011-07-21 23:43:28 +07:00
|
|
|
#endif
|
2008-02-13 21:45:40 +07:00
|
|
|
#ifdef CONFIG_RT_GROUP_SCHED
|
2008-01-26 03:08:30 +07:00
|
|
|
{
|
2008-02-13 21:45:39 +07:00
|
|
|
.name = "rt_runtime_us",
|
2008-04-29 15:00:06 +07:00
|
|
|
.read_s64 = cpu_rt_runtime_read,
|
|
|
|
.write_s64 = cpu_rt_runtime_write,
|
2008-01-26 03:08:30 +07:00
|
|
|
},
|
2008-04-20 00:44:57 +07:00
|
|
|
{
|
|
|
|
.name = "rt_period_us",
|
2008-04-29 14:59:56 +07:00
|
|
|
.read_u64 = cpu_rt_period_read_uint,
|
|
|
|
.write_u64 = cpu_rt_period_write_uint,
|
2008-04-20 00:44:57 +07:00
|
|
|
},
|
2008-02-13 21:45:40 +07:00
|
|
|
#endif
|
2012-04-02 02:09:55 +07:00
|
|
|
{ } /* terminate */
|
2007-10-19 13:41:03 +07:00
|
|
|
};
|
|
|
|
|
2014-02-08 22:36:58 +07:00
|
|
|
struct cgroup_subsys cpu_cgrp_subsys = {
|
2012-11-19 23:13:38 +07:00
|
|
|
.css_alloc = cpu_cgroup_css_alloc,
|
2016-03-16 22:22:45 +07:00
|
|
|
.css_released = cpu_cgroup_css_released,
|
2012-11-19 23:13:38 +07:00
|
|
|
.css_free = cpu_cgroup_css_free,
|
2014-10-27 17:18:25 +07:00
|
|
|
.fork = cpu_cgroup_fork,
|
2011-12-13 09:12:21 +07:00
|
|
|
.can_attach = cpu_cgroup_can_attach,
|
|
|
|
.attach = cpu_cgroup_attach,
|
2014-07-15 22:05:09 +07:00
|
|
|
.legacy_cftypes = cpu_files,
|
2016-02-23 22:00:50 +07:00
|
|
|
.early_init = true,
|
2007-10-19 13:41:03 +07:00
|
|
|
};
|
|
|
|
|
2008-02-13 21:45:40 +07:00
|
|
|
#endif /* CONFIG_CGROUP_SCHED */
|
2007-12-03 02:04:49 +07:00
|
|
|
|
2012-09-20 06:58:38 +07:00
|
|
|
void dump_cpu_task(int cpu)
|
|
|
|
{
|
|
|
|
pr_info("Task dump for CPU %d:\n", cpu);
|
|
|
|
sched_show_task(cpu_curr(cpu));
|
|
|
|
}
|
2015-11-30 11:59:43 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Nice levels are multiplicative, with a gentle 10% change for every
|
|
|
|
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
|
|
|
|
* nice 1, it will get ~10% less CPU time than another CPU-bound task
|
|
|
|
* that remained on nice 0.
|
|
|
|
*
|
|
|
|
* The "10% effect" is relative and cumulative: from _any_ nice level,
|
|
|
|
* if you go up 1 level, it's -10% CPU usage, if you go down 1 level
|
|
|
|
* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
|
|
|
|
* If a task goes up by ~10% and another task goes down by ~10% then
|
|
|
|
* the relative distance between them is ~25%.)
|
|
|
|
*/
|
|
|
|
const int sched_prio_to_weight[40] = {
|
|
|
|
/* -20 */ 88761, 71755, 56483, 46273, 36291,
|
|
|
|
/* -15 */ 29154, 23254, 18705, 14949, 11916,
|
|
|
|
/* -10 */ 9548, 7620, 6100, 4904, 3906,
|
|
|
|
/* -5 */ 3121, 2501, 1991, 1586, 1277,
|
|
|
|
/* 0 */ 1024, 820, 655, 526, 423,
|
|
|
|
/* 5 */ 335, 272, 215, 172, 137,
|
|
|
|
/* 10 */ 110, 87, 70, 56, 45,
|
|
|
|
/* 15 */ 36, 29, 23, 18, 15,
|
|
|
|
};
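
A small worked example of the "10% effect", using the nice 0 and nice 1 entries from the table above. It assumes that two CPU-bound tasks competing on one CPU split the time in proportion to their weights; `share_permille()` is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * CPU share of a task with weight w competing against one other task with
 * weight other, in thousandths (integer division truncates).
 */
static unsigned int share_permille(unsigned int w, unsigned int other)
{
	assert(w + other != 0);
	return (unsigned int)(((uint64_t)w * 1000) / ((uint64_t)w + other));
}
```

With both tasks at nice 0 (1024 vs 1024) each gets 500/1000. Moving one task to nice 1 (820 vs 1024) gives it 444/1000 and the other 555/1000, so one task drops by roughly 10% of its share, the other rises by roughly 10%, and the ratio between the two weights is 1024/820, about 1.25.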
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
|
|
|
|
*
|
|
|
|
* In cases where the weight does not change often, we can use the
|
|
|
|
* precalculated inverse to speed up arithmetic by turning divisions
|
|
|
|
* into multiplications:
|
|
|
|
*/
|
|
|
|
const u32 sched_prio_to_wmult[40] = {
|
|
|
|
/* -20 */ 48388, 59856, 76040, 92818, 118348,
|
|
|
|
/* -15 */ 147320, 184698, 229616, 287308, 360437,
|
|
|
|
/* -10 */ 449829, 563644, 704093, 875809, 1099582,
|
|
|
|
/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
|
|
|
|
/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
|
|
|
|
/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
|
|
|
|
/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
|
|
|
|
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
|
|
|
|
};
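
The idea behind the inverse table can be sketched in a few lines: dividing by a weight is approximated by multiplying with 2^32/weight and shifting right by 32. The helpers `inverse_weight()` and `div_by_inverse()` below are hypothetical and only illustrate the trick on 32-bit inputs; they are not the scheduler's own delta calculation. Some table entries appear to be rounded rather than truncated (the entry for weight 820, for instance), so the truncating helper can differ from the table by one.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute 2^32 / weight, truncated. */
static uint32_t inverse_weight(uint32_t weight)
{
	assert(weight > 1);   /* weight 1 would need 2^32, which overflows u32 */
	return (uint32_t)((UINT64_C(1) << 32) / weight);
}

/*
 * Approximate x / weight as (x * inv) >> 32. x is limited to 32 bits here
 * so that the product always fits in 64 bits.
 */
static uint32_t div_by_inverse(uint32_t x, uint32_t inv)
{
	return (uint32_t)(((uint64_t)x * inv) >> 32);
}
```

For weight 1024 the inverse is exactly 4194304, so 1024000 / 1024 comes out as 1000 with a single multiply and shift. With a truncated inverse the result can be one lower than the exact quotient, which is acceptable when only relative proportions matter.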
|