linux_dsm_epyc7002/kernel/sched
Daniel Bristot de Oliveira feff2e65ef sched/deadline: Unthrottle PI boosted threads while enqueuing
stress-ng has a test (stress-ng --cyclic) that creates a set of threads
under SCHED_DEADLINE with the following parameters:

    dl_runtime   =  10000 (10 us)
    dl_deadline  = 100000 (100 us)
    dl_period    = 100000 (100 us)

These parameters are very aggressive. When using a system without HRTICK
set, these threads can easily execute longer than the dl_runtime because
the throttling happens with 1/HZ resolution.

During the main part of the test, the system works just fine because
the workload does not try to run over the 10 us. The problem happens at
the end of the test, on the exit() path. During exit(), the threads need
to do some cleanups that require real-time mutex locks, mainly those
related to memory management, resulting in this scenario:

Note: locks are rt_mutexes...
 ------------------------------------------------------------------------
    TASK A:		TASK B:				TASK C:
    activation
							activation
			activation

    lock(a): OK!	lock(b): OK!
    			<overrun runtime>
    			lock(a)
    			-> block (task A owns it)
			  -> self notice/set throttled
 +--<			  -> arm replenished timer
 |    			switch-out
 |    							lock(b)
 |    							-> <C prio > B prio>
 |    							-> boost TASK B
 |  unlock(a)						switch-out
 |  -> handle lock a to B
 |    -> wakeup(B)
 |      -> B is throttled:
 |        -> do not enqueue
 |     switch-out
 |
 |
 +---------------------> replenishment timer
			-> TASK B is boosted:
			  -> do not enqueue
 ------------------------------------------------------------------------

BOOM: TASK B is runnable but !enqueued, holding TASK C: the system
crashes with hung task C.

This problem is avoided by removing the throttle state from the boosted
thread while boosting it (by TASK A in the example above), allowing it to
be queued and run boosted.

The next replenishment will take care of the runtime overrun, pushing
the deadline further away. See the "while (dl_se->runtime <= 0)" on
replenish_dl_entity() for more information.

Reported-by: Mark Simmons <msimmons@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Mark Simmons <msimmons@redhat.com>
Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com
2020-10-03 16:30:53 +02:00
..
autogroup.c sched/autogroup: Make autogroup_path() always available 2019-06-24 19:23:40 +02:00
autogroup.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
clock.c sched/clock: Use static_branch_likely() with sched_clock_running 2019-11-29 08:10:54 +01:00
completion.c completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() 2020-03-23 18:40:25 +01:00
core.c sched/debug: Add new tracepoint to track cpu_capacity 2020-10-03 16:30:52 +02:00
cpuacct.c sched/cpuacct: Fix charge cpuacct.usage_sys 2020-05-19 20:34:14 +02:00
cpudeadline.c sched/deadline: Implement fallback mechanism for !fit case 2020-06-15 14:10:05 +02:00
cpudeadline.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
cpufreq_schedutil.c Power management updates for 5.9-rc1 2020-08-03 20:28:08 -07:00
cpufreq.c cpufreq: Avoid leaving stale IRQ work items during CPU offline 2019-12-12 17:59:43 +01:00
cpupri.c sched/rt: cpupri_find: Trigger a full search as fallback 2020-03-20 13:06:20 +01:00
cpupri.h sched/rt: Optimize cpupri_find() on non-heterogenous systems 2020-03-06 12:57:27 +01:00
cputime.c sched/cputime: Improve cputime_adjust() 2020-06-15 14:10:00 +02:00
deadline.c sched/deadline: Unthrottle PI boosted threads while enqueuing 2020-10-03 16:30:53 +02:00
debug.c sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL 2020-09-09 10:09:03 +02:00
fair.c sched/debug: Add new tracepoint to track cpu_capacity 2020-10-03 16:30:52 +02:00
features.h sched/rt: Disable RT_RUNTIME_SHARE by default 2020-09-25 14:23:24 +02:00
idle.c Merge branch 'sched/urgent' 2020-07-08 11:38:59 +02:00
isolation.c isolcpus: Affine unbound kernel threads to housekeeping cpus 2020-06-15 14:10:03 +02:00
loadavg.c sched: nohz: stop passing around unused "ticks" parameter. 2020-07-22 10:22:04 +02:00
Makefile kcsan: Improve various small stylistic details 2019-11-20 10:47:23 +01:00
membarrier.c rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ 2020-09-25 14:23:27 +02:00
pelt.c sched: Add a tracepoint to track rq->nr_running 2020-07-08 11:39:02 +02:00
pelt.h sched/pelt: Cleanup PELT divider 2020-06-15 14:10:06 +02:00
psi.c sched,psi: Convert to sched_set_fifo_low() 2020-06-15 14:10:25 +02:00
rt.c sched: Remove struct sched_class::next field 2020-06-25 13:45:44 +02:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched: Fix use of count for nr_running tracepoint 2020-08-06 09:36:59 +02:00
smp.h sched/headers: Split out open-coded prototypes into kernel/sched/smp.h 2020-05-28 11:03:20 +02:00
stats.c proc: introduce proc_create_seq{,_data} 2018-05-16 07:23:35 +02:00
stats.h psi: Move PF_MEMSTALL out of task->flags 2020-03-20 13:06:19 +01:00
stop_task.c sched: Remove struct sched_class::next field 2020-06-25 13:45:44 +02:00
swait.c sched/swait: Prepare usage in completions 2020-03-21 16:00:23 +01:00
topology.c sched/fair: Reduce busy load balance interval 2020-09-25 14:23:26 +02:00
wait_bit.c sched/wait: fix ___wait_var_event(exclusive) 2019-12-17 13:32:50 +01:00
wait.c list: add "list_del_init_careful()" to go with "list_empty_careful()" 2020-08-02 20:39:44 -07:00