Commit Graph

2120 Commits

Author SHA1 Message Date
Wanpeng Li
2a42eb9594 sched/cputime: Accumulate vtime on top of nsec clocksource
Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.

This system works ok if the ticks are not aligned across CPUs. If they
instead are aligned (ie: all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:

    CPU 0 (timekeeper)                  CPU 1
    -------------------              -------------
                      jiffies = N
    ...                              run in userspace for a jiffy
    tick entry                       tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                        tick exit (sees jiffies = N + 1)
                                                account 1 jiffy as stime

Fix this with using a nanosec clock source instead of jiffies. The
cputime is then accumulated and flushed everytime the pending delta
reaches a jiffy in order to mitigate the accounting overhead.

[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
  on cputime readers, keep idle vtime as-is (low overhead accounting),
  harmonize clock sources. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:15 +02:00
Frederic Weisbecker
bac5b6b6b1 sched/cputime: Move the vtime task fields to their own struct
We are about to add vtime accumulation fields to the task struct. Let's
avoid more bloatification and gather vtime information to their own
struct.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:15 +02:00
Frederic Weisbecker
60a9ce57e7 sched/cputime: Rename vtime fields
The current "snapshot" based naming on vtime fields suggests we record
some past event but that's a low level picture of their actual purpose
which comes out blurry. The real point of these fields is to run a basic
state machine that tracks down cputime entry while switching between
contexts.

So lets reflect that with more meaningful names.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:14 +02:00
Frederic Weisbecker
9fa57cf5a5 sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime
Even though it doesn't have functional consequences, setting
the task's new context state after we actually accounted the pending
vtime from the old context state makes more sense from a review
perspective.

vtime_user_exit() is the only function that doesn't follow that rule
and that can bug the reviewer for a little while until he realizes there
is no reason for this special case.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:14 +02:00
Frederic Weisbecker
1c3eda01a7 vtime, sched/cputime: Remove vtime_account_user()
It's an unnecessary function between vtime_user_exit() and
account_user_time().

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:14 +02:00
Ingo Molnar
3b9c08ae3d Revert "sched/cputime: Refactor the cputime_adjust() code"
This reverts commit 72298e5c92.

As Peter explains:

> Argh, no... That code was perfectly fine. The new code otoh is
> convoluted.
>
> The old code had the following form:
>
>         if (exception1)
>           deal with exception1
>
>         if (execption2)
>           deal with exception2
>
>         do normal stuff
>
> Which is as simple and straight forward as it gets.
>
> The new code otoh reads like:
>
>         if (!exception1) {
>                 if (exception2)
>                   deal with exception 2
>                 else
>                   do normal stuff
>         }

So restore the old form.

Also fix the comment describing the logic, as it was confusing.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-04 11:58:05 +02:00
Linus Torvalds
9bd42183b9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
     debug checks earlier into the bootup. This turns silent and
     sporadically deadly bugs into nice, deterministic splats. Fix some
     of the splats that triggered. (Thomas Gleixner)

   - A round of restructuring and refactoring of the load-balancing and
     topology code (Peter Zijlstra)

   - Another round of consolidating ~20 of incremental scheduler code
     history: this time in terms of wait-queue nomenclature. (I didn't
     get much feedback on these renaming patches, and we can still
     easily change any names I might have misplaced, so if anyone hates
     a new name, please holler and I'll fix it.) (Ingo Molnar)

   - sched/numa improvements, fixes and updates (Rik van Riel)

   - Another round of x86/tsc scheduler clock code improvements, in hope
     of making it more robust (Peter Zijlstra)

   - Improve NOHZ behavior (Frederic Weisbecker)

   - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
     Bristot de Oliveira)

   - Simplify and optimize the topology setup code (Lauro Ramos
     Venancio)

   - Debloat and decouple scheduler code some more (Nicolas Pitre)

   - Simplify code by making better use of llist primitives (Byungchul
     Park)

   - ... plus other fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  sched/cputime: Refactor the cputime_adjust() code
  sched/debug: Expose the number of RT/DL tasks that can migrate
  sched/numa: Hide numa_wake_affine() from UP build
  sched/fair: Remove effective_load()
  sched/numa: Implement NUMA node level wake_affine()
  sched/fair: Simplify wake_affine() for the single socket case
  sched/numa: Override part of migrate_degrades_locality() when idle balancing
  sched/rt: Move RT related code from sched/core.c to sched/rt.c
  sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
  sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
  sched/fair: Spare idle load balancing on nohz_full CPUs
  nohz: Move idle balancer registration to the idle path
  sched/loadavg: Generalize "_idle" naming to "_nohz"
  sched/core: Drop the unused try_get_task_struct() helper function
  sched/fair: WARN() and refuse to set buddy when !se->on_rq
  sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
  sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
  sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
  sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
  sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
  ...
2017-07-03 13:08:04 -07:00
Linus Torvalds
330e9e4625 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The sole purpose of these changes is to shrink and simplify the RCU
  code base, which has suffered from creeping bloat over the past couple
  of years. The end result is a net removal of ~2700 lines of code:

     79 files changed, 1496 insertions(+), 4211 deletions(-)

  Plus there's a marked reduction in the Kconfig space complexity as
  well, here's the number of matches on 'grep RCU' in the .config:

                               before       after

     x86-defconfig                 17          15
     x86-allmodconfig              33          20"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
  rcu: Remove RCU CPU stall warnings from Tiny RCU
  rcu: Remove event tracing from Tiny RCU
  rcu: Move RCU debug Kconfig options to kernel/rcu
  rcu: Move RCU non-debug Kconfig options to kernel/rcu
  rcu: Eliminate NOCBs CPU-state Kconfig options
  rcu: Remove debugfs tracing
  srcu: Remove Classic SRCU
  srcu: Fix rcutorture-statistics typo
  rcu: Remove SPARSE_RCU_POINTER Kconfig option
  rcu: Remove the now-obsolete PROVE_RCU_REPEATEDLY Kconfig option
  rcu: Remove typecheck() from RCU locking wrapper functions
  rcu: Remove #ifdef moving rcu_end_inkernel_boot from rcupdate.h
  rcu: Remove nohz_full full-system-idle state machine
  rcu: Remove the RCU_KTHREAD_PRIO Kconfig option
  rcu: Remove *_SLOW_* Kconfig options
  srcu: Use rnp->lock wrappers to replace explicit memory barriers
  rcu: Move rnp->lock wrappers for SRCU use
  rcu: Convert rnp->lock wrappers to macros for SRCU use
  rcu: Refactor #includes from include/linux/rcupdate.h
  bcm47xx: Fix build regression
  ...
2017-07-03 11:34:53 -07:00
Gustavo A. R. Silva
72298e5c92 sched/cputime: Refactor the cputime_adjust() code
Address a Coverity false positive, which is caused by overly
convoluted code:

Value assigned to variable 'utime' at line 619:utime = rtime;
is overwritten at line 642:utime = rtime - stime; before it
can be used. This makes such variable assignment useless.

Remove this variable assignment and refactor the code related.

Addresses-Coverity-ID: 1371643
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20170629184128.GA5271@embeddedgus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:37:59 +02:00
Daniel Bristot de Oliveira
48365b3884 sched/debug: Expose the number of RT/DL tasks that can migrate
Add the value of the rt_rq.rt_nr_migratory and dl_rq.dl_nr_migratory
to the sched_debug output, for instance:

 rt_rq[0]:
   .rt_nr_running                 : 2
   .rt_nr_migratory               : 1     <--- Like this
   .rt_throttled                  : 0
   .rt_time                       : 828.645877
   .rt_runtime                    : 1000.000000

This is useful to debug problems related to the RT/DL schedulers.

This also fixes the format of some variables, that were unsigned, rather
than signed.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/7896f71cada54ee7dd8507bb666063a2e051c3d4.1498482127.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:32:07 +02:00
Thomas Gleixner
ff801b716e sched/numa: Hide numa_wake_affine() from UP build
Stephen reported the following build warning in UP:

kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
parameter list
         ^
/home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
definition or declaration, which is probably not what you want

Hide the numa_wake_affine() inline stub on UP builds to get rid of it.

Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-06-29 08:25:52 +02:00
Rik van Riel
815abf5af4 sched/fair: Remove effective_load()
The effective_load() function was only used by the NUMA balancing
code, and not by the regular load balancing code. Now that the
NUMA balancing code no longer uses it either, get rid of it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-5-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:53 +02:00
Rik van Riel
3fed382b46 sched/numa: Implement NUMA node level wake_affine()
Since select_idle_sibling() can place a task anywhere on a socket,
comparing loads between individual CPU cores makes no real sense
for deciding whether to do an affine wakeup across sockets, either.

Instead, compare the load between the sockets in a similar way the
load balancer and the numa balancing code do.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-4-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:52 +02:00
Rik van Riel
7d894e6e34 sched/fair: Simplify wake_affine() for the single socket case
Then 'this_cpu' and 'prev_cpu' are in the same socket, select_idle_sibling()
will do its thing regardless of the return value of wake_affine().

Just return true and don't look at all the other things.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:52 +02:00
Rik van Riel
739294fb03 sched/numa: Override part of migrate_degrades_locality() when idle balancing
Several tests in the NAS benchmark seem to run a lot slower with
NUMA balancing enabled, than with NUMA balancing disabled. The
slower run time corresponds with increased idle time.

Overriding the final test of migrate_degrades_locality (but still
doing the other NUMA tests first) seems to improve performance
of those benchmarks.

Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:46 +02:00
Nicolas Pitre
8887cd9903 sched/rt: Move RT related code from sched/core.c to sched/rt.c
This helps making sched/core.c smaller and hopefully easier to understand and maintain.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-23 10:46:45 +02:00
Nicolas Pitre
06a76fe08d sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
This helps making sched/core.c smaller and hopefully easier to understand and maintain.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-23 10:46:45 +02:00
Nicolas Pitre
e1d4eeec5a sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
Make CONFIG_CPUSETS=y depend on SMP as this feature makes no sense
on UP. This allows for configuring out cpuset_cpumask_can_shrink()
and task_can_attach() entirely, which shrinks the kernel a bit.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170614171926.8345-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-23 10:46:44 +02:00
Frederic Weisbecker
387bc8b553 sched/fair: Spare idle load balancing on nohz_full CPUs
Although idle load balancing obviously only concerns idle CPUs, it can
be a disturbance on a busy nohz_full CPU. Indeed a CPU can only get rid
of an idle load balancing duty once a tick fires while it runs a task
and this can take a while on a nohz_full CPU.

We could fix that and escape the idle load balancing duty from the very
idle exit path but that would bring unecessary overhead. Lets just not
bother and leave that job to housekeeping CPUs (those outside nohz_full
range). The nohz_full CPUs simply don't want any disturbance.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 11:30:02 +02:00
Frederic Weisbecker
3c85d6db5e sched/loadavg: Generalize "_idle" naming to "_nohz"
The loadavg naming code still assumes that nohz == idle whereas its code
is actually handling well both nohz idle and nohz full.

So lets fix the naming according to what the code actually does, to
unconfuse the reader.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 11:30:01 +02:00
Ingo Molnar
902b319413 Merge branch 'WIP.sched/core' into sched/core
Conflicts:
	kernel/sched/Makefile

Pick up the waitqueue related renames - it didn't get much feedback,
so it appears to be uncontroversial. Famous last words? ;-)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:28:21 +02:00
Daniel Axtens
c5ae366e12 sched/fair: WARN() and refuse to set buddy when !se->on_rq
If we set a next or last buddy for a se that is not on_rq, we will
end up taking a NULL pointer dereference in wakeup_preempt_entity
via pick_next_task_fair.

Detect when we would be about to do that, throw a warning and
then refuse to actually set it.

This has been suggested at least twice:

  https://marc.info/?l=linux-kernel&m=146651668921468&w=2
  https://lkml.org/lkml/2016/6/16/663

I recently had to debug a problem with these (we hadn't backported
Konstantin's patches in this area) and this would have saved a lot
of time/pain.

Just do it.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Ben Segall <bsegall@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170510201139.16236-1-dja@axtens.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:26:52 +02:00
Ingo Molnar
6d3aed3d8a sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
This definition of SCHED_WARN_ON():

 #define SCHED_WARN_ON(x)        ((void)(x))

is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it
has no return value, so it cannot be used in conditionals.

Fix it.

Cc: Daniel Axtens <dja@axtens.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:26:52 +02:00
Ingo Molnar
2055da9738 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

	struct wait_queue_head::task_list	=> ::head
	struct wait_queue_entry::task_list	=> ::entry

For example, this code:

	rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

	rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
	list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

	list_for_each_entry_safe(pos, next, &x->head, entry) {
	list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar
5822a454d6 sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
The key hashed waitqueue data structures and their initialization
was done in the main scheduler file for no good reason, move them
to sched/wait_bit.c instead.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar
5dd43ce2f6 sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.

Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.

So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:

  include/linux/wait_bit.h  for types and APIs
  kernel/sched/wait_bit.c   for the implementation

Update all header dependencies.

This reduces the size of wait.h rather significantly, by about 30%.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:09 +02:00
Ingo Molnar
76c85ddc46 sched/wait: Standardize wait_bit_queue naming
So wait-bit-queue head variables are often named:

	struct wait_bit_queue *q

... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!

They are misnomers in two ways:

 - the 'wait_bit_queue' leaves open the question of whether
   it's an entry or a head

 - the 'q' parameter and local variable naming falsely implies
   that it's a 'queue' - while it's an entry.

This resulted in sometimes confusing cases such as:

	finish_wait(wq, &q->wait);

where the 'q' is not a wait-queue head, but a wait-bit-queue entry.

So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:

	struct wait_bit_queue   => struct wait_bit_queue_entry
	q			=> wbq_entry

Which makes it all a much clearer:

	struct wait_bit_queue_entry *wbq_entry

... and turns the former confusing piece of code into:

	finish_wait(wq_head, &wbq_entry->wq_entry;

which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.

I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so Hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:29 +02:00
Ingo Molnar
2141713616 sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.

Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:28 +02:00
Ingo Molnar
9d9d676f59 sched/wait: Standardize internal naming of wait-queue heads
The wait-queue head parameters and variables are named in a
couple of ways, we have the following variants currently:

	wait_queue_head_t *q
	wait_queue_head_t *wq
	wait_queue_head_t *head

In particular the 'wq' naming is ambiguous in the sense whether it's
a wait-queue head or entry name - as entries were often named 'wait'.

( Not to mention the confusion of any readers coming over from
  workqueue-land. )

Standardize all this around a single, unambiguous parameter and
variable name:

	struct wait_queue_head *wq_head

which is easy to grep for and also rhymes nicely with the wait-queue
entry naming:

	struct wait_queue_entry *wq_entry

Also rename:

	struct __wait_queue_head => struct wait_queue_head

... and use this struct type to migrate from typedefs usage to 'struct'
usage, which is more in line with existing kernel practices.

Don't touch any external users and preserve the main wait_queue_head_t
typedef.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:28 +02:00
Ingo Molnar
50816c4899 sched/wait: Standardize internal naming of wait-queue entries
So the various wait-queue entry variables in include/linux/wait.h
and kernel/sched/wait.c are named in a colorfully inconsistent
way:

	wait_queue_entry_t *wait
	wait_queue_entry_t *__wait	(even in plain C code!)
	wait_queue_entry_t *q		(!)
	wait_queue_entry_t *new		(making anyone who knows C++ cringe)
	wait_queue_entry_t *old

I think part of the reason for the inconsistency is the constant
apparent confusion about what a wait queue 'head' versus 'entry' is.

( Some of the documentation talks about a 'wait descriptor', which is
  the wait-queue entry itself - further adding to the confusion. )

The most common name is 'wait', but that in itself is somewhat
ambiguous as well, as it does not really make it clear whether
it's a wait-queue entry or head.

To improve all this name the wait-queue entry structure parameters
and variables consistently and push through this naming into all
the wait.h and wait.c code:

	struct wait_queue_entry *wq_entry

The 'wq_' prefix makes it easy to grep for, and we also use the
opportunity to move away from the typedef to a plain 'struct' naming:
in the kernel we typically reserve typedefs for cases where a
C structure is really small and somewhat opaque - such as pte_t.

wait-queue entries are neither small nor opaque, so use the more
standard 'struct xxx_entry' list management code nomenclature instead.

( We don't touch external users, and we preserve the typedef as well
  for actual wait-queue users, to reduce unnecessary churn. )

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Ingo Molnar
ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Linus Torvalds
0be5255c88 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two small fixes for the schedulre core:

   - Use the proper switch_mm() variant in idle_task_exit() because that
     code is not called with interrupts disabled.

   - Fix a confusing typo in a printk"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
  sched/fair: Fix typo in printk message
2017-06-18 18:45:17 +09:00
Rafael J. Wysocki
f63e4f7d41 Merge branches 'pm-cpufreq', 'pm-cpuidle' and 'pm-devfreq'
* pm-cpufreq:
  cpufreq: conservative: Allow down_threshold to take values from 1 to 10
  Revert "cpufreq: schedutil: Reduce frequencies slower"

* pm-cpuidle:
  cpuidle: dt: Add missing 'of_node_put()'

* pm-devfreq:
  PM / devfreq: exynos-ppmu: Staticize event list
  PM / devfreq: exynos-ppmu: Handle return value of clk_prepare_enable
  PM / devfreq: exynos-nocp: Handle return value of clk_prepare_enable
2017-06-15 01:51:33 +02:00
Rafael J. Wysocki
ff0a6d6f93 Revert "cpufreq: schedutil: Reduce frequencies slower"
Revert commit 39b64aa1c0 (cpufreq: schedutil: Reduce frequencies
slower) that introduced unintentional changes in behavior leading
to adverse effects on some systems.

Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-12 14:16:16 +02:00
Andy Lutomirski
252d2a4117 sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().

This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes.  Nonetheless, I think it
should be backported because it's trivial.  There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-11 10:58:17 +02:00
Marcin Nowakowski
f67abed585 sched/fair: Fix typo in printk message
'schedstats' kernel parameter should be set to enable/disable, so
correct the printk hint saying that it should be set to 'enable'
rather than 'enabled' to enable scheduler tracepoints.

Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-11 10:00:33 +02:00
Paul E. McKenney
d7d34d5e46 sched: Rely on synchronize_rcu_mult() de-duplication
The synchronize_rcu_mult() function now detects duplicate requests
for the same grace-period flavor and waits only once for each flavor.
This commit therefore removes the ugly #ifdef from sched_cpu_deactivate()
because synchronize_rcu_mult(call_rcu, call_rcu_sched) now does what
the #ifdef used to be needed for.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2017-06-08 08:25:39 -07:00
Aubrey Li
ebfa4c02fa sched/idle: Add deferrable vmstat_updater back
Deferrable vmstat_updater was missing in commit:

  c1de45ca83 ("sched/idle: Add support for tasks that inject idle")

Add it back.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496803742-38274-1-git-send-email-aubrey.li@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:32:09 +02:00
Nicolas Pitre
f5832c1998 sched/core: Omit building stop_sched_class when !SMP
The stop class is invoked through stop_machine only.
This is dead code on UP builds.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170529210302.26868-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:32:04 +02:00
Daniel Bristot de Oliveira
3effcb4247 sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks
We have been facing some problems with self-suspending constrained
deadline tasks. The main reason is that the original CBS was not
designed for such sort of tasks.

One problem reported by Xunlei Pang takes place when a task
suspends, and then is awakened before the deadline, but so close
to the deadline that its remaining runtime can cause the task
to have an absolute density higher than allowed. In such situation,
the original CBS assumes that the task is facing an early activation,
and so it replenishes the task and set another deadline, one deadline
in the future. This rule works fine for implicit deadline tasks.
Moreover, it allows the system to adapt the period of a task in which
the external event source suffered from a clock drift.

However, this opens the window for bandwidth leakage for constrained
deadline tasks. For instance, a task with the following parameters:

  runtime   = 5 ms
  deadline  = 7 ms
  [density] = 5 / 7 = 0.71
  period    = 1000 ms

If the task runs for 1 ms, and then suspends for another 1ms,
it will be awakened with the following parameters:

  remaining runtime = 4
  laxity = 5

presenting a absolute density of 4 / 5 = 0.80.

In this case, the original CBS would assume the task had an early
wakeup. Then, CBS will reset the runtime, and the absolute deadline will
be postponed by one relative deadline, allowing the task to run.

The problem is that, if the task runs this pattern forever, it will keep
receiving bandwidth, being able to run 1ms every 2ms. Following this
behavior, the task would be able to run 500 ms in 1 sec. Thus running
more than the 5 ms / 1 sec the admission control allowed it to run.

Trying to address the self-suspending case, Luca Abeni, Giuseppe
Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
self-suspending tasks. In the new approach, rather than
replenishing/postponing the absolute deadline, the revised wakeup rule
adjusts the remaining runtime, reducing it to fit into the allowed
density.

A revised version of the idea is:

At a given time t, the maximum absolute density of a task cannot be
higher than its relative density, that is:

  runtime / (deadline - t) <= dl_runtime / dl_deadline

Knowing the laxity of a task (deadline - t), it is possible to move
it to the other side of the equality, thus enabling to define max
remaining runtime a task can use within the absolute deadline, without
over-running the allowed density:

  runtime = (dl_runtime / dl_deadline) * (deadline - t)

For instance, in our previous example, the task could still run:

  runtime = ( 5 / 7 ) * 5
  runtime = 3.57 ms

Without causing damage for other deadline tasks. It is note worthy
that the laxity cannot be negative because that would cause a negative
runtime. Thus, this patch depends on the patch:

  df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline")

Which throttles a constrained deadline task activated after the
deadline.

Finally, it is also possible to use the revised wakeup rule for
all other tasks, but that would require some more discussions
about pros and cons.

Reported-by: Xunlei Pang <xpang@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[peterz: replaced dl_is_constrained with dl_is_implicit]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:32:03 +02:00
Xunlei Pang
ae83b56a56 sched/deadline: Zero out positive runtime after throttling constrained tasks
When a contrained task is throttled by dl_check_constrained_dl(),
it may carry the remaining positive runtime, as a result when
dl_task_timer() fires and calls replenish_dl_entity(), it will
not be replenished correctly due to the positive dl_se->runtime.

This patch assigns its runtime to 0 if positive after throttling.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline)
Link: http://lkml.kernel.org/r/1494421417-27550-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:58 +02:00
Luca Abeni
daec579836 sched/deadline: Reclaim bandwidth not used by dl tasks
This commit introduces a per-runqueue "extra utilization" that can be
reclaimed by deadline tasks. In this way, the maximum fraction of CPU
time that can reclaimed by deadline tasks is fixed (and configurable)
and does not depend on the total deadline utilization.
The GRUB accounting rule is modified to add this "extra utilization"
to the inactive utilization of the runqueue, and to avoid reclaiming
more than a maximum fraction of the CPU time.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-10-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:55 +02:00
Luca Abeni
9f0d1a5077 sched/deadline: Base GRUB reclaiming on the inactive utilization
Instead of decreasing the runtime as "dq = -Uact dt" (eventually
divided by the maximum utilization available for deadline tasks),
decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task
utilization and Uinact is the "inactive utilization".
In this way, the maximum fraction of CPU time that can be reclaimed
is given by the total utilization of deadline tasks.
This approach solves a fairness issue with "traditional" global GRUB
reclaiming: using the traditional GRUB algorithm, if tasks are
allocated to the various cores in a non-uniform way, the
reclaiming mechanism allows some tasks to reclaim more time than
others. This issue is visible starting 11 time-consuming tasks with
runtime 10ms and period 30ms (total utilization 3.666) on a 4-cores
system: some tasks will receive much more than the reserved runtime
(thanks to the reclaiming mechanism), while other tasks will receive
less than the reserved runtime.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-9-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:54 +02:00
Luca Abeni
8fd27231c3 sched/deadline: Track the "total rq utilization" too
The total rq utilization is defined as the sum of the utilisations of
tasks that are "assigned" to a runqueue, independently from their state
(TASK_RUNNING or blocked)

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-8-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:53 +02:00
Luca Abeni
2d4283e9d5 sched/deadline: Make GRUB a task's flag
This patch introduces the SCHED_FLAG_RECLAIM flag to specify
that a DL task is allowed to reclaim unused CPU time (using
the GRUB algorithm).

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-7-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:52 +02:00
Luca Abeni
4da3abcefe sched/deadline: Do not reclaim the whole CPU bandwidth
Original GRUB tends to reclaim 100% of the CPU time... And this
allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a
specified fraction of CPU time, stored in the new "bw_ratio"
field of the dl runqueue structure.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-6-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:51 +02:00
Luca Abeni
c52f14d384 sched/deadline: Implement GRUB accounting
According to the GRUB (Greedy Reclaimation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-5-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:51 +02:00
Luca Abeni
387e31300b sched/deadline: Fix the update of the total -deadline utilization
Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-4-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:50 +02:00
Luca Abeni
209a0cbda7 sched/deadline: Improve the tracking of active utilization
This patch implements a more theoretically sound algorithm for
tracking active utilization: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilization at the so called "0-lag time".

Tested-by: Claudio Scordino <claudio@evidence.eu.com>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-3-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:49 +02:00
Luca Abeni
e36d8677bf sched/deadline: Track the active utilization
Active utilization is defined as the total utilization of active
(TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.

When a task is migrated from CPUi to CPUj, immediately subtract the
task's utilization from CPUi and add it to CPUj. This mechanism is
implemented by modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilization should be removed from CPUi only at the 0 lag
time), a more theoretically sound solution is presented in the
next patches.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-2-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:27:56 +02:00
Peter Zijlstra
1ad3aaf3fc sched/core: Implement new approach to scale select_idle_cpu()
Hackbench recently suffered a bunch of pain, first by commit:

  4c77b18cf8 ("sched/fair: Make select_idle_cpu() more aggressive")

and then by commit:

  c743f0a5c5 ("sched/fair, cpumask: Export for_each_cpu_wrap()")

which fixed a bug in the initial for_each_cpu_wrap() implementation
that made select_idle_cpu() even more expensive. The bug was that it
would skip over CPUs when bits were consequtive in the bitmask.

This however gave me an idea to fix select_idle_cpu(); where the old
scheme was a cliff-edge throttle on idle scanning, this introduces a
more gradual approach. Instead of stopping to scan entirely, we limit
how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: kitsunyan <kitsunyan@inbox.ru>
Cc: linux-kernel@vger.kernel.org
Cc: lvenanci@redhat.com
Cc: riel@redhat.com
Cc: xiaolong.ye@intel.com
Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:25:17 +02:00
Peter Zijlstra
45aea32167 sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable()
The more strict early boot preemption warnings found that
__set_sched_clock_stable() was incorrectly assuming we'd still be
running on a single CPU:

  BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
  caller is debug_smp_processor_id+0x1c/0x1e
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2-00108-g1c3c5ea #1
  Call Trace:
   dump_stack+0x110/0x192
   check_preemption_disabled+0x10c/0x128
   ? set_debug_rodata+0x25/0x25
   debug_smp_processor_id+0x1c/0x1e
   sched_clock_init_late+0x27/0x87
  [...]

Fix it by disabling IRQs.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: lkp@01.org
Cc: tipbuild@zytor.com
Link: http://lkml.kernel.org/r/20170524065202.v25vyu7pvba5mhpd@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-24 09:10:00 +02:00
Thomas Gleixner
1c3c5eab17 sched/core: Enable might_sleep() and smp_processor_id() checks early
might_sleep() and smp_processor_id() checks are enabled after the boot
process is done. That hides bugs in the SMP bringup and driver
initialization code.

Enable it right when the scheduler starts working, i.e. when init task and
kthreadd have been created and right before the idle task enables
preemption.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170516184736.272225698@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:38 +02:00
Vlastimil Babka
8655d54977 sched/numa: Use down_read_trylock() for the mmap_sem
A customer has reported a soft-lockup when running an intensive
memory stress test, where the trace on multiple CPU's looks like this:

 RIP: 0010:[<ffffffff810c53fe>]
  [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
...
 Call Trace:
  [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
  [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock().

The task_numa_work() function makes sure that only one thread is let to perform
the work in a single scan period (via cmpxchg), but if there's a thread with
mmap_sem locked for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting for mmap_sem for read and then
all get unblocked at once.

This patch changes the down_read() to the trylock version, which prevents the
build up. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:34 +02:00
Dave Kleikamp
c249f255aa sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer()
With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially
takes each CPU's rq->lock. On a large, busy system, the cumulative time it
takes to acquire each lock can be excessive, even triggering a watchdog
timeout.

If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does
nothing while holding the lock, so don't bother taking it at all.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:34 +02:00
Steven Rostedt (VMware)
896bbb2522 sched/core: Allow __sched_setscheduler() in interrupts when PI is not used
When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it
added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI
is not a common occurrence, lockdep will likely never trigger if
sched_setscheduler was called from interrupt context. A BUG_ON() was added
to trigger if __sched_setscheduler() was ever called from interrupt context
because there was a possibility to take the wait_lock.

Today the wait_lock is irq safe, but the path to taking it in
sched_setscheduler() is the same as the path to taking it from normal
context. The wait_lock is taken with raw_spin_lock_irq() and released with
raw_spin_unlock_irq() which will indiscriminately enable interrupts,
which would be bad in interrupt context.

The problem is that normalize_rt_tasks, which is called by triggering the
sysrq nice-all-RT-tasks was changed to call __sched_setscheduler(), and this
is done from interrupt context!

Now __sched_setscheduler() takes a "pi" parameter that is used to know if
the priority inheritance should be called or not. As the BUG_ON() only cares
about calling the PI code, it should only bug if called from interrupt
context with the "pi" parameter set to true.

Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: dbc7f069b9 ("sched: Use replace normalize_task() with __sched_setscheduler()")
Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:34 +02:00
Byungchul Park
a776b968e5 sched/deadline: Remove unnecessary condition in push_dl_task()
pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task))
when it returns a task other than NULL, which means that task_cpu(task)
must be rq->cpu. So if task == next_task, then task_cpu(next_task) must
be rq->cpu as well. Remove the redundant condition and make the code simpler.

This way one unnecessary branch and two LOAD operations can be avoided.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494551159-22367-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:33 +02:00
Byungchul Park
de16b91eff sched/rt: Remove unnecessary condition in push_rt_task()
pick_next_pushable_task(rq) has BUG_ON(rq_cpu != task_cpu(task)) when
it returns a task other than NULL, which means that task_cpu(task) must
be rq->cpu. So if task == next_task, then task_cpu(next_task) must be
rq->cpu as well. Remove the redundant condition and make the code simpler.

This way one unnecessary branch and two LOAD operations can be avoided.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494551143-22219-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:33 +02:00
Byungchul Park
73215849df sched/core: Use the new llist_for_each_entry_safe() primitive
Now that we've added llist_for_each_entry_safe(), use it to simplify
an open coded version of it in sched_ttwu_pending().

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494549584-11730-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:33 +02:00
Ingo Molnar
386b554888 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 09:50:35 +02:00
Rafael J. Wysocki
079c1812a2 Merge branches 'intel_pstate', 'pm-cpufreq' and 'pm-cpufreq-sched'
* intel_pstate:
  cpufreq: intel_pstate: Document the current behavior and user interface

* pm-cpufreq:
  cpufreq: dbx500: add a Kconfig symbol

* pm-cpufreq-sched:
  cpufreq: schedutil: use now as reference when aggregating shared policy requests
2017-05-22 20:28:22 +02:00
Tejun Heo
a9e7f6544b sched/fair: Fix O(nr_cgroups) in load balance path
Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups) which isn't
scalable at all.

This shows up as a small CPU consumption and scheduling latency
increase in the load balancing path in systems with CPU controller
enabled across most cgroups.  In an edge case where temporary cgroups
were leaking, this caused the kernel to consume good several tens of
percents of CPU cycles running update_blocked_averages(), each run
taking multiple millisecs.

This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
[ Added cfs_rq_is_decayed() ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170426004350.GB3222@wtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 12:07:44 +02:00
Peter Zijlstra
502ce005ab sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
In order to allow leaf_cfs_rq_list to remove entries switch the
bandwidth hotplug code over to the task_groups list.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170504133122.a6qjlj3hlblbjxux@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:35 +02:00
Peter Zijlstra
ae4df9d6c9 sched/topology: Rename sched_group_cpus()
There's a discrepancy in naming between the sched_domain and
sched_group cpumask accessor. Since we're doing changes, fix it.

  $ git grep sched_group_cpus | wc -l
  28
  $ git grep sched_domain_span | wc -l
  38

Suggests changing sched_group_cpus() into sched_group_span():

  for i  in `git grep -l sched_group_cpus`
  do
    sed -ie 's/sched_group_cpus/sched_group_span/g' $i
  done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:34 +02:00
Peter Zijlstra
e5c14b1fb8 sched/topology: Rename sched_group_mask()
Since sched_group_mask() is now an independent cpumask (it no longer
masks sched_group_cpus()), rename the thing.

Suggested-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:33 +02:00
Peter Zijlstra
af218122b1 sched/topology: Simplify sched_group_mask() usage
While writing the comments, it occurred to me that:

  sg_cpus & sg_mask == sg_mask

at least conceptually; the !overlap case sets the all 1s mask. If we
correct that we can simplify things and directly use sg_mask.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:33 +02:00
Peter Zijlstra
0c0e776a9b sched/topology: Rewrite get_group()
We want to attain:

  sg_cpus() & sg_mask() == sg_mask()

for this to be so we must initialize sg_mask() to sg_cpus() for the
!overlap case (its currently cpumask_setall()).

Since the code makes my head hurt bad, rewrite it into a simpler form,
inspired by the now fixed overlap code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:32 +02:00
Peter Zijlstra
35a566e6e8 sched/topology: Add a few comments
Try and describe what this code is about..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:31 +02:00
Peter Zijlstra
1676330ecf sched/topology: Fix overlapping sched_group_capacity
When building the overlapping groups we need to attach a consistent
sched_group_capacity structure. That is, all 'identical' sched_group's
should have the _same_ sched_group_capacity.

This can (once again) be demonstrated with a topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

But we need at least 2 CPUs per node for this to show up, after all,
if there is only one CPU per node, our CPU @i is per definition a
unique CPU that reaches this domain (aka balance-cpu).

Given the above NUMA topo and 2 CPUs per node:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
  [] CPU1 attaching sched-domain(s):
  []  domain-0: span=1,5 level=DIE
  []   groups: 1:{ span=1 }, 5:{ span=5 }
  []   domain-1: span=0-2,4-6 level=NUMA
  []    groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 4:{ span=0,4 mask=0,4 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }

Observe how CPU0-domain1-group0 and CPU1-domain1-group4 are the
'same' but have a different id (0 vs 4).

To fix this, use the group balance CPU to select the SGC. This means
we have to compute the full mask for each CPU and require a second
temporary mask to store the group mask in (it otherwise lives in the
SGC).

The fixed topology looks like:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
  [] CPU1 attaching sched-domain(s):
  []  domain-0: span=1,5 level=DIE
  []   groups: 1:{ span=1 }, 5:{ span=5 }
  []   domain-1: span=0-2,4-6 level=NUMA
  []    groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 0:{ span=0,4 mask=0,4 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:30 +02:00
Peter Zijlstra
005f874dd2 sched/topology: Add sched_group_capacity debugging
Add sgc::id to easier spot domain construction issues.

Take the opportunity to slightly rework the group printing, because
adding more "(id: %d)" strings makes the entire thing very hard to
read. Also the individual groups are very hard to separate, so add
explicit visual grouping, which allows replacing all the "(%s: %d)"
format things with shorter "%s=%d" variants.

Then fix up some inconsistencies in surrounding prints for domains.

The end result looks like:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:30 +02:00
Peter Zijlstra
8d5dc5126b sched/topology: Small cleanup
Move the allocation of topology specific cpumasks into the topology
code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:29 +02:00
Peter Zijlstra
73bb059f9b sched/topology: Fix overlapping sched_group_mask
The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

(note: this relies on OVERLAP domains to always have children, this is
 true because the regular topology domains are still here -- this is
 before degenerate trimming)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:28 +02:00
Peter Zijlstra
af85596c74 sched/topology: Remove FORCE_SD_OVERLAP
Its an obsolete debug mechanism and future code wants to rely on
properties this undermines.

Namely, it would be good to assume that SD_OVERLAP domains have
children, but if we build the entire hierarchy with SD_OVERLAP this is
obviously false.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:28 +02:00
Lauro Ramos Venancio
c20e1ea4b6 sched/topology: Move comment about asymmetric node setups
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:27 +02:00
Lauro Ramos Venancio
f32d782e31 sched/topology: Optimize build_group_mask()
The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:26 +02:00
Peter Zijlstra
a420b06303 sched/topology: Verify the first group matches the child domain
We want sched_groups to be sibling child domains (or individual CPUs
when there are no child domains). Furthermore, since the first group
of a domain should include the CPU of that domain, the first group of
each domain should match the child domain.

Verify this is indeed so.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:26 +02:00
Peter Zijlstra
b0151c2554 sched/debug: Print the scheduler topology group mask
In order to determine the balance_cpu (for should_we_balance()) we need
the sched_group_mask() for overlapping domains.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:25 +02:00
Peter Zijlstra
91eaed0d61 sched/topology: Simplify build_overlap_sched_groups()
Now that the first group will always be the previous domain of this
@cpu this can be simplified.

In fact, writing the code now removed should've been a big clue I was
doing it wrong :/

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:24 +02:00
Peter Zijlstra
0372dd2736 sched/topology: Fix building of overlapping sched-groups
When building the overlapping groups, we very obviously should start
with the previous domain of _this_ @cpu, not CPU-0.

This can be readily demonstrated with a topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) CPU1 ends up generating the following nonsensical groups:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 2 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)

Where the fact that domain 1 doesn't include a group with span 0-2 is
the obvious fail.

With patch this looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 0 2
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:23 +02:00
Peter Zijlstra
c743f0a5c5 sched/fair, cpumask: Export for_each_cpu_wrap()
More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:23 +02:00
Lauro Ramos Venancio
8c0334697d sched/topology: Refactor function build_overlap_sched_groups()
Create functions build_group_from_child_sched_domain() and
init_overlap_sched_group(). No functional change.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:22 +02:00
Peter Zijlstra
7708d5f04d sched/clock: Print a warning recommending 'tsc=unstable'
With our switch to stable delayed until late_initcall(), the most
likely cause of hitting mark_tsc_unstable() is the watchdog. The
watchdog typically only triggers when creative BIOS'es fiddle with the
TSC to hide SMI latency.

Since the watchdog can only detect TSC fiddling after the fact all TSC
clocks (including userspace GTOD) can already have reported funny
values.

The only way to fully avoid this, is manually marking the TSC unstable
at boot. Suggest people do this on their broken systems.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:21 +02:00
Peter Zijlstra
2e44b7ddf8 sched/clock: Use late_initcall() instead of sched_init_smp()
Core2 marks its TSC unstable in ACPI Processor Idle, which is probed
after sched_init_smp(). Luckily it appears both acpi_processor and
intel_idle (which has a similar check) are mandatory built-in.

This means we can delay switching to stable until after these drivers
have ran (if they were modules, this would be impossible).

Delay the stable switch to late_initcall() to allow these drivers to
mark TSC unstable and avoid difficult stable->unstable transitions.

Reported-by: Lofstedt, Marta <marta.lofstedt@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:21 +02:00
Peter Zijlstra
f9fccdb9ef cpuidle: Fix idle time tracking
Ville reported that on his Core2, which has TSC stop in idle, we would
always report very short idle durations. He tracked this down to
commit:

  e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")

which replaces ktime_get() with local_clock().

Add a sched_clock_idle_wakeup_event() call, which will re-sync the
clock with ktime_get_ns() when TSC is unstable and no-op otherwise.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:20 +02:00
Peter Zijlstra
3067a33d5f sched/clock: Remove watchdog touching
Commit:

  2bacec8c31 ("sched: touch softlockup watchdog after idling")

introduced the touch_softlockup_watchdog_sched() call without
justification and I feel sched_clock management is not the right
place, it should only be concerned with producing semi coherent time.

If this causes watchdog thingies, we can find a better place.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:19 +02:00
Peter Zijlstra
ac1e843f09 sched/clock: Remove unused argument to sched_clock_idle_wakeup_event()
The argument to sched_clock_idle_wakeup_event() has not been used in a
long time. Remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:18 +02:00
Peter Zijlstra
b421b22b00 x86/tsc, sched/clock, clocksource: Use clocksource watchdog to provide stable sync points
Currently we keep sched_clock_tick() active for stable TSC in order to
keep the per-CPU state semi up-to-date. The (obvious) problem is that
by the time we detect TSC is borked, our per-CPU state is also borked.

So hook into the clocksource watchdog and call a method after we've
found it to still be stable.

There's the obvious race where the TSC goes wonky between finding it
stable and us running the callback, but closing that is too much work
and not really worth it, since we're already detecting TSC wobbles
after the fact, so we cannot, per definition, fully avoid funny clock
values.

And since the watchdog runs less often than the tick, this is also an
optimization.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:18 +02:00
Peter Zijlstra
cf15ca8ded sched/clock: Initialize all per-CPU state before switching (back) to unstable
In preparation for not keeping the sched_clock_tick() active for
stable TSC, we need to explicitly initialize all per-CPU state
before switching back to unstable.

Note: this patch looses the __gtod_offset calculation; it will be
restored in the next one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:17 +02:00
Vincent Guittot
625ed2bf04 sched/cfs: Make util/load_avg more stable
In the current implementation of load/util_avg, we assume that the
ongoing time segment has fully elapsed, and util/load_sum is divided
by LOAD_AVG_MAX, even if part of the time segment still remains to
run. As a consequence, this remaining part is considered as idle time
and generates unexpected variations of util_avg of a busy CPU in the
range [1002..1024[ whereas util_avg should stay at 1023.

In order to keep the metric stable, we should not consider the ongoing
time segment when computing load/util_avg but only the segments that
have already fully elapsed. But to not consider the current time
segment adds unwanted latency in the load/util_avg responsivness
especially when the time is scaled instead of the contribution.

Instead of waiting for the current time segment to have fully elapsed
before accounting it in load/util_avg, we can already account the
elapsed part but change the range used to compute load/util_avg
accordingly.

At the very beginning of a new time segment, the past segments have
been decayed and the max value is LOAD_AVG_MAX*y. At the very end of
the current time segment, the max value becomes:

  LOAD_AVG_MAX*y + 1024(us)  (== LOAD_AVG_MAX)

In fact, the max value is:

  LOAD_AVG_MAX*y + sa->period_contrib

at any time in the time segment.

Taking advantage of the fact that:

  LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024

the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].

As the elapsed part is already accounted in load/util_sum, we update
the max value according to the current position in the time segment
instead of removing its contribution.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:13 +02:00
Steven Rostedt (VMware)
8663effb24 sched/core: Call __schedule() from do_idle() without enabling preemption
I finally got around to creating trampolines for dynamically allocated
ftrace_ops with using synchronize_rcu_tasks(). For users of the ftrace
function hook callbacks, like perf, that allocate the ftrace_ops
descriptor via kmalloc() and friends, ftrace was not able to optimize
the functions being traced to use a trampoline because they would also
need to be allocated dynamically. The problem is that they cannot be
freed when CONFIG_PREEMPT is set, as there's no way to tell if a task
was preempted on the trampoline. That was before Paul McKenney
implemented synchronize_rcu_tasks() that would make sure all tasks
(except idle) have scheduled out or have entered user space.

While testing this, I triggered this bug:

 BUG: unable to handle kernel paging request at ffffffffa0230077
 ...
 RIP: 0010:0xffffffffa0230077
 ...
 Call Trace:
  schedule+0x5/0xe0
  schedule_preempt_disabled+0x18/0x30
  do_idle+0x172/0x220

What happened was that the idle task was preempted on the trampoline.
As synchronize_rcu_tasks() ignores the idle thread, there's nothing
that lets ftrace know that the idle task was preempted on a trampoline.

The idle task shouldn't need to ever enable preemption. The idle task
is simply a loop that calls schedule or places the cpu into idle mode.
In fact, having preemption enabled is inefficient, because it can
happen when idle is just about to call schedule anyway, which would
cause schedule to be called twice. Once for when the interrupt came in
and was returning back to normal context, and then again in the normal
path that the idle loop is running in, which would be pointless, as it
had already scheduled.

The only reason schedule_preempt_disable() enables preemption is to be
able to call sched_submit_work(), which requires preemption enabled. As
this is a nop when the task is in the RUNNING state, and idle is always
in the running state, there's no reason that idle needs to enable
preemption. But that means it cannot use schedule_preempt_disable() as
other callers of that function require calling sched_submit_work().

Adding a new function local to kernel/sched/ that allows idle to call
the scheduler without enabling preemption, fixes the
synchronize_rcu_tasks() issue, as well as removes the pointless spurious
schedule calls caused by interrupts happening in the brief window where
preemption is enabled just before it calls schedule.

Reviewed: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:09:12 +02:00
Linus Torvalds
de4d195308 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes are:

   - Debloat RCU headers

   - Parallelize SRCU callback handling (plus overlapping patches)

   - Improve the performance of Tree SRCU on a CPU-hotplug stress test

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Open-code the rcu_cblist_n_lazy_cbs() function
  rcu: Open-code the rcu_cblist_n_cbs() function
  rcu: Open-code the rcu_cblist_empty() function
  rcu: Separately compile large rcu_segcblist functions
  srcu: Debloat the <linux/rcu_segcblist.h> header
  srcu: Adjust default auto-expediting holdoff
  srcu: Specify auto-expedite holdoff time
  srcu: Expedite first synchronize_srcu() when idle
  srcu: Expedited grace periods with reduced memory contention
  srcu: Make rcutorture writer stalls print SRCU GP state
  srcu: Exact tracking of srcu_data structures containing callbacks
  srcu: Make SRCU be built by default
  srcu: Fix Kconfig botch when SRCU not selected
  rcu: Make non-preemptive schedule be Tasks RCU quiescent state
  srcu: Expedite srcu_schedule_cbs_snp() callback invocation
  srcu: Parallelize callback handling
  kvm: Move srcu_struct fields to end of struct kvm
  rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
  rcu: Use true/false in assignment to bool
  rcu: Use bool value directly
  ...
2017-05-10 10:30:46 -07:00
Juri Lelli
d86ab9cff8 cpufreq: schedutil: use now as reference when aggregating shared policy requests
Currently, sugov_next_freq_shared() uses last_freq_update_time as a
reference to decide when to start considering CPU contributions as
stale.

However, since last_freq_update_time is set by the last CPU that issued
a frequency transition, this might cause problems in certain cases. In
practice, the detection of stale utilization values fails whenever the
CPU with such values was the last to update the policy. For example (and
please note again that the SCHED_CPUFREQ_RT flag is not the problem
here, but only the detection of after how much time that flag has to be
considered stale), suppose a policy with 2 CPUs:

               CPU0                |               CPU1
                                   |
                                   |     RT task scheduled
                                   |     SCHED_CPUFREQ_RT is set
                                   |     CPU1->last_update = now
                                   |     freq transition to max
                                   |     last_freq_update_time = now
                                   |

                        more than TICK_NSEC nsecs

                                   |
     a small CFS wakes up          |
     CPU0->last_update = now1      |
     delta_ns(CPU0) < TICK_NSEC*   |
     CPU0's util is considered     |
     delta_ns(CPU1) =              |
      last_freq_update_time -      |
      CPU1->last_update = 0        |
      < TICK_NSEC                  |
     CPU1 is still considered      |
     CPU1->SCHED_CPUFREQ_RT is set |
     we stay at max (until CPU1    |
     exits from idle)              |

* delta_ns is actually negative as now1 > last_freq_update_time

While last_freq_update_time is a sensible reference for rate limiting,
it doesn't seem to be useful for working around stale CPU states.

Fix the problem by always considering now (time) as the reference for
deciding when CPUs have stale contributions.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-05-05 23:34:32 +02:00
Linus Torvalds
89c9fea3c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  tty: fix comment for __tty_alloc_driver()
  init/main: properly align the multi-line comment
  init/main: Fix double "the" in comment
  Fix dead URLs to ftp.kernel.org
  drivers: Clean up duplicated email address
  treewide: Fix typo in xml/driver-api/basics.xml
  tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
  selftests/timers: Spelling s/privledges/privileges/
  HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
  net: phy: dp83848: Fix Typo
  UBI: Fix typos
  Documentation: ftrace.txt: Correct nice value of 120 priority
  net: fec: Fix typo in error msg and comment
  treewide: Fix typos in printk
2017-05-02 19:09:35 -07:00
Linus Torvalds
76f1948a79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatch updates from Jiri Kosina:

 - a per-task consistency model is being added for architectures that
   support reliable stack dumping (extending this, currently rather
   trivial set, is currently in the works).

   This extends the nature of the types of patches that can be applied
   by live patching infrastructure. The code stems from the design
   proposal made [1] back in November 2014. It's a hybrid of SUSE's
   kGraft and RH's kpatch, combining advantages of both: it uses
   kGraft's per-task consistency and syscall barrier switching combined
   with kpatch's stack trace switching. There are also a number of
   fallback options which make it quite flexible.

   Most of the heavy lifting done by Josh Poimboeuf with help from
   Miroslav Benes and Petr Mladek

   [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

 - module load time patch optimization from Zhou Chengming

 - a few assorted small fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: add missing printk newlines
  livepatch: Cancel transition a safe way for immediate patches
  livepatch: Reduce the time of finding module symbols
  livepatch: make klp_mutex proper part of API
  livepatch: allow removal of a disabled patch
  livepatch: add /proc/<pid>/patch_state
  livepatch: change to a per-task consistency model
  livepatch: store function sizes
  livepatch: use kstrtobool() in enabled_store()
  livepatch: move patching functions into patch.c
  livepatch: remove unnecessary object loaded check
  livepatch: separate enabled and patched states
  livepatch/s390: add TIF_PATCH_PENDING thread flag
  livepatch/s390: reorganize TIF thread flag bits
  livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  livepatch/x86: add TIF_PATCH_PENDING thread flag
  livepatch: create temporary klp_update_patch_state() stub
  x86/entry: define _TIF_ALLWORK_MASK flags explicitly
  stacktrace/x86: add function for detecting reliable stack traces
2017-05-02 18:24:16 -07:00
Linus Torvalds
207fb8c304 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - a big round of FUTEX_UNLOCK_PI improvements, fixes, cleanups and
     general restructuring

   - lockdep updates such as new checks for lock_downgrade()

   - introduce the new atomic_try_cmpxchg() locking API and use it to
     optimize refcount code generation

   - ... plus misc fixes, updates and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  MAINTAINERS: Add FUTEX SUBSYSTEM
  futex: Clarify mark_wake_futex memory barrier usage
  futex: Fix small (and harmless looking) inconsistencies
  futex: Avoid freeing an active timer
  rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()
  rtmutex: Fix more prio comparisons
  rtmutex: Fix PI chain order integrity
  sched,tracing: Update trace_sched_pi_setprio()
  sched/rtmutex: Refactor rt_mutex_setprio()
  rtmutex: Clean up
  sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  rtmutex: Deboost before waking up the top waiter
  locking/ww-mutex: Limit stress test to 2 seconds
  locking/atomic: Fix atomic_try_cmpxchg() semantics
  lockdep: Fix per-cpu static objects
  futex: Drop hb->lock before enqueueing on the rtmutex
  futex: Futex_unlock_pi() determinism
  futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  ...
2017-05-01 19:36:00 -07:00
Linus Torvalds
3527d3e951 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - another round of rq-clock handling debugging, robustization and
     fixes

   - PELT accounting improvements

   - CPU hotplug related ->cpus_allowed affinity handling fixes all
     around the tree

   - ... plus misc fixes, cleanups and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  sched/x86: Update reschedule warning text
  crypto: N2 - Replace racy task affinity logic
  cpufreq/sparc-us2e: Replace racy task affinity logic
  cpufreq/sparc-us3: Replace racy task affinity logic
  cpufreq/sh: Replace racy task affinity logic
  cpufreq/ia64: Replace racy task affinity logic
  ACPI/processor: Replace racy task affinity logic
  ACPI/processor: Fix error handling in __acpi_processor_start()
  sparc/sysfs: Replace racy task affinity logic
  powerpc/smp: Replace open coded task affinity logic
  ia64/sn/hwperf: Replace racy task affinity logic
  ia64/salinfo: Replace racy task affinity logic
  workqueue: Provide work_on_cpu_safe()
  ia64/topology: Remove cpus_allowed manipulation
  sched/fair: Move the PELT constants into a generated header
  sched/fair: Increase PELT accuracy for small tasks
  sched/fair: Fix comments
  sched/Documentation: Add 'sched-pelt' tool
  sched/fair: Fix corner case in __accumulate_sum()
  sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
  ...
2017-05-01 19:12:53 -07:00
Linus Torvalds
0e285e9088 Power management updates for v4.12-rc1
- Rework the intel_pstate driver's sysfs interface to make it
    more straightforward and more intuitive (Rafael Wysocki).
 
  - Make intel_pstate support all processors which advertise HWP
    (hardware-managed P-states) to the kernel in all operation modes
    and make it use the load-based P-state selection algorithm on a
    wider range of systems in the active mode (Rafael Wysocki).
 
  - Add cpufreq driver for Tegra186 (Mikko Perttunen).
 
  - Add support for Gemini Lake SoCs to intel_pstate (David Box).
 
  - Add support for MT8176 and MT817x to the Mediatek cpufreq driver
    and clean up that driver a bit (Daniel Kurtz).
 
  - Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
 
  - Update the schedutil cpufreq governor, mostly to fix a couple of
    issues with it related to specific workloads, and rework its sysfs
    tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
 
  - Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
    (Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
    YuanTian Tang).
 
  - Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
    Uytterhoeven).
 
  - Add support for "always on" power domains to the genpd (generic
    power domains) framework and clean up that code somewhat (Ulf
    Hansson, Lina Iyer, Viresh Kumar).
 
  - Fix minor issues in the powernv cpuidle driver and clean it up
    (Anton Blanchard, Gautham Shenoy).
 
  - Move the AnalyzeSuspend utility under tools/power/pm-graph/ and
    add an analogous boot-profiling utility called AnalyzeBoot to it
    (Todd Brandt).
 
  - Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
    Scaling) driver (David Wu).
 
  - Fix minor issues in the cpuidle core, the intel_pstate_tracer
    utility, the devfreq framework and the PM core documentation
    (Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZB5gGAAoJEILEb/54YlRx+78QAJRu+xAA9KtW2+loNyV8iBOB
 EFmQLrvz9jCDyYWsHE5huA1k6EVu5QE74HBfgDn4od9s1VqU1zWdEjqKYiaMwlCt
 EHxYCZ4YKeF31O3P3CtearBz9IXrckRx/XZ3F1jRsGGWooWv7o3U6PPN9iREmCzi
 9dB2j0UD4lCwrnpsDMrJ0GqLu4agn9pcIKDtu4VfszVwYtza0vOQvvlgg1fQS1jX
 BnNfaxN0lmpSlxDjtWfM//hfLzEWK8NlHiKWJFPnWFxJIAX1QL2QKnznF/Tqi5N5
 el9tQXCTRujlD7BLyl6FdsaowbiUUEcMqeh2k01Vz20WSmNDHIpfrV9MtzJ7biUK
 /DopyShUrpgJclKNF7BARJAJc19+PMkv3HMnOnsnhOsBNBpJjsL6FZPA95MMjI0o
 xmRQkixl31NWIMk60EIC6PaMLsxjhAWjiYi5D93bzkhnJTnQswvNb4ROPG+X2FCU
 6YgBogsYKkqp93TTJ49OFXIvu3o7NwOyEQQW8mnNY8ffaFdWuGzOX4HkOoKHCMTD
 rT0qT/2q+7LPK87YwTPIVtpGVltnCr/SVI/FtAGXPysyghu2Z+4GGP4eh2+pSAUj
 7Dqxdw3q7zs8ou6LOThTyNJrR+N/w1JPloprBleJR3TNcJjOy/SBjfp7yL0uopIK
 j5exr76ImVq7zzjyrYoa
 =6M3e
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "This time the majority of changes go to the cpufreq subsystem (and to
  the intel_pstate driver in particular) and there are some updates in
  the generic power domains framework, cpuidle, tools and a couple of
  other places.

  One thing worth mentioning is that the intel_pstate's sysfs interface
  has been reworked to be more consistent with the general expectations
  of the cpufreq core and less confusing, hopefully for the better.
  Also, we have a new cpufreq driver for Tegra186 and new hardware
  support in intel_pstata and the Mediatek cpufreq driver.

  Apart from that, the AnalyzeSuspend utility for system suspend
  profiling gets a companion called AnalyzeBoot for the analogous
  profiling of system boot and they both go into one place under
  tools/power/pm-graph/.

  The rest is mostly fixes, cleanups and code reorganization.

  Specifics:

   - Rework the intel_pstate driver's sysfs interface to make it more
     straightforward and more intuitive (Rafael Wysocki).

   - Make intel_pstate support all processors which advertise HWP
     (hardware-managed P-states) to the kernel in all operation modes
     and make it use the load-based P-state selection algorithm on a
     wider range of systems in the active mode (Rafael Wysocki).

   - Add cpufreq driver for Tegra186 (Mikko Perttunen).

   - Add support for Gemini Lake SoCs to intel_pstate (David Box).

   - Add support for MT8176 and MT817x to the Mediatek cpufreq driver
     and clean up that driver a bit (Daniel Kurtz).

   - Clean up intel_pstate and optimize it slightly (Rafael Wysocki).

   - Update the schedutil cpufreq governor, mostly to fix a couple of
     issues with it related to specific workloads, and rework its sysfs
     tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).

   - Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
     (Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
     YuanTian Tang).

   - Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
     Uytterhoeven).

   - Add support for "always on" power domains to the genpd (generic
     power domains) framework and clean up that code somewhat (Ulf
     Hansson, Lina Iyer, Viresh Kumar).

   - Fix minor issues in the powernv cpuidle driver and clean it up
     (Anton Blanchard, Gautham Shenoy).

   - Move the AnalyzeSuspend utility under tools/power/pm-graph/ and add
     an analogous boot-profiling utility called AnalyzeBoot to it (Todd
     Brandt).

   - Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
     Scaling) driver (David Wu).

   - Fix minor issues in the cpuidle core, the intel_pstate_tracer
     utility, the devfreq framework and the PM core documentation
     (Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski)"

* tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (56 commits)
  PM / runtime: Document autosuspend-helper side effects
  PM / runtime: Fix autosuspend documentation
  tools: power: pm-graph: Package makefile and man pages
  tools: power: pm-graph: AnalyzeBoot v2.0
  tools: power: pm-graph: AnalyzeSuspend v4.6
  cpufreq: Add Tegra186 cpufreq driver
  cpufreq: imx6q: Fix error handling code
  cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
  cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
  cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
  cpuidle: powernv: Don't continually set thread priority in snooze_loop()
  cpuidle: powernv: Don't bounce between low and very low thread priority
  cpuidle: cpuidle-cps: remove unused variable
  tools/power/x86/intel_pstate_tracer: Adjust directory ownership
  cpufreq: schedutil: Use policy-dependent transition delays
  cpufreq: schedutil: Reduce frequencies slower
  PM / devfreq: Move struct devfreq_governor to devfreq directory
  PM / Domains: Ignore domain-idle-states that are not compatible
  cpufreq: intel_pstate: Add support for Gemini Lake
  powernv-cpuidle: Validate DT property array size
  ...
2017-05-01 14:09:46 -07:00
Linus Torvalds
9410091dd5 Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing major. Two notable fixes are Li's second stab at fixing the
  long-standing race condition in the mount path and suppression of
  spurious warning from cgroup_get(). All other changes are trivial"

* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: mark cgroup_get() with __maybe_unused
  cgroup: avoid attaching a cgroup root to two different superblocks, take 2
  cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
  cgroup: move cgroup_subsys_state parent field for cache locality
  cpuset: Remove cpuset_update_active_cpus()'s parameter.
  cgroup: switch to BUG_ON()
  cgroup: drop duplicate header nsproxy.h
  kernel: convert css_set.refcount from atomic_t to refcount_t
  kernel: convert cgroup_namespace.count from atomic_t to refcount_t
2017-05-01 13:52:24 -07:00
Rafael J. Wysocki
2addac72af Merge schedutil governor updates for v4.12. 2017-04-28 23:13:33 +02:00
Frederic Weisbecker
25e2d8c1b9 sched/cputime: Fix ksoftirqd cputime accounting regression
irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to substract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would substract its own CPU time everytime
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.

But this behaviour got broken by:

  a499a5a14d ("sched/cputime: Increment kcpustat directly on irqtime account")

... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().

This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.

ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().

To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.

Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-27 09:08:26 +02:00
Ingo Molnar
58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
Paul E. McKenney
bcbfdd01dc rcu: Make non-preemptive schedule be Tasks RCU quiescent state
Currently, a call to schedule() acts as a Tasks RCU quiescent state
only if a context switch actually takes place.  However, just the
call to schedule() guarantees that the calling task has moved off of
whatever tracing trampoline that it might have been one previously.
This commit therefore plumbs schedule()'s "preempt" parameter into
rcu_note_context_switch(), which then records the Tasks RCU quiescent
state, but only if this call to schedule() was -not- due to a preemption.

To avoid adding overhead to the common-case context-switch path,
this commit hides the rcu_note_context_switch() check under an existing
non-common-case check.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-21 05:59:27 -07:00
Rafael J. Wysocki
1b72e7fd30 cpufreq: schedutil: Use policy-dependent transition delays
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).

That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.

Make intel_pstate set transition_delay_us to 500.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-04-17 18:37:27 +02:00
Ingo Molnar
0ba78a95a6 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:29:40 +02:00
Peter Zijlstra
283e2ed399 sched/fair: Move the PELT constants into a generated header
Now that we have a tool to generate the PELT constants in C form,
use its output as a separate header.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:37 +02:00
Peter Zijlstra
bb0bd044e6 sched/fair: Increase PELT accuracy for small tasks
We truncate (and loose) the lower 10 bits of runtime in
___update_load_avg(), this means there's a consistent bias to
under-account tasks. This is esp. significant for small tasks.

Cure this by only forwarding last_update_time to the point we've
actually accounted for, leaving the remainder for the next time.

Reported-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:36 +02:00
Peter Zijlstra
3841cdc310 sched/fair: Fix comments
Historically our periods (or p) argument in PELT denoted the number of
full periods (what is now d2). However recent patches have changed
this to the total decay (previously p+1), leading to a confusing
discrepancy between comments and code.

Try and clarify things by making periods (in code) and p (in comments)
be the same thing (again).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:36 +02:00
Peter Zijlstra
05296e7535 sched/fair: Fix corner case in __accumulate_sum()
Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
__accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
incorrect.

This is because at this point, the decay_load() on the old state --
the first step in accumulate_sum() -- will not have resulted in 0, and
will therefore result in a sum larger than the maximum value of our
series. Obviously broken.

Note that:

	decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) =

                1   (345 / 32)
	47742 * - ^            = ~27
                2

Not to mention that any further contribution from the d3 segment (our
new period) would also push it over the maximum.

Solve this by noting that we can write our c2 term:

		    p
	c2 = 1024 \Sum y^n
		   n=1

In terms of our maximum value:

		    inf		      inf	  p
	max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
		    n=0		      n=p+1	 n=1

Further note that:

           inf              inf            inf
        ( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
           n=0              n=0            n=p

Combined that gives us:

		    p
	c2 = 1024 \Sum y^n
		   n=1

		     inf        inf
	   = 1024 ( \Sum y^n - \Sum y^n - y^0 )
		     n=0        n=p+1

	   = max - (max y^(p+1)) - 1024

Further simplify things by dealing with p=0 early on.

Reported-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: linux-kernel@vger.kernel.org
Fixes: a481db34b9 ("sched/fair: Optimize ___update_sched_avg()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:34 +02:00
Rafael J. Wysocki
39b64aa1c0 cpufreq: schedutil: Reduce frequencies slower
The schedutil governor reduces frequencies too fast in some
situations which cases undesirable performance drops to
appear.

To address that issue, make schedutil reduce the frequency slower by
setting it to the average of the value chosen during the previous
iteration of governor computations and the new one coming from its
frequency selection formula.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=194963
Reported-by: John <john.ettedgui@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-04-13 03:46:40 +02:00
Ingo Molnar
36a4dfc378 Linux 4.11-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJY6mY1AAoJEHm+PkMAQRiGB14IAImsH28JPjxJVDasMIRPBxVc
 euPPlZgoBieu7sNt+kEsEqdkXuu0MLk6gln0IGxWLeoB2S+u3Tz5LMa2YArVqV9Z
 tWzOnI9auE73P2Pz/tUMOdyMs5tO0PolQxX3uljbULBozOHjHRh13fsXchX2yQvl
 mFeFCDqpPV0KhWRH/ciA8uIHdvYPhMpkKgRtmR8jXL0yzqLp6+2J+Bs8nHG4NNng
 HMVxZPC8jOE/TgWq6k/GmXgxh3H/AideFdHFbLKYnIFJW41ZGOI8a262zq3NmjPd
 lywpVU7O7RMhSITY5PnuR3LpNV8ftw1hz2y6t35unyFK1P02adOSj5GJ3hGdhaQ=
 =Xz5O
 -----END PGP SIGNATURE-----

Merge tag 'v4.11-rc6' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 09:05:36 +02:00
Rakib Mullick
30e03acda5 cpuset: Remove cpuset_update_active_cpus()'s parameter.
In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove
it.

Signed-off-by: Rakib Mullick<rakib.mullick@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-11 08:57:54 +09:00
Peter Zijlstra
b91473ff6e sched,tracing: Update trace_sched_pi_setprio()
Pass the PI donor task, instead of a numerical priority.

Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.

Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.353599881@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Peter Zijlstra
acd58620e4 sched/rtmutex: Refactor rt_mutex_setprio()
With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone, therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, this could lead to a stale pointer.

As for the XXX, I think its fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is immenent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Xunlei Pang
e96a7705e7 sched/rtmutex/deadline: Fix a PI crash for deadline tasks
A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating pi waiters, while
rt_mutex_get_top_task(), will access them with rq lock held but
not holding pi_lock.

In order to tackle it, we introduce new "pi_top_task" pointer
cached in task_struct, and add new rt_mutex_update_top_task()
to update its value, it can be called by rt_mutex_setprio()
which held both owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.

Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Linus Torvalds
128c434a70 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "This update provides:

   - make the scheduler clock switch to unstable mode smooth so the
     timestamps stay at microseconds granularity instead of switching to
     tick granularity.

   - unbreak perf test tsc by taking the new offset into account which
     was added in order to proveide better sched clock continuity

   - switching sched clock to unstable mode runs all clock related
     computations which affect the sched clock output itself from a work
     queue. In case of preemption sched clock uses half updated data and
     provides wrong timestamps. Keep the math in the protected context
     and delegate only the static key switch to workqueue context.

   - remove a duplicate header include"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/headers: Remove duplicate #include <linux/sched/debug.h> line
  sched/clock: Fix broken stable to unstable transfer
  sched/clock, x86/perf: Fix "perf test tsc"
  sched/clock: Fix clear_sched_clock_stable() preempt wobbly
2017-04-02 09:25:10 -07:00
Yuyang Du
a481db34b9 sched/fair: Optimize ___update_sched_avg()
The main PELT function ___update_load_avg(), which implements the
accumulation and progression of the geometric average series, is
implemented along the following lines for the scenario where the time
delta spans all 3 possible sections (see figure below):

  1. add the remainder of the last incomplete period
  2. decay old sum
  3. accumulate new sum in full periods since last_update_time
  4. accumulate the current incomplete period
  5. update averages

Or:

            d1          d2           d3
            ^           ^            ^
            |           |            |
          |<->|<----------------->|<--->|
  ... |---x---|------| ... |------|-----x (now)

  load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)

                                        p
	      weight * scale * 1024 * \Sum y^n +		(3)
                                       n=1

	      weight * scale * d3 * y^0				(4)

  load_avg' = load_sum' / LOAD_AVG_MAX				(5)

Where:

 d1 - is the delta part completing the remainder of the last
      incomplete period,
 d2 - is the delta part spannind complete periods, and
 d3 - is the delta part starting the current incomplete period.

We can simplify the code in two steps; the first step is to separate
the first term into new and old parts like:

  (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
					       weight * scale * d1 * y^(p+1)

Once we've done that, its easy to see that all new terms carry the
common factors:

  weight * scale

If we factor those out, we arrive at the form:

  load_sum' = load_sum * y^(p+1) +

	      weight * scale * (d1 * y^(p+1) +

					 p
			        1024 * \Sum y^n +
					n=1

				d3 * y^0)

Which results in a simpler, smaller and faster implementation.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: matt@codeblueprint.co.uk
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:43:41 +02:00
Peter Zijlstra
0ccb977f4c sched/fair: Explicitly generate __update_load_avg() instances
The __update_load_avg() function is an __always_inline because its
used with constant propagation to generate different variants of the
code without having to duplicate it (which would be prone to bugs).

Explicitly instantiate the 3 variants.

Note that most of this is called from rather hot paths, so reducing
branches is good.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:43:40 +02:00
Pavel Tatashin
7b09cc5a9d sched/clock: Fix broken stable to unstable transfer
When it is determined that the clock is actually unstable, and
we switch from stable to unstable, the __clear_sched_clock_stable()
function is eventually called.

In this function we set gtod_offset so the following holds true:

  sched_clock() + raw_offset == ktime_get_ns() + gtod_offset

But instead of getting the latest timestamps, we use the last values
from scd, so instead of sched_clock() we use scd->tick_raw, and
instead of ktime_get_ns() we use scd->tick_gtod.

However, later, when we use gtod_offset sched_clock_local() we do not
add it to scd->tick_gtod to calculate the correct clock value when we
determine the boundaries for min/max clocks.

This can result in tick granularity sched_clock() values, so fix it.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 10:23:48 +02:00
Srikar Dronamraju
05b40e0577 sched/fair: Prefer sibiling only if local group is under-utilized
If the child domain prefers tasks to go siblings, the local group could
end up pulling tasks to itself even if the local group is almost equally
loaded as the source group.

Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
Everytime, local group has capacity and source group has atleast 2 threads,
local group tries to pull the task. This causes the threads to constantly
move between different cores. This is even more profound if the cores have
more threads, like in Power 8, smt 8 mode.

Fix this by only allowing local group to pull a task, if the source group
has more number of tasks than the local group.

Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.

Without patch:
 Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):

             1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
               366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
             3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )

 Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):

             6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
             3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
             5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )

 Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):

             8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
             2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
            10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )

With patch:

 Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):

             1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
               123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
             3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )

 Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):

             2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
               189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
             5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )

 Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):

             5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
               506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
            10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )

Which show that in these workloads CPU migrations get reduced significantly.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 10:22:26 +02:00
Rafael J. Wysocki
70e493f3bb Merge back schedutil governor updates for 4.12. 2017-03-25 02:35:48 +01:00
Masanari Iida
0ba42a599f treewide: Fix typo in xml/driver-api/basics.xml
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-24 15:47:57 +01:00
Rafael J. Wysocki
38d4ea229d cpufreq: schedutil: Trace frequency only if it has changed
sugov_update_commit() calls trace_cpu_frequency() to record the
current CPU frequency if it has not changed in the fast switch case
to prevent utilities from getting confused (they may report that the
CPU is idle if the frequency has not been recorded for too long, for
example).

However, that may cause the tracepoint to be triggered quite often
for no real reason (if the frequency doesn't change, we will not
modify the last update time stamp and governor computations may
run again shortly when that happens), so don't do that (arguably, it
is done to work around a utilities bug anyway).

That allows code duplication in sugov_update_commit() to be reduced
somewhat too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-03-24 02:57:22 +01:00
Vincent Guittot
bc4278987e sched/fair: Fix FTQ noise bench regression
A regression of the FTQ noise has been reported by Ying Huang,
on the following hardware:

  8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory

... which was caused by this commit:

  commit 4e5160766f ("sched/fair: Propagate asynchrous detach")

The only part of the patch that can increase the noise is the update
of blocked load of group entity in update_blocked_averages().

We can optimize this call and skip the update of group entity if its load
and utilization are already null and there is no pending propagation of load
in the task group.

This optimization partly restores the noise score. A more agressive
optimization has been tried but has shown worse score.

Reported-by: ying.huang@linux.intel.com
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: ying.huang@intel.com
Fixes: 4e5160766f ("sched/fair: Propagate asynchrous detach")
Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
[ Fixed typos, improved layout. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:44:51 +01:00
Wanpeng Li
d7921a5dda sched/core: Fix rq lock pinning warning after call balance callbacks
This can be reproduced by running rt-migrate-test:

 WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock()
 unpinning an unpinned lock
 ...
 Call Trace:
  dump_stack()
  __warn()
  warn_slowpath_fmt()
  lock_unpin_lock()
  __balance_callback()
  __schedule()
  schedule()
  futex_wait_queue_me()
  futex_wait()
  do_futex()
  SyS_futex()
  do_syscall_64()
  entry_SYSCALL64_slow_path()

Revert the rq_lock_irqsave() usage here, the whole point of the
balance_callback() was to allow dropping rq->lock.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8a8c69c327 ("sched/core: Add rq->lock wrappers")
Link: http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:44:51 +01:00
Peter Zijlstra
698eff6355 sched/clock, x86/perf: Fix "perf test tsc"
People reported that commit:

  5680d8094f ("sched/clock: Provide better clock continuity")

broke "perf test tsc".

That commit added another offset to the reported clock value; so
take that into account when computing the provided offset values.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:31:49 +01:00
Peter Zijlstra
71fdb70eb4 sched/clock: Fix clear_sched_clock_stable() preempt wobbly
Paul reported a problems with clear_sched_clock_stable(). Since we run
all of __clear_sched_clock_stable() from workqueue context, there's a
preempt problem.

Solve it by only running the static_key_disable() from workqueue.

Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:31:48 +01:00
Rafael J. Wysocki
b7eaf1aab9 cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task.  If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away.  That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway.  On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
2017-03-23 02:12:14 +01:00
Rafael J. Wysocki
4296f23ed4 cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.

That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.

Fixes: 21ca6d2c52 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
2017-03-21 01:03:08 +01:00
Peter Zijlstra
15ff991e80 sched/core: Avoid double update_rq_clock() in move_queued_task()
Address this case:

  WARNING: CPU: 0 PID: 2070 at ../kernel/sched/core.c:109 update_rq_clock+0x74/0x80
  rq->clock_update_flags & RQCF_UPDATED

  Call Trace:
  update_rq_clock()
  move_queued_task()
  __set_cpus_allowed_ptr()
  ...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:26 +01:00
Peter Zijlstra
5704ac0ae7 sched/core: Fix double update_rq_clock) calls in attach_task()/detach_task()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:25 +01:00
Peter Zijlstra
7a57f32a4d sched/core: Avoid obvious double update_rq_clock()
Add DEQUEUE_NOCLOCK to all places where we just did an
update_rq_clock() already.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:25 +01:00
Peter Zijlstra
bce4dc80c6 sched/core: Simplify update_rq_clock() in __schedule()
Instead of relying on deactivate_task() to call update_rq_clock() and
handling the case where it didn't happen (task_on_rq_queued),
unconditionally do update_rq_clock() and skip any further updates.

This also avoids a double update on deactivate_task() + ttwu_local().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:24 +01:00
Peter Zijlstra
77558e4d01 sched/core: Make sched_ttwu_pending() atomic in time
Since all tasks on the wake_list are woken under a single rq->lock
avoid calling update_rq_clock() for each task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:24 +01:00
Peter Zijlstra
7134b3e941 sched/core: Add ENQUEUE_NOCLOCK to ENQUEUE_RESTORE
In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because
DEQUEUE_SAVE will have done an update_rq_clock().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:23 +01:00
Peter Zijlstra
0a67d1ee30 sched/core: Add {EN,DE}QUEUE_NOCLOCK flags
Currently {en,de}queue_task() do an unconditional update_rq_clock().
However since we want to avoid duplicate updates, so that each
rq->lock section appears atomic in time, we need to be able to skip
these clock updates.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:23 +01:00
Peter Zijlstra
8a8c69c327 sched/core: Add rq->lock wrappers
The missing update_rq_clock() check can work with partial rq->lock
wrappery, since a missing wrapper can cause the warning to not be
emitted when it should have, but cannot cause the warning to trigger
when it should not have.

The duplicate update_rq_clock() check however can cause false warnings
to trigger. Therefore add more comprehensive rq->lock wrappery.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:22 +01:00
Peter Zijlstra
26ae58d23b sched/core: Add WARNING for multiple update_rq_clock() calls
Now that we have no missing calls, add a warning to find multiple
calls.

By having only a single update_rq_clock() call per rq-lock section,
the section appears 'atomic' wrt time.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:21 +01:00
Steven Rostedt (VMware)
3e777f9909 sched/rt: Add comments describing the RT IPI pull method
While looking into optimizations for the RT scheduler IPI logic, I realized
that the comments are lacking to describe it efficiently. It deserves a
lengthy description describing its design.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home
[ Small typographical edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:41:35 +01:00
Steven Rostedt (VMware)
2317d5f1c3 sched/deadline: Use deadline instead of period when calculating overflow
I was testing Daniel's changes with his test case, and tweaked it a
little. Instead of having the runtime equal to the deadline, I
increased the deadline ten fold.

Daniel's test case had:

	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */

To make it more interesting, I changed it to:

	attr.sched_runtime  =  2 * 1000 * 1000;		/* 2 ms */
	attr.sched_deadline = 20 * 1000 * 1000;		/* 20 ms */
	attr.sched_period   =  2 * 1000 * 1000 * 1000;	/* 2 s */

The results were rather surprising. The behavior that Daniel's patch
was fixing came back. The task started using much more than .1% of the
CPU. More like 20%.

Looking into this I found that it was due to the dl_entity_overflow()
constantly returning true. That's because it uses the relative period
against relative runtime vs the absolute deadline against absolute
runtime.

  runtime / (deadline - t) > dl_runtime / dl_period

There's even a comment mentioning this, and saying that when relative
deadline equals relative period, that the equation is the same as using
deadline instead of period. That comment is backwards! What we really
want is:

  runtime / (deadline - t) > dl_runtime / dl_deadline

We care about if the runtime can make its deadline, not its period. And
then we can say "when the deadline equals the period, the equation is
the same as using dl_period instead of dl_deadline".

After correcting this, now when the task gets enqueued, it can throttle
correctly, and Daniel's fix to the throttling of sleeping deadline
tasks works even when the runtime and deadline are not the same.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:37:38 +01:00
Daniel Bristot de Oliveira
df8eac8caf sched/deadline: Throttle a constrained deadline task activated after the deadline
During the activation, CBS checks if it can reuse the current task's
runtime and period. If the deadline of the task is in the past, CBS
cannot use the runtime, and so it replenishes the task. This rule
works fine for implicit deadline tasks (deadline == period), and the
CBS was designed for implicit deadline tasks. However, a task with
constrained deadline (deadine < period) might be awakened after the
deadline, but before the next period. In this case, replenishing the
task would allow it to run for runtime / deadline. As in this case
deadline < period, CBS enables a task to run for more than the
runtime / period. In a very loaded system, this can cause a domino
effect, making other tasks miss their deadlines.

To avoid this problem, in the activation of a constrained deadline
task after the deadline but before the next period, throttle the
task and set the replenishing timer to the begin of the next period,
unless it is boosted.

Reproducer:

 --------------- %< ---------------
  int main (int argc, char **argv)
  {
	int ret;
	int flags = 0;
	unsigned long l = 0;
	struct timespec ts;
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);

	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */

	ts.tv_sec = 0;
	ts.tv_nsec = 2000 * 1000;			/* 2 ms */

	ret = sched_setattr(0, &attr, flags);

	if (ret < 0) {
		perror("sched_setattr");
		exit(-1);
	}

	for(;;) {
		/* XXX: you may need to adjust the loop */
		for (l = 0; l < 150000; l++);
		/*
		 * The ideia is to go to sleep right before the deadline
		 * and then wake up before the next period to receive
		 * a new replenishment.
		 */
		nanosleep(&ts, NULL);
	}

	exit(0);
  }
  --------------- >% ---------------

On my box, this reproducer uses almost 50% of the CPU time, which is
obviously wrong for a task with 2/2000 reservation.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:37:38 +01:00
Daniel Bristot de Oliveira
5ac69d3778 sched/deadline: Make sure the replenishment timer fires in the next period
Currently, the replenishment timer is set to fire at the deadline
of a task. Although that works for implicit deadline tasks because the
deadline is equals to the begin of the next period, that is not correct
for constrained deadline tasks (deadline < period).

For instance:

f.c:
 --------------- %< ---------------
int main (void)
{
	for(;;);
}
 --------------- >% ---------------

  # gcc -o f f.c

  # trace-cmd record -e sched:sched_switch                              \
				   -e syscalls:sys_exit_sched_setattr   \
   chrt -d --sched-runtime  490000000					\
           --sched-deadline 500000000					\
	   --sched-period  1000000000 0 ./f

  # trace-cmd report | grep "{pid of ./f}"

After setting parameters, the task is replenished and continue running
until being throttled:

         f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0

The task is throttled after running 492318 ms, as expected:

         f-11295 [003] 13322.606094: sched_switch:   f:11295 [-1] R ==> watchdog/3:32 [0]

But then, the task is replenished 500719 ms after the first
replenishment:

    <idle>-0     [003] 13322.614495: sched_switch:   swapper/3:0 [120] R ==> f:11295 [-1]

Running for 490277 ms:

         f-11295 [003] 13323.104772: sched_switch:   f:11295 [-1] R ==>  swapper/3:0 [120]

Hence, in the first period, the task runs 2 * runtime, and that is a bug.

During the first replenishment, the next deadline is set one period away.
So the runtime / period starts to be respected. However, as the second
replenishment took place in the wrong instant, the next replenishment
will also be held in a wrong instant of time. Rather than occurring in
the nth period away from the first activation, it is taking place
in the (nth period - relative deadline).

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:37:37 +01:00
Matt Fleming
caeb588297 sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
'calc_load_update' is accessed without any kind of locking and there's
a clear assumption in the code that only a single value is read or
written.

Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
unintentionally seeing multiple values, or having the load/stores
split.

Technically the loads in calc_global_*() don't require this since
those are the only functions that update 'calc_load_update', but I've
added the READ_ONCE() for consistency.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:21:01 +01:00
Matt Fleming
6e5f32f7a4 sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.

This situation on exiting NO_HZ is described by:

  this_rq->calc_load_update < jiffies < calc_load_update

In this scenario, what we should be doing is:

  this_rq->calc_load_update = calc_load_update		     [ next window ]

But what we actually do is:

  this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]

This has the effect of delaying load average updates for potentially
up to ~9seconds.

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce before,

  commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:21:00 +01:00
Wanpeng Li
dcc3b5ffe1 sched/deadline: Add missing update_rq_clock() in dl_task_timer()
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:

 ------------[ cut here ]------------
 WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 7 PID: 0 Comm: swapper/7 Tainted: G    B           4.11.0-rc1+ #24
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  <IRQ>
  dump_stack+0x85/0xc4
  __warn+0x172/0x1b0
  warn_slowpath_fmt+0xb4/0xf0
  ? __warn+0x1b0/0x1b0
  ? debug_check_no_locks_freed+0x2c0/0x2c0
  ? cpudl_set+0x3d/0x2b0
  replenish_dl_entity+0x71e/0xc40
  enqueue_task_dl+0x2ea/0x12e0
  ? dl_task_timer+0x777/0x990
  ? __hrtimer_run_queues+0x270/0xa50
  dl_task_timer+0x316/0x990
  ? enqueue_task_dl+0x12e0/0x12e0
  ? enqueue_task_dl+0x12e0/0x12e0
  __hrtimer_run_queues+0x270/0xa50
  ? hrtimer_cancel+0x20/0x20
  ? hrtimer_interrupt+0x119/0x600
  hrtimer_interrupt+0x19c/0x600
  ? trace_hardirqs_off+0xd/0x10
  local_apic_timer_interrupt+0x74/0xe0
  smp_apic_timer_interrupt+0x76/0xa0
  apic_timer_interrupt+0x93/0xa0

The DL task will be migrated to a suitable later deadline rq once the DL
timer fires and currnet rq is offline. The rq clock of the new rq should
be updated. This patch fixes it by updating the rq clock after holding
the new rq's rq lock.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:20:59 +01:00
Viresh Kumar
cba1dfb57b cpufreq: schedutil: Refactor sugov_next_freq_shared()
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.

It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need of calling
sugov_iowait_boost() at the top of the routine.

To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-12 23:11:33 +01:00
Viresh Kumar
994a8f2514 cpufreq: schedutil: Redefine the rate_limit_us tunable
The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor.  However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency.  The latter
is where the real overhead comes from.  The former is much less
expensive in terms of execution time and running it every time the
governor callback is invoked by the scheduler, after rate_limit_us
interval has passed since the last frequency update, would not be a
problem.

For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-12 23:11:33 +01:00
Linus Torvalds
c1aa905a30 Power management updates for v4.11-rc2
- Three fixes for intel_pstate problems related to the passive
    mode (in which it acts as a regular cpufreq scaling driver), two
    for the handling of global P-state limits and one for the handling
    of the cpu_frequency tracepoint in that mode (Rafael Wysocki).
 
  - Three fixes for the handling of P-state limits in intel_pstate in
    the active mode (Rafael Wysocki).
 
  - Introduction of a new cpufreq.off=1 kernel command line argument
    that will disable cpufreq entirely if passed to the kernel and
    is simply hooked up to the existing code used by Xen (Len Brown).
 
  - Fix for the schedutil cpufreq governor to prevent it from using
    stale raw frequency values in configurations with mutiple CPUs
    sharing one policy object and a cleanup for it reducing its
    overhead slightly (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYwc9bAAoJEILEb/54YlRxU24P/2RPFpRYNtGNfbnvmyNjcA5H
 MLQ/i0HEqhBVsfsj2Kmy6sRkY/X/kqnOiD2PzGcaOMTtMx2FcLNHzNqH0P0I7A4W
 9wzUcq0SC5FzTJcLryN4gtJa3tfBDHjr6peAcZyji5t3DbGa+mygVwH8IGyr4feo
 u7PE/6eXLfkRIySwG/kCI/YVE8tWuhjDHbjhR1pjgyrMjhDbPP480Mgsi4eDRRTO
 BwGVyQJXoMo9e7/vZ8Y8Fkt7PMxYeyeE1yAGHkJzjkFfdrkZrn3IUPfgVgxy8IPV
 N3w1BB+H84duEswPZH2patdIKQxXG3r0eaGHTXZIwy+5sHyc3mUMJ1FeHR67gasd
 Z4p4hylTP+dGZGDo4G7PbVX4V6zEP+LSxgOQYjbo4n6k3nJJOrIF4wt+l5+tQz5m
 Y+V6XVHgZm1LyFgjj06jMeVSmk6SS7X0qBHJ4WcfGzV/J+vStXmIvAmljs+cd/u2
 R9z1W/rxelZFKT+4lRCuztpBYfJvlE5nXyM1XwC6Rz6QjFax0pJAHUPFznWyq24C
 GlMAvQWPapDEoOvDrS/QKeqX1ROOYf2FwHs+uvPPCvOxMjL9NCQsQ34tqCM3MiL/
 ebk3uZ0xC1e0LaOB87xr7tfsag12MyZzSCwX0D8nBLBLvntpa/xQwE5+4vu9ZguD
 xlarnIsEh6bTwTVH6DJP
 =Gwp7
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix several issues in the intel_pstate driver and one issue in
  the schedutil cpufreq governor, clean up that governor a bit and hook
  up existing code for disabling cpufreq to a new kernel command line
  option.

  Specifics:

   - Three fixes for intel_pstate problems related to the passive mode
     (in which it acts as a regular cpufreq scaling driver), two for the
     handling of global P-state limits and one for the handling of the
     cpu_frequency tracepoint in that mode (Rafael Wysocki).

   - Three fixes for the handling of P-state limits in intel_pstate in
     the active mode (Rafael Wysocki).

   - Introduction of a new cpufreq.off=1 kernel command line argument
     that will disable cpufreq entirely if passed to the kernel and is
     simply hooked up to the existing code used by Xen (Len Brown).

   - Fix for the schedutil cpufreq governor to prevent it from using
     stale raw frequency values in configurations with mutiple CPUs
     sharing one policy object and a cleanup for it reducing its
     overhead slightly (Viresh Kumar)"

* tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
  cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
  cpufreq: intel_pstate: Fix global settings in active mode
  cpufreq: Add the "cpufreq.off=1" cmdline option
  cpufreq: schedutil: Pass sg_policy to get_next_freq()
  cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
  cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
  cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
  cpufreq: intel_pstate: Do not use performance_limits in passive mode
2017-03-09 16:30:37 -08:00
Linus Torvalds
bd0f9b356d sched/headers: fix up header file dependency on <linux/sched/signal.h>
The scheduler header file split and cleanups ended up exposing a few
nasty header file dependencies, and in particular it showed how we in
<linux/wait.h> ended up depending on "signal_pending()", which now comes
from <linux/sched/signal.h>.

That's a very subtle and annoying dependency, which already caused a
semantic merge conflict (see commit e58bc92783 "Pull overlayfs updates
from Miklos Szeredi", which added that fixup in the merge commit).

It turns out that we can avoid this dependency _and_ improve code
generation by moving the guts of the fairly nasty helper #define
__wait_event_interruptible_locked() to out-of-line code.  The code that
includes the signal_pending() check is all in the slow-path where we
actually go to sleep waiting for the event anyway, so using a helper
function is the right thing to do.

Using a helper function is also what we already did for the non-locked
versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
set of helper functions.

We might want to try to unify all these macro games, we have a _lot_ of
subtly different wait-event loops.  But this is the minimal patch to fix
the annoying header dependency.

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-08 10:36:03 -08:00
Josh Poimboeuf
d83a7cb375 livepatch: change to a per-task consistency model
Change livepatch to use a basic per-task consistency model.  This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics.  This is the
biggest remaining piece needed to make livepatch more generally useful.

This code stems from the design proposal made by Vojtech [1] in November
2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching.  There are also a number of fallback options which make
it quite flexible.

Patches are applied on a per-task basis, when the task is deemed safe to
switch over.  When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually this transition state can complete in a few seconds.  The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.

An interrupt handler inherits the patched state of the task it
interrupts.  The same is true for forked tasks: the child inherits the
patched state of the parent.

Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:

1. The first and most effective approach is stack checking of sleeping
   tasks.  If no affected functions are on the stack of a given task,
   the task is patched.  In most cases this will patch most or all of
   the tasks on the first try.  Otherwise it'll keep trying
   periodically.  This option is only available if the architecture has
   reliable stacks (HAVE_RELIABLE_STACKTRACE).

2. The second approach, if needed, is kernel exit switching.  A
   task is switched when it returns to user space from a system call, a
   user space IRQ, or a signal.  It's useful in the following cases:

   a) Patching I/O-bound user tasks which are sleeping on an affected
      function.  In this case you have to send SIGSTOP and SIGCONT to
      force it to exit the kernel and be patched.
   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
      then it will get patched the next time it gets interrupted by an
      IRQ.
   c) In the future it could be useful for applying patches for
      architectures which don't yet have HAVE_RELIABLE_STACKTRACE.  In
      this case you would have to signal most of the tasks on the
      system.  However this isn't supported yet because there's
      currently no way to patch kthreads without
      HAVE_RELIABLE_STACKTRACE.

3. For idle "swapper" tasks, since they don't ever exit the kernel, they
   instead have a klp_update_patch_state() call in the idle loop which
   allows them to be patched before the CPU enters the idle state.

   (Note there's not yet such an approach for kthreads.)

All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will disable per-task consistency and
patch all tasks immediately.  This can be useful if the patch doesn't
change any function or data semantics.  Note that, even with this flag
set, it's possible that some tasks may still be running with an old
version of the function, until that function returns.

There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency.  This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.

For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
must set patch->immediate which causes all tasks to be patched
immediately.  This option should be used with care, only when the patch
doesn't change any function or data semantics.

In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
may be allowed to use per-task consistency if we can come up with
another way to patch kthreads.

The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition.  Only a single patch (the topmost patch on the stack)
can be in transition at a given time.  A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.

A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress.  Then all the tasks will attempt to
converge back to the original patch state.

[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Ingo Molnar <mingo@kernel.org>        # for the scheduler changes
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:36:21 +01:00
Linus Torvalds
609b07b72d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A fix for KVM's scheduler clock which (erroneously) was always marked
  unstable, a fix for RT/DL load balancing, plus latency fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
  sched/core: Fix pick_next_task() for RT,DL
  sched/fair: Make select_idle_cpu() more aggressive
2017-03-07 14:42:34 -08:00
Viresh Kumar
655cb1ebff cpufreq: schedutil: Pass sg_policy to get_next_freq()
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-05 23:58:48 +01:00
Viresh Kumar
6c4f0fa643 cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.

Move cached_raw_freq to struct sugov_policy.

Fixes: 5cbea46984 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-05 23:58:48 +01:00
Ingo Molnar
1050b27c52 sched/headers: Move cputime functionality from <linux/sched.h> and <linux/cputime.h> into <linux/sched/cputime.h>
Move cputime related functionality out of <linux/sched.h>, as most code
that includes <linux/sched.h> does not use that functionality.

Move data types that are not included in task_struct directly to
the signal definitions, into <linux/sched/signal.h>.

Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:22 +01:00
Peter Zijlstra
0ba87bb27d sched/core: Fix pick_next_task() for RT,DL
Pavan noticed that the following commit:

  49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")

... broke RT,DL balancing by robbing them of the opportinty to do new-'idle'
balancing when their last runnable task (on that runqueue) goes away.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:50:17 +01:00
Peter Zijlstra
4c77b18cf8 sched/fair: Make select_idle_cpu() more aggressive
Kitsunyan reported desktop latency issues on his Celeron 887 because
of commit:

  1b568f0aab ("sched/core: Optimize SCHED_SMT")

... even though his CPU doesn't do SMT.

The effect of running the SMT code on a !SMT part is basically a more
aggressive select_idle_cpu(). Removing the avg condition fixed things
for him.

I also know FB likes this test gone, even though other workloads like
having it.

For now, take it out by default, until we get a better idea.

Reported-by: kitsunyan <kitsunyan@inbox.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:50:17 +01:00
Ingo Molnar
1777e46355 sched/headers: Prepare to move _init() prototypes from <linux/sched.h> to <linux/sched/init.h>
But first introduce a trivial header and update usage sites.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:40 +01:00
Ingo Molnar
32ef5517c2 sched/headers: Prepare to move cputime functionality from <linux/sched.h> into <linux/sched/cputime.h>
Introduce a trivial, mostly empty <linux/sched/cputime.h> header
to prepare for the moving of cputime functionality out of sched.h.

Update all code that relies on these facilities.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:39 +01:00
Ingo Molnar
f719ff9bce sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h>
But first update the code that uses these facilities with the
new header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar
589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
dfc3401a33 sched/headers: Prepare to move the 'root_task_group' declaration to <linux/sched/autogroup.h>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
68db0cf106 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
299300258d sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task.h>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:35 +01:00
Ingo Molnar
ef8bd77f33 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/hotplug.h>
We are going to split <linux/sched/hotplug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/hotplug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:35 +01:00
Ingo Molnar
b17b01533b sched/headers: Prepare for new header dependencies before moving code to <linux/sched/debug.h>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
370c91355c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/nohz.h>
We are going to split <linux/sched/nohz.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/nohz.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
03441a3482 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/stat.h>
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ingo Molnar
7fce777cd4 sched/headers: Prepare header dependency changes, move the <asm/paravirt.h> include to kernel/sched/sched.h
Recent header reorganizations unearthed this hidden dependency:

  kernel/sched/core.c:199:25: error: 'paravirt_steal_rq_enabled' undeclared (first use in this function)
  kernel/sched/core.c:200:11: error: implicit declaration of function 'paravirt_steal_clock' [-Werror=implicit-function-declaration]

So move the asm/paravirt.h include from kernel/sched/cpuclock.c to kernel/sched/sched.h.

( NOTE: We do this change before doing the changes that introduce the build failure,
        so the series remains fully bisectable. )

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:31 +01:00
Ingo Molnar
6a3827d750 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/numa_balancing.h>
We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
55687da166 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/cpufreq.h>
We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
38b8d208a4 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/nmi.h>
We are going to move softlockup APIs out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

<linux/nmi.h> already includes <linux/sched.h>.

Include the <linux/nmi.h> header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Ingo Molnar
6e84f31522 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

   mm_alloc()
   __mmdrop()
   mmdrop()
   mmdrop_async_fn()
   mmdrop_async()
   mmget_not_zero()
   mmput()
   mmput_async()
   get_task_mm()
   mm_access()
   mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Ingo Molnar
4eb5aaa3af sched/headers: Prepare for new header dependencies before moving code to <linux/sched/autogroup.h>
We are going to split <linux/sched/autogroup.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/autogroup.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Ingo Molnar
4f17722c72 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/loadavg.h>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Ingo Molnar
ae7e81c077 sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.

Create empty placeholder header that maps to <linux/types.h>.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Ingo Molnar
e601757102 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Ingo Molnar
84f001e157 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/wake_q.h>
We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Ingo Molnar
4c822698cb sched/headers: Prepare for new header dependencies before moving code to <linux/sched/idle.h>
We are going to split  <linux/sched/idle.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/idle.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Ingo Molnar
105ab3d8ce sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h>
We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Ingo Molnar
f9411ebe3d rcu: Separate the RCU synchronization types and APIs into <linux/rcupdate_wait.h>
So rcupdate.h is a pretty complex header, in particular it includes
<linux/completion.h> which includes <linux/wait.h> - creating a
dependency that includes <linux/wait.h> in <linux/sched.h>,
which prevents the isolation of <linux/sched.h> from the derived
<linux/wait.h> header.

Solve part of the problem by decoupling rcupdate.h from completions:
this can be done by separating out the rcu_synchronize types and APIs,
and updating their usage sites.

Since this is a mostly RCU-internal types this will not just simplify
<linux/sched.h>'s dependencies, but will make all the hundreds of
.c files that include rcupdate.h but not completions or wait.h build
faster.

( For rcutiny this means that two dependent APIs have to be uninlined,
  but that shouldn't be much of a problem as they are rare variants. )

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:24 +01:00
Ingo Molnar
4b53a3412d sched/core: Remove the tsk_nr_cpus_allowed() wrapper
tsk_nr_cpus_allowed() too is a pretty pointless wrapper that
is not used consistently and which makes the code both harder
to read and longer as well.

So remove it - this also shrinks <linux/sched.h> a bit.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:24 +01:00
Ingo Molnar
0c98d344fe sched/core: Remove the tsk_cpus_allowed() wrapper
So the original intention of tsk_cpus_allowed() was to 'future-proof'
the field - but it's pretty ineffectual at that, because half of
the code uses ->cpus_allowed directly ...

Also, the wrapper makes the code longer than the original expression!

So just get rid of it. This also shrinks <linux/sched.h> a bit.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:24 +01:00
Ingo Molnar
59ddbcb2f4 sched/core: Move the get_preempt_disable_ip() inline to sched/core.c
It's defined in <linux/sched.h>, but nothing outside the scheduler
uses it - so move it to the sched/core.c usage site.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:23 +01:00
Ingo Molnar
c930b2c0de sched/core: Convert ___assert_task_state() link time assert to BUILD_BUG_ON()
The length of TASK_STATE_TO_CHAR_STR was still checked using the old
link-time manual error method - convert it to BUILD_BUG_ON(). This
has a couple of advantages:

 - it's more obvious what's going on

 - it reduces the size and complexity of <linux/sched.h>

 - BUILD_BUG_ON() will fail during compilation, with a clearer
   error message than the link time assert.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:23 +01:00
Linus Torvalds
65314ed08e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two rq-clock warnings related fixes, plus a cgroups related crash fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cgroup: Move sched_online_group() back into css_online() to fix crash
  sched/fair: Update rq clock before changing a task's CPU affinity
  sched/core: Fix update_rq_clock() splat on hotplug (and suspend/resume)
2017-02-28 11:44:01 -08:00
Vegard Nossum
f1f1007644 mm: add new mmgrab() helper
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:48 -08:00
Konstantin Khlebnikov
96b777452d sched/cgroup: Move sched_online_group() back into css_online() to fix crash
Commit:

  2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")

.. moved sched_online_group() from css_online() to css_alloc().
It exposes half-baked task group into global lists before initializing
generic cgroup stuff.

LTP testcase (third in cgroup_regression_test) written for testing
similar race in kernels 2.6.26-2.6.28 easily triggers this oops:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: kernfs_path_from_node_locked+0x260/0x320
  CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
  Call Trace:
  ? kernfs_path_from_node+0x4f/0x60
  kernfs_path_from_node+0x3e/0x60
  print_rt_rq+0x44/0x2b0
  print_rt_stats+0x7a/0xd0
  print_cpu+0x2fc/0xe80
  ? __might_sleep+0x4a/0x80
  sched_debug_show+0x17/0x30
  seq_read+0xf2/0x3b0
  proc_reg_read+0x42/0x70
  __vfs_read+0x28/0x130
  ? security_file_permission+0x9b/0xc0
  ? rw_verify_area+0x4e/0xb0
  vfs_read+0xa5/0x170
  SyS_read+0x46/0xa0
  entry_SYSCALL_64_fastpath+0x1e/0xad

Here the task group is already linked into the global RCU-protected 'task_groups'
list, but the css->cgroup pointer is still NULL.

This patch reverts this chunk and moves online back to css_online().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 09:25:28 +01:00
Wanpeng Li
a499c3ead8 sched/fair: Update rq clock before changing a task's CPU affinity
This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:

 ------------[ cut here ]------------
 WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  set_next_entity+0x11d/0x380
  set_curr_task_fair+0x2b/0x60
  do_set_cpus_allowed+0x139/0x180
  __set_cpus_allowed_ptr+0x113/0x260
  set_cpus_allowed_ptr+0x10/0x20
  torture_shuffle+0xfd/0x180
  kthread+0x10f/0x150
  ? torture_shutdown_init+0x60/0x60
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40
 ---[ end trace dd94d92344cea9c6 ]---

The task is running && !queued, so there is no rq clock update before calling
set_curr_task().

This patch fixes it by updating rq clock after holding rq->lock/pi_lock
just as what other dequeue + put_prev + enqueue + set_curr story does.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:58:33 +01:00
Peter Zijlstra
8cb68b343a sched/core: Fix update_rq_clock() splat on hotplug (and suspend/resume)
The hotplug code still triggers the warning about using a stale
rq->clock value.

Fix things up to actually run update_rq_clock() in a place where we
record the 'UPDATED' flag, and then modify the annotation to retain
this flag over the rq->lock fiddling that happens as a result of
actually migrating all the tasks elsewhere.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <zwisler@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4d25b35ea3 ("sched/fair: Restore previous rq_flags when migrating tasks in hotplug")
Link: http://lkml.kernel.org/r/20170202155506.GX6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:58:33 +01:00
Mark Brown
fd4a61e08a sched/core: Fix build paravirt build on arm and arm64
Commit 004172bdad ("sched/core: Remove unnecessary #include headers")
removed the inclusion of asm/paravirt.h which is used to get
declarations of paravirt_steal_rq_enabled and paravirt_steal_clock.

It is implicitly included on x86 but not on arm and arm64 breaking the
build if paravirtualization is used.  Since things from that header are
used directly fix the build by putting the direct inclusion back.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-21 10:54:02 -08:00
Linus Torvalds
828cad8ea0 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this (fairly busy) cycle were:

   - There was a class of scheduler bugs related to forgetting to update
     the rq-clock timestamp which can cause weird and hard to debug
     problems, so there's a new debug facility for this: which uncovered
     a whole lot of bugs which convinced us that we want to keep the
     debug facility.

     (Peter Zijlstra, Matt Fleming)

   - Various cputime related updates: eliminate cputime and use u64
     nanoseconds directly, simplify and improve the arch interfaces,
     implement delayed accounting more widely, etc. - (Frederic
     Weisbecker)

   - Move code around for better structure plus cleanups (Ingo Molnar)

   - Move IO schedule accounting deeper into the scheduler plus related
     changes to improve the situation (Tejun Heo)

   - ... plus a round of sched/rt and sched/deadline fixes, plus other
     fixes, updats and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
  sched/core: Remove unlikely() annotation from sched_move_task()
  sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
  sched/topology: Split out scheduler topology code from core.c into topology.c
  sched/core: Remove unnecessary #include headers
  sched/rq_clock: Consolidate the ordering of the rq_clock methods
  delayacct: Include <uapi/linux/taskstats.h>
  sched/core: Clean up comments
  sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
  sched/clock: Add dummy clear_sched_clock_stable() stub function
  sched/cputime: Remove generic asm headers
  sched/cputime: Remove unused nsec_to_cputime()
  s390, sched/cputime: Remove unused cputime definitions
  powerpc, sched/cputime: Remove unused cputime definitions
  s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
  ia64, sched/cputime: Remove unused cputime definitions
  ia64: Convert vtime to use nsec units directly
  ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
  sched/cputime: Remove jiffies based cputime
  sched/cputime, vtime: Return nsecs instead of cputime_t to account
  sched/cputime: Complete nsec conversion of tick based accounting
  ...
2017-02-20 12:52:55 -08:00
Steven Rostedt (VMware)
bb3bac2ca9 sched/core: Remove unlikely() annotation from sched_move_task()
The check for 'running' in sched_move_task() has an unlikely() around it. That
is, it is unlikely that the task being moved is running. That use to be
true. But with a couple of recent updates, it is now likely that the task
will be running.

The first change came from ea86cb4b76 ("sched/cgroup: Fix
cpu_cgroup_fork() handling") that moved around the use case of
sched_move_task() in do_fork() where the call is now done after the task is
woken (hence it is running).

The second change came from 8e5bfa8c1f ("sched/autogroup: Do not use
autogroup->tg in zombie threads") where sched_move_task() is called by the
exit path, by the task that is exiting. Hence it too is running.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:05:42 +01:00
Ingo Molnar
1051408f7e sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
The names are all 'autogroup', not 'auto_group' - so rename
the kernel/sched/auto_group.[ch] to match the existing
nomenclature.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-08 09:01:11 +01:00
Ingo Molnar
f2cb13609d sched/topology: Split out scheduler topology code from core.c into topology.c
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:58:12 +01:00
Ingo Molnar
004172bdad sched/core: Remove unnecessary #include headers
Over the years sched/core.c accumulated over 50 #include lines,
40 of which are superfluous. (!)

Removing them decreases the preprocessed .c file (.i) size noticeably:

            triton:~/tip> wc -l kernel/sched/core.i

  Before:   76387 kernel/sched/core.i
  After:    75896 kernel/sched/core.i

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:58:07 +01:00
Ingo Molnar
535b9552bb sched/rq_clock: Consolidate the ordering of the rq_clock methods
update_rq_clock_task() and update_rq_clock() we unnecessarily
spread across core.c, requiring an extra prototype line.

Move them next to each other and in the proper order.

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:58:01 +01:00
Ingo Molnar
d1ccc66df8 sched/core: Clean up comments
Refresh the comments in the core scheduler code:

 - Capitalize sentences consistently

 - Capitalize 'CPU' consistently

 - ... and other small details.

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:57:41 +01:00
Shile Zhang
975e155ed8 sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:

  ce0dbbbb30 ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")

... which name suggests to users that it's in milliseconds, while in reality
it's being set in milliseconds but the result is shown in jiffies.

This is obviously confusing when HZ is not 1000, it makes it appear like the
value set failed, such as HZ=100:

  root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
  root# cat /proc/sys/kernel/sched_rr_timeslice_ms
  10

Fix this to be milliseconds all around.

Signed-off-by: Shile Zhang <shile.zhang@nokia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 11:01:30 +01:00
Frederic Weisbecker
bfce1d6006 sched/cputime, vtime: Return nsecs instead of cputime_t to account
Turn the full dynticks cputime clock source to return nsec while keeping
its very internals jiffies based for performance reasons.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-27-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:59 +01:00
Frederic Weisbecker
2b1f967d80 sched/cputime: Complete nsec conversion of tick based accounting
This is the final step toward tick based cputime conversion. Now that
the whole cputime accounting engine accounts in nsecs, we can convert the
very source of the cputime to account in nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-26-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:59 +01:00
Frederic Weisbecker
fb8b049c98 sched/cputime: Push time to account_system_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-25-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:58 +01:00
Frederic Weisbecker
18b43a9bd7 sched/cputime: Push time to account_idle_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-24-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:57 +01:00
Frederic Weisbecker
be9095ed4f sched/cputime: Push time to account_steal_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-23-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:57 +01:00
Frederic Weisbecker
23244a5c80 sched/cputime: Push time to account_user_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-22-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:56 +01:00
Frederic Weisbecker
ebd7e7fc4b timers/posix-timers: Convert internals to use nsecs
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Also convert posix-cpu-timers to use nsec based internal counters to
simplify it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-19-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:54 +01:00
Frederic Weisbecker
a499a5a14d sched/cputime: Increment kcpustat directly on irqtime account
The irqtime is accounted is nsecs and stored in
cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the
accumulated amount reaches a new jiffy, this one gets accounted to the
kcpustat.

This was necessary when kcpustat was stored in cputime_t, which could at
worst have jiffies granularity. But now kcpustat is stored in nsecs
so this whole discretization game with temporary irqtime storage has
become unnecessary.

We can now directly account the irqtime to the kcpustat.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:53 +01:00
Frederic Weisbecker
5613fda9a5 sched/cputime: Convert task/group cputime to nsecs
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:49 +01:00
Frederic Weisbecker
16a6d9be90 sched/cputime: Convert guest time accounting to nsecs (u64)
cputime_t is being obsolete and replaced by nsecs units in order to make
internal timestamps less opaque and more granular.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:48 +01:00
Frederic Weisbecker
7fb1327ee9 sched/cputime: Convert kcpustat to nsecs
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.

Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:47 +01:00
Sebastian Andrzej Siewior
619bd4a718 sched/rt: Add a missing rescheduling point
Since the change in commit:

  fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")

... we don't reschedule a task under certain circumstances:

Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.

The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.

The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:37 +01:00
Mathieu Poirier
4b12db9391 sched/core: Fix &rd->cpudl memory leak
While in the process of initialising a root domain, if function
cpupri_init() fails the memory allocated in cpudl_init() is not
reclaimed.

Adding a new goto target to cleanup the previous initialistion of
the root_domain's dl_bw structure reclaims said memory.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-2-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:37 +01:00
Mathieu Poirier
92c99ac829 sched/core: Fix &rd->rto_mask memory leak
If function cpudl_init() fails the memory allocated for &rd->rto_mask
needs to be freed, something this patch is addressing.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-1-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:36 +01:00
Matt Fleming
4d25b35ea3 sched/fair: Restore previous rq_flags when migrating tasks in hotplug
__migrate_task() can return with a different runqueue locked than the
one we passed as an argument. So that we can repin the lock in
migrate_tasks() (and keep the update_rq_clock() bit) we need to
restore the old rq_flags before repinning.

Note that it wouldn't be correct to change move_queued_task() to repin
because of the change of runqueue and the fact that having an
up-to-date clock on the initial rq doesn't mean the new rq has one
too.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:35 +01:00
Peter Zijlstra
1b1d62254d sched/core: Add missing update_rq_clock() call in sched_move_task()
Bug was noticed via this warning:

  WARNING: CPU: 6 PID: 1 at kernel/sched/sched.h:804 detach_task_cfs_rq+0x8e8/0xb80
  rq->clock_update_flags < RQCF_ACT_SKIP
  Modules linked in:
  CPU: 6 PID: 1 Comm: systemd Not tainted 4.10.0-rc5-00140-g0874170baf55-dirty #1
  Hardware name: Supermicro SYS-4048B-TRFT/X10QBi, BIOS 1.0 04/11/2014
  Call Trace:
   dump_stack+0x4d/0x65
   __warn+0xcb/0xf0
   warn_slowpath_fmt+0x5f/0x80
   detach_task_cfs_rq+0x8e8/0xb80
   ? allocate_cgrp_cset_links+0x59/0x80
   task_change_group_fair+0x27/0x150
   sched_change_group+0x48/0xf0
   sched_move_task+0x53/0x150
   cpu_cgroup_attach+0x36/0x70
   cgroup_taskset_migrate+0x175/0x300
   cgroup_migrate+0xab/0xd0
   cgroup_attach_task+0xf0/0x190
   __cgroup_procs_write+0x1ed/0x2f0
   cgroup_procs_write+0x14/0x20
   cgroup_file_write+0x3f/0x100
   kernfs_fop_write+0x104/0x180
   __vfs_write+0x37/0x140
   vfs_write+0xb8/0x1b0
   SyS_write+0x55/0xc0
   do_syscall_64+0x61/0x170
   entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:34 +01:00
Peter Zijlstra
49ee576809 sched/core: Optimize pick_next_task() for idle_sched_class
Steve noticed that when we switch from IDLE to SCHED_OTHER we fail to
take the shortcut, even though all runnable tasks are of the fair
class, because prev->sched_class != &fair_sched_class.

Since I reworked the put_prev_task() stuff, we don't really care about
prev->class here, so removing that condition will allow this case.

This increases the likely case from 78% to 98% correct for Steve's
workload.

Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119174408.GN6485@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:34 +01:00
Nicolas Pitre
b18b6a9cef timers: Omit POSIX timer stuff from task_struct when disabled
When CONFIG_POSIX_TIMERS is disabled, it is preferable to remove related
structures from struct task_struct and struct signal_struct as they
won't contain anything useful and shouldn't be relied upon by mistake.
Code still referencing those structures is also disabled here.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-01-27 13:05:26 -08:00
Peter Zijlstra
acb04058de sched/clock: Fix hotplug crash
Mike reported that he could trigger the WARN_ON_ONCE() in
set_sched_clock_stable() using hotplug.

This exposed a fundamental problem with the interface, we should never
mark the TSC stable if we ever find it to be unstable. Therefore
set_sched_clock_stable() is a broken interface.

The reason it existed is that not having it is a pain, it means all
relevant architecture code needs to call clear_sched_clock_stable()
where appropriate.

Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK ia64
and parisc are trivial in that they never called
set_sched_clock_stable(), so add an unconditional call to
clear_sched_clock_stable() to them.

For x86 the story is a lot more involved, and what this patch tries to
do is ensure we preserve the status quo. So even is Cyrix or Transmeta
have usable TSC they never called set_sched_clock_stable() so they now
get an explicit mark unstable.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9881b024b7 ("sched/clock: Delay switching sched_clock to stable")
Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-20 02:38:46 +01:00
Tejun Heo
10ab56434f sched/core: Separate out io_schedule_prepare() and io_schedule_finish()
Now that IO schedule accounting is done inside __schedule(),
io_schedule() can be split into three steps - prep, schedule, and
finish - where the schedule part doesn't need any special annotation.
This allows marking a sleep as iowait by simply wrapping an existing
blocking function with io_schedule_prepare() and io_schedule_finish().

Because task_struct->in_iowait is single bit, the caller of
io_schedule_prepare() needs to record and the pass its state to
io_schedule_finish() to be safe regarding nesting.  While this isn't
the prettiest, these functions are mostly gonna be used by core
functions and we don't want to use more space for ->in_iowait.

While at it, as it's simple to do now, reimplement io_schedule()
without unnecessarily going through io_schedule_timeout().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:04 +01:00
Tejun Heo
e33a9bba85 sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

done close to the actual scheduling.  This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().

Patch-originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: akpm@linux-foundation.org
Cc: axboe@kernel.dk
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:03 +01:00
Dietmar Eggemann
b8fd842369 sched/fair: Explain why MIN_SHARES isn't scaled in calc_cfs_shares()
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e9a4d858-bcf3-36b9-e3a9-449953e34569@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Vincent Guittot
89ee048f3c sched/core: Fix group_entity's share update
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let take the example of a task a that is dequeued of its task group A:
   root
  (cfs_rq)
    \
    (se)
     A
    (cfs_rq)
      \
      (se)
       a

Task "a" was the only task in task group A which becomes idle when a is
dequeued.

We have the sequence:

- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
	A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.

If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Peter Zijlstra
da9647e076 sched/completions: Fix complete_all() semantics
Documentation/scheduler/completion.txt says this about complete_all():

  "calls complete_all() to signal all current and future waiters."

Which doesn't strictly match the current semantics. Currently
complete_all() is equivalent to UINT_MAX/2 complete() invocations,
which is distinctly less than 'all current and future waiters'
(enumerable vs innumerable), although it has worked in practice.

However, Dmitry had a weird case where it might matter, so change
completions to use saturation semantics for complete()/complete_all().
Once done hits UINT_MAX (and complete_all() sets it there) it will
never again be decremented.

Requested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: der.herr@hofr.at
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:01 +01:00
Tommaso Cucinotta
59f8c29892 sched/deadline: Show leftover runtime and abs deadline in /proc/*/sched
This patch allows for reading the current (leftover) runtime and
absolute deadline of a SCHED_DEADLINE task through /proc/*/sched
(entries dl.runtime and dl.deadline), while debugging/testing.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Daniel Bistrot de Oliveira <danielbristot@gmail.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1477473437-10346-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:00 +01:00
Peter Zijlstra
5680d8094f sched/clock: Provide better clock continuity
When switching between the unstable and stable variants it is
currently possible that clock discontinuities occur.

And while these will mostly be 'small', attempt to do better.

As observed on my IVB-EP, the sched_clock() is ~1.5s ahead of the
ktime_get_ns() based timeline at the point of switchover
(sched_clock_init_late()) after SMP bringup.

Equally, when the TSC is later found to be unstable -- typically
because SMM tries to hide its SMI latencies by mucking with the TSC --
we want to avoid large jumps.

Since the clocksource watchdog reports the issue after the fact we
cannot exactly fix up time, but since SMI latencies are typically
small (~10ns range), the discontinuity is mainly due to drift between
sched_clock() and ktime_get_ns() (which on my desktop is ~79s over
24days).

I dislike this patch because it adds overhead to the good case in
favour of dealing with badness. But given the widespread failure of
TSC stability this is worth it.

Note that in case the TSC makes drastic jumps after SMP bringup we're
still hosed. There's just not much we can do in that case without
stupid overhead.

If we were to somehow expose tsc_clocksource_reliable (which is hard
because this code is also used on ia64 and parisc) we could avoid some
of the newly introduced overhead.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:00 +01:00
Peter Zijlstra
9881b024b7 sched/clock: Delay switching sched_clock to stable
Currently we switch to the stable sched_clock if we guess the TSC is
usable, and then switch back to the unstable path if it turns out TSC
isn't stable during SMP bringup after all.

Delay switching to the stable path until after SMP bringup is
complete. This way we'll avoid switching during the time we detect the
worst of the TSC offences.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:59 +01:00
Peter Zijlstra
555570d744 sched/clock: Update static_key usage
sched_clock was still using the deprecated static_key interface.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:57 +01:00
Matt Fleming
cb42c9a3eb sched/core: Add debugging code to catch missing update_rq_clock() calls
There's no diagnostic checks for figuring out when we've accidentally
missed update_rq_clock() calls. Let's add some by piggybacking on the
rq_*pin_lock() wrappers.

The idea behind the diagnostic checks is that upon pining rq lock the
rq clock should be updated, via update_rq_clock(), before anybody
reads the clock with rq_clock() or rq_clock_task().

The exception to this rule is when updates have explicitly been
disabled with the rq_clock_skip_update() optimisation.

There are some functions that only unpin the rq lock in order to grab
some other lock and avoid deadlock. In that case we don't need to
update the clock again and the previous diagnostic state can be
carried over in rq_repin_lock() by saving the state in the rq_flags
context.

Since this patch adds a new clock update flag and some already exist
in rq::clock_skip_update, that field has now been renamed. An attempt
has been made to keep the flag manipulation code small and fast since
it's used in the heart of the __schedule() fast path.

For the !CONFIG_SCHED_DEBUG case the only object code change (other
than addresses) is the following change to reset RQCF_ACT_SKIP inside
of __schedule(),

  -       c7 83 38 09 00 00 00    movl   $0x0,0x938(%rbx)
  -       00 00 00
  +       83 a3 38 09 00 00 fc    andl   $0xfffffffc,0x938(%rbx)

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:35 +01:00
Peter Zijlstra
2fb8d36787 sched/core: Add missing update_rq_clock() call in set_user_nice()
Address this rq-clock update bug:

  WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    set_next_entity()
    ? _raw_spin_lock()
    set_curr_task_fair()
    set_user_nice.part.85()
    set_user_nice()
    create_worker()
    worker_thread()
    kthread()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:34 +01:00
Peter Zijlstra
3bed5e2166 sched/core: Add missing update_rq_clock() call for task_hot()
Add the update_rq_clock() call at the top of the callstack instead of
at the bottom where we find it missing, this to aid later effort to
minimize the number of update_rq_lock() calls.

  WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    assert_clock_updated.isra.63.part.64()
    can_migrate_task()
    load_balance()
    pick_next_task_fair()
    __schedule()
    schedule()
    worker_thread()
    kthread()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:34 +01:00
Peter Zijlstra
80f5c1b84b sched/core: Add missing update_rq_clock() in detach_task_cfs_rq()
Instead of adding the update_rq_clock() all the way at the bottom of
the callstack, add one at the top, this to aid later effort to
minimize update_rq_lock() calls.

  WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    detach_task_cfs_rq()
    switched_from_fair()
    __sched_setscheduler()
    _sched_setscheduler()
    sched_set_stop_task()
    cpu_stop_create()
    __smpboot_create_thread.part.2()
    smpboot_register_percpu_thread_cpumask()
    cpu_stop_init()
    do_one_initcall()
    ? print_cpu_info()
    kernel_init_freeable()
    ? rest_init()
    kernel_init()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:33 +01:00
Peter Zijlstra
4126bad671 sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()
Address this rq-clock update bug:

  WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
46f69fa337 sched/fair: Push rq lock pin/unpin into idle_balance()
Future patches will emit warnings if rq_clock() is called before
update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.

Since there is only one caller of idle_balance() we can push the
unpin/repin there.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
92509b732b sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock
rq_clock() is called from sched_info_{depart,arrive}() after resetting
RQCF_ACT_SKIP but prior to a call to update_rq_clock().

In preparation for pending patches that check whether the rq clock has
been updated inside of a pin context before rq_clock() is called, move
the reset of rq->clock_skip_update immediately before unpinning the rq
lock.

This will avoid the new warnings which check if update_rq_clock() is
being actively skipped.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:31 +01:00
Matt Fleming
d8ac897137 sched/core: Add wrappers for lockdep_(un)pin_lock()
In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.

Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:30 +01:00
Frederic Weisbecker
c8d7dabf8f sched/cputime: Rename vtime_account_user() to vtime_flush()
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y used to accumulate user time and
account it on ticks and context switches only through the
vtime_account_user() function.

Now this model has been generalized on the 3 archs for all kind of
cputime (system, irq, ...) and all the cputime flushing happens under
vtime_account_user().

So let's rename this function to better reflect its new role.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-11-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:54:13 +01:00
Frederic Weisbecker
1213699ab4 sched/cputime: Export account_guest_time()
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's allow archs to account cputime
directly to gtime.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:54:11 +01:00
Frederic Weisbecker
c31cc6a518 sched/cputime: Allow accounting system time using cpustat index
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's provide APIs to account system
time to precise contexts: hardirq, softirq, pure system, ...

Inspired-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:54:11 +01:00
Thomas Gleixner
8b0e195314 ktime: Cleanup ktime_set() usage
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Linus Torvalds
7b9dc3f75f Power management material for v4.10-rc1
- New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
    for it (Markus Mayer).
 
  - Support for ARM Integrator/AP and Integrator/CP in the generic
    DT cpufreq driver and elimination of the old Integrator cpufreq
    driver (Linus Walleij).
 
  - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
    and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
    Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik).
 
  - cpufreq core fix to eliminate races that may lead to using
    inactive policy objects and related cleanups (Rafael Wysocki).
 
  - cpufreq schedutil governor update to make it use SCHED_FIFO
    kernel threads (instead of regular workqueues) for doing delayed
    work (to reduce the response latency in some cases) and related
    cleanups (Viresh Kumar).
 
  - New cpufreq sysfs attribute for resetting statistics (Markus
    Mayer).
 
  - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
    Viresh Kumar).
 
  - Support for using generic cpufreq governors in the intel_pstate
    driver (Rafael Wysocki).
 
  - Support for per-logical-CPU P-state limits and the EPP/EPB
    (Energy Performance Preference/Energy Performance Bias) knobs
    in the intel_pstate driver (Srinivas Pandruvada).
 
  - New CPU ID for Knights Mill in intel_pstate (Piotr Luc).
 
  - intel_pstate driver modification to use the P-state selection
    algorithm based on CPU load on platforms with the system profile
    in the ACPI tables set to "mobile" (Srinivas Pandruvada).
 
  - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
    Srinivas Pandruvada).
 
  - cpufreq powernv driver updates including fast switching support
    (for the schedutil governor), fixes and cleanus (Akshay Adiga,
    Andrew Donnellan, Denis Kirjanov).
 
  - acpi-cpufreq driver rework to switch it over to the new CPU
    offline/online state machine (Sebastian Andrzej Siewior).
 
  - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
    Prakash).
 
  - Idle injection rework (to make it use the regular idle path
    instead of a home-grown custom one) and related powerclamp
    thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek,
    Sebastian Andrzej Siewior).
 
  - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
    Shevchenko, Piotr Luc).
 
  - intel_idle driver cleanups and switch over to using the new CPU
    offline/online state machine (Anna-Maria Gleixner, Sebastian
    Andrzej Siewior).
 
  - cpuidle DT driver update to support suspend-to-idle properly
    (Sudeep Holla).
 
  - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
    Rafael Wysocki).
 
  - Preliminary support for power domains including CPUs in the
    generic power domains (genpd) framework and related DT bindings
    (Lina Iyer).
 
  - Assorted fixes and cleanups in the generic power domains (genpd)
    framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven).
 
  - Preliminary support for devices with multiple voltage regulators
    and related fixes and cleanups in the Operating Performance Points
    (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd).
 
  - System sleep state selection interface rework to make it easier
    to support suspend-to-idle as the default system suspend method
    (Rafael Wysocki).
 
  - PM core fixes and cleanups, mostly related to the interactions
    between the system suspend and runtime PM frameworks (Ulf Hansson,
    Sahitya Tummala, Tony Lindgren).
 
  - Latency tolerance PM QoS framework imorovements (Andrew
    Lutomirski).
 
  - New Knights Mill CPU ID for the Intel RAPL power capping driver
    (Piotr Luc).
 
  - Intel RAPL power capping driver fixes, cleanups and switch over
    to using the new CPU offline/online state machine (Jacob Pan,
    Thomas Gleixner, Sebastian Andrzej Siewior).
 
  - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
    rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
    Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh
    Kumar).
 
  - Fix for false-positive KASAN warnings during resume from ACPI S3
    (suspend-to-RAM) on x86 (Josh Poimboeuf).
 
  - Memory map verification during resume from hibernation on x86 to
    ensure a consistent address space layout (Chen Yu).
 
  - Wakeup sources debugging enhancement (Xing Wei).
 
  - rockchip-io AVS driver cleanup (Shawn Lin).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYTx4+AAoJEILEb/54YlRx9f8P/2SlNHUENW5qh6FtCw00oC2u
 UqJerQJ2L38UgbgxbE/0VYblma9rFABDWC1eO2xN2XdcdW5UPBKPVvNcOgNe1Clh
 gjy3RxZXVpmjfzt2kGfsTLEuGnHqwvx51hTUkeA2LwvkOal45xb8ZESmy8opCtiv
 iG4LwmPHoxdX5Za5nA9ItFKzxyO1EoyNSnBYAVwALDHxmNOfxEcRevfurASt/0M9
 brCCZJA0/sZxeL0lBdy8fNQPIBTUfCoTJG/MtmzGrObJ9wMFvEDfXrVEyZiWs/zA
 AAZ4kQL77enrIKgrLN8e0G6LzTLHoVcvn38Xjf24dKUqhd7ACBhYcnW+jK3+7EAd
 gjZ8efObQsiuyK/EDLUNw35tt96CHOqfrQCj2tIwRVvk9EekLqAGXdIndTCr2kYW
 RpefmP5kMljnm/nQFOVLwMEUQMuVkvUE7EgxADy7DoDmepBFC4ICRDWPye70R2kC
 0O1Tn2PAQq4Fd1tyI9TYYz0YQQkRoaRb5rfYUSzbRbeCdsphUopp4Vhsiyn6IcnF
 XnLbg6pRAat82MoS9n4pfO/VCo8vkErKA8tut9G7TDakkrJoEE7l31PdKW0hP3f6
 sBo6xXy6WTeivU/o/i8TbM6K4mA37pBaj78ooIkWLgg5fzRaS2+0xSPVy2H9x1m5
 LymHcobCK9rSZ1l208Fe
 =vhxI
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Again, cpufreq gets more changes than the other parts this time (one
  new driver, one old driver less, a bunch of enhancements of the
  existing code, new CPU IDs, fixes, cleanups)

  There also are some changes in cpuidle (idle injection rework, a
  couple of new CPU IDs, online/offline rework in intel_idle, fixes and
  cleanups), in the generic power domains framework (mostly related to
  supporting power domains containing CPUs), and in the Operating
  Performance Points (OPP) library (mostly related to supporting devices
  with multiple voltage regulators)

  In addition to that, the system sleep state selection interface is
  modified to make it easier for distributions with unchanged user space
  to support suspend-to-idle as the default system suspend method, some
  issues are fixed in the PM core, the latency tolerance PM QoS
  framework is improved a bit, the Intel RAPL power capping driver is
  cleaned up and there are some fixes and cleanups in the devfreq
  subsystem

  Specifics:

   - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
     for it (Markus Mayer)

   - Support for ARM Integrator/AP and Integrator/CP in the generic DT
     cpufreq driver and elimination of the old Integrator cpufreq driver
     (Linus Walleij)

   - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
     and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
     Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)

   - cpufreq core fix to eliminate races that may lead to using inactive
     policy objects and related cleanups (Rafael Wysocki)

   - cpufreq schedutil governor update to make it use SCHED_FIFO kernel
     threads (instead of regular workqueues) for doing delayed work (to
     reduce the response latency in some cases) and related cleanups
     (Viresh Kumar)

   - New cpufreq sysfs attribute for resetting statistics (Markus Mayer)

   - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
     Viresh Kumar)

   - Support for using generic cpufreq governors in the intel_pstate
     driver (Rafael Wysocki)

   - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
     Performance Preference/Energy Performance Bias) knobs in the
     intel_pstate driver (Srinivas Pandruvada)

   - New CPU ID for Knights Mill in intel_pstate (Piotr Luc)

   - intel_pstate driver modification to use the P-state selection
     algorithm based on CPU load on platforms with the system profile in
     the ACPI tables set to "mobile" (Srinivas Pandruvada)

   - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
     Srinivas Pandruvada)

   - cpufreq powernv driver updates including fast switching support
     (for the schedutil governor), fixes and cleanus (Akshay Adiga,
     Andrew Donnellan, Denis Kirjanov)

   - acpi-cpufreq driver rework to switch it over to the new CPU
     offline/online state machine (Sebastian Andrzej Siewior)

   - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
     Prakash)

   - Idle injection rework (to make it use the regular idle path instead
     of a home-grown custom one) and related powerclamp thermal driver
     updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
     Siewior)

   - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
     Shevchenko, Piotr Luc)

   - intel_idle driver cleanups and switch over to using the new CPU
     offline/online state machine (Anna-Maria Gleixner, Sebastian
     Andrzej Siewior)

   - cpuidle DT driver update to support suspend-to-idle properly
     (Sudeep Holla)

   - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
     Rafael Wysocki)

   - Preliminary support for power domains including CPUs in the generic
     power domains (genpd) framework and related DT bindings (Lina Iyer)

   - Assorted fixes and cleanups in the generic power domains (genpd)
     framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)

   - Preliminary support for devices with multiple voltage regulators
     and related fixes and cleanups in the Operating Performance Points
     (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)

   - System sleep state selection interface rework to make it easier to
     support suspend-to-idle as the default system suspend method
     (Rafael Wysocki)

   - PM core fixes and cleanups, mostly related to the interactions
     between the system suspend and runtime PM frameworks (Ulf Hansson,
     Sahitya Tummala, Tony Lindgren)

   - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski)

   - New Knights Mill CPU ID for the Intel RAPL power capping driver
     (Piotr Luc)

   - Intel RAPL power capping driver fixes, cleanups and switch over to
     using the new CPU offline/online state machine (Jacob Pan, Thomas
     Gleixner, Sebastian Andrzej Siewior)

   - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
     rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
     Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)

   - Fix for false-positive KASAN warnings during resume from ACPI S3
     (suspend-to-RAM) on x86 (Josh Poimboeuf)

   - Memory map verification during resume from hibernation on x86 to
     ensure a consistent address space layout (Chen Yu)

   - Wakeup sources debugging enhancement (Xing Wei)

   - rockchip-io AVS driver cleanup (Shawn Lin)"

* tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
  devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
  devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
  devfreq: exynos: Don't use OPP structures outside of RCU locks
  Documentation: intel_pstate: Document HWP energy/performance hints
  cpufreq: intel_pstate: Support for energy performance hints with HWP
  cpufreq: intel_pstate: Add locking around HWP requests
  PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
  PM / core: Fix bug in the error handling of async suspend
  PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
  PM / Domains: Fix compatible for domain idle state
  PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
  PM / OPP: Allow platform specific custom set_opp() callbacks
  PM / OPP: Separate out _generic_set_opp()
  PM / OPP: Add infrastructure to manage multiple regulators
  PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
  PM / OPP: Manage supply's voltage/current in a separate structure
  PM / OPP: Don't use OPP structure outside of rcu protected section
  PM / OPP: Reword binding supporting multiple regulators per device
  PM / OPP: Fix incorrect cpu-supply property in binding
  cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
  ..
2016-12-13 10:41:53 -08:00
Linus Torvalds
92c020d08d Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main scheduler changes in this cycle were:

   - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducig a
     notion of 'better cores', which the scheduler will prefer to
     schedule single threaded workloads on. (Tim Chen, Srinivas
     Pandruvada)

   - enhance the handling of asymmetric capacity CPUs further (Morten
     Rasmussen)

   - improve/fix load handling when moving tasks between task groups
     (Vincent Guittot)

   - simplify and clean up the cputime code (Stanislaw Gruszka)

   - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
     Guittot)

   - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)

   - add uaccess atomicity debugging (when using access_ok() in the
     wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)

   - implement various fixes, cleanups and other enhancements (Daniel
     Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  sched/core: Use load_avg for selecting idlest group
  sched/core: Fix find_idlest_group() for fork
  kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
  kthread: Don't use to_live_kthread() in kthread_[un]park()
  kthread: Don't use to_live_kthread() in kthread_stop()
  Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
  kthread: Make struct kthread kmalloc'ed
  x86/uaccess, sched/preempt: Verify access_ok() context
  sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
  sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
  x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
  cpufreq/intel_pstate: Use CPPC to get max performance
  acpi/bus: Set _OSC for diverse core support
  acpi/bus: Enable HWP CPPC objects
  x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
  x86/sysctl: Add sysctl for ITMT scheduling feature
  x86: Enable Intel Turbo Boost Max Technology 3.0
  x86/topology: Define x86's arch_update_cpu_topology
  sched: Extend scheduler's asym packing
  sched/fair: Clean up the tunable parameter definitions
  ...
2016-12-12 12:15:10 -08:00
Rafael J. Wysocki
b19ad3b9f1 Merge branch 'pm-cpuidle'
* pm-cpuidle:
  cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
  cpuidle: fix improper return value on error
  intel_idle: Convert to hotplug state machine
  intel_idle: Remove superfluous SMP fuction call
  MAINTAINERS: Add Jacob Pan as a new intel_idle maintainer
  MAINTAINERS: Add bug tracking system location entries for cpuidle
  x86/intel_idle: Add Knights Mill CPUID
  x86/intel_idle: Add CPU model 0x4a (Atom Z34xx series)
  thermal/intel_powerclamp: stop sched tick in forced idle
  thermal/intel_powerclamp: Convert to CPU hotplug state
  thermal/intel_powerclamp: Convert the kthread to kthread worker API
  thermal/intel_powerclamp: Remove duplicated code that starts the kthread
  sched/idle: Add support for tasks that inject idle
  cpuidle: Allow enforcing deepest idle state selection
  cpuidle/powernv: staticise powernv_idle_driver
  cpuidle: dt: assign ->enter_freeze to same as ->enter callback function
  cpuidle: governors: Remove remaining old module code
2016-12-12 20:46:15 +01:00
Rafael J. Wysocki
fecc8c0ebd Merge branch 'pm-cpufreq'
* pm-cpufreq: (51 commits)
  Documentation: intel_pstate: Document HWP energy/performance hints
  cpufreq: intel_pstate: Support for energy performance hints with HWP
  cpufreq: intel_pstate: Add locking around HWP requests
  cpufreq: ondemand: Set MIN_FREQUENCY_UP_THRESHOLD to 1
  cpufreq: intel_pstate: Add Knights Mill CPUID
  MAINTAINERS: Add bug tracking system location entry for cpufreq
  cpufreq: dt: Add support for zx296718
  cpufreq: acpi-cpufreq: drop rdmsr_on_cpus() usage
  cpufreq: acpi-cpufreq: Convert to hotplug state machine
  cpufreq: intel_pstate: fix intel_pstate_exit_perf_limits() prototype
  cpufreq: intel_pstate: Set EPP/EPB to 0 in performance mode
  cpufreq: schedutil: Rectify comment in sugov_irq_work() function
  cpufreq: intel_pstate: increase precision of performance limits
  cpufreq: intel_pstate: round up min_perf limits
  cpufreq: Make cpufreq_update_policy() void
  ACPI / processor: Make acpi_processor_ppc_has_changed() void
  cpufreq: Avoid using inactive policies
  cpufreq: intel_pstate: Generic governors support
  cpufreq: intel_pstate: Request P-states control from SMM if needed
  cpufreq: dt: Add support for r8a7743 and r8a7745
  ...
2016-12-12 20:45:01 +01:00
Vincent Guittot
6b94780e45 sched/core: Use load_avg for selecting idlest group
find_idlest_group() only compares the runnable_load_avg when looking
for the least loaded group. But on fork intensive use case like
hackbench where tasks blocked quickly after the fork, this can lead to
selecting the same CPU instead of other CPUs, which have similar
runnable load but a lower load_avg.

When the runnable_load_avg of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There is
now 3 zones for the runnable_load of the rq:

 - [0 .. (runnable_load - imbalance)]:
	Select the new rq which has significantly less runnable_load

 - [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
	The runnable loads are close so we use load_avg to chose
	between the 2 rq

 - [(runnable_load + imbalance) .. ULONG_MAX]:
	Keep the current rq which has significantly less runnable_load

The scale factor that is currently used for comparing runnable_load,
doesn't work well with small value. As an example, the use of a
scaling factor fails as soon as this_runnable_load == 0 because we
always select local rq even if min_runnable_load is only 1, which
doesn't really make sense because they are just the same. So instead
of scaling factor, we use an absolute margin for runnable_load to
detect CPUs with similar runnable_load and we keep using scaling
factor for blocked load.

For use case like hackbench, this enable the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for
several kernel. The result below gives min, max, avg and stdev values
of 18 runs with each configuration.

The patches depend on the "no missing update_rq_clock()" work.

hackbench -P -g 1

         ea86cb4b76  7dc603c902  v4.8        v4.8+patches
  min    0.049         0.050         0.051       0,048
  avg    0.057         0.057(0%)     0.057(0%)   0,055(+5%)
  max    0.066         0.068         0.070       0,063
  stdev  +/-9%         +/-9%         +/-8%       +/-9%

More performance numbers here:

  https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:57 +01:00
Vincent Guittot
f519a3f1c6 sched/core: Fix find_idlest_group() for fork
During fork, the utilization of a task is init once the rq has been
selected because the current utilization level of the rq is used to
set the utilization of the fork task. As the task's utilization is
still 0 at this step of the fork sequence, it doesn't make sense to
look for some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test:

   hackbench -P -g 1

because the least loaded policy is always bypassed and tasks are not
spread during fork.

With this patch and the fix below, we are back to same performances as
for v4.8. The fix below is only a temporary one used for the test
until a smarter solution is found because we can't simply remove the
test which is useful for others benchmarks

| @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
|
|	avg_cost = this_sd->avg_scan_cost;
|
| -	/*
| -	 * Due to large variance we need a large fuzz factor; hackbench in
| -	 * particularly is sensitive here.
| -	 */
| -	if ((avg_idle / 512) < avg_cost)
| -		return -1;
| -
|	time = local_clock();
|
|	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:56 +01:00
Ingo Molnar
6643aab30f Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:40 +01:00
Ingo Molnar
6f38751510 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:07:13 +01:00
Ingo Molnar
1b95b1a06c Merge branch 'locking/urgent' into locking/core, to pick up dependent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:44 +01:00
Rafael J. Wysocki
4e28ec3d5f Merge back earlier cpuidle material for v4.10. 2016-12-01 14:39:51 +01:00
Peter Zijlstra
c1de45ca83 sched/idle: Add support for tasks that inject idle
Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
realtime tasks to take control of CPU then inject idle. There are two
issues with this approach:

 1. Low efficiency: injected idle task is treated as busy so sched ticks
    do not stop during injected idle period, the result of these
    unwanted wakeups can be ~20% loss in power savings.

 2. Idle accounting: injected idle time is presented to user as busy.

This patch addresses the issues by introducing a new PF_IDLE flag which
allows any given task to be treated as idle task while the flag is set.
Therefore, idle injection tasks can run through the normal flow of NOHZ
idle enter/exit to get the correct accounting as well as tick stop when
possible.

The implication is that idle task is then no longer limited to PID == 0.

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-29 14:02:21 +01:00