Commit Graph

23358 Commits

Author SHA1 Message Date
Linus Torvalds
4b978934a4 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Expedited grace-period changes, most notably avoiding having user
     threads drive expedited grace periods, using a workqueue instead.

   - Miscellaneous fixes, including a performance fix for lists that was
     sent with the lists modifications.

   - CPU hotplug updates, most notably providing exact CPU-online
     tracking for RCU. This will in turn allow removal of the checks
     supporting RCU's prior heuristic that was based on the assumption
     that CPUs would take no longer than one jiffy to come online.

   - Torture-test updates.

   - Documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  list: Expand list_first_entry_or_null()
  torture: TOROUT_STRING(): Insert a space between flag and message
  rcuperf: Consistently insert space between flag and message
  rcutorture: Print out barrier error as document says
  torture: Add task state to writer-task stall printk()s
  torture: Convert torture_shutdown() to hrtimer
  rcutorture: Convert to hotplug state machine
  cpu/hotplug: Get rid of CPU_STARTING reference
  rcu: Provide exact CPU-online tracking for RCU
  rcu: Avoid redundant quiescent-state chasing
  rcu: Don't use modular infrastructure in non-modular code
  sched: Make wake_up_nohz_cpu() handle CPUs going offline
  rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads
  rcu: Use RCU's online-CPU state for expedited IPI retry
  rcu: Exclude RCU-offline CPUs from expedited grace periods
  rcu: Make expedited RCU CPU stall warnings respond to controls
  rcu: Stop disabling expedited RCU CPU stall warnings
  rcu: Drive expedited grace periods from workqueue
  rcu: Consolidate expedited grace period machinery
  documentation: Record reason for rcu_head two-byte alignment
  ...
2016-10-03 10:29:53 -07:00
Linus Torvalds
72ec94560d Power management material for v4.9-rc1
- Add a mechanism for passing hints from the scheduler to cpufreq governors
    via their utilization update callbacks and use it to introduce "IOwait
    boosting" into the schedutil governor and intel_pstate that will make them
    boost performance if the enqueued task was previously waiting on I/O
    (Rafael Wysocki).
 
  - Fix a schedutil governor problem that causes it to overestimate utilization
    if SMT is in use (Steve Muckle).
 
  - Update defconfigs trying to use the schedutil governor as a module which is
    not possible any more (Javier Martinez Canillas).
 
  - Update the intel_pstate's pstate_sample tracepoint to take "IOwait boosting"
    into account (Srinivas Pandruvada).
 
  - Fix a problem in the cpufreq core causing it to mishandle the initialization
    of CPUs registered after the cpufreq driver (Viresh Kumar, Rafael Wysocki).
 
  - Make the cpufreq-dt driver support per-policy governor tunables, clean it
    up and update its Kconfig description (Viresh Kumar).
 
  - Add support for more ARM platforms to the cpufreq-dt driver (Chanwoo Choi,
    Dave Gerlach, Geert Uytterhoeven).
 
  - Make the cpufreq CPPC driver report frequencies in KHz to avoid user space
    compatiblility issues (Al Stone, Hoan Tran).
 
  - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin Ian King,
    Markus Elfring).
 
  - Constify some local structures in the intel_pstate driver (Julia Lawall).
 
  - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).
 
  - Add support for PM domain removal to the generic power domains (genpd)
    framework, add new DT helper functions to it and make it always enable
    debugfs support if available (Jon Hunter, Tomeu Vizoso).
 
  - Clean up the generic power domains (genpd) framework and make it avoid
    measuring power-on and power-off latencies during system-wide PM transitions
    (Ulf Hansson).
 
  - Add support for the RockChip DFI controller and the rk3399 DMC to the
    devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).
 
  - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski, Stephen
    Rothwell).
 
  - Fix a minor issue in the exynos-ppmu devfreq driver and fix up devfreq
    Kconfig indentation style (Wei Yongjun, Jisheng Zhang).
 
  - Fix the system suspend interface to make suspend-to-idle work if platform
    suspend operations have not been registered (Sudeep Holla).
 
  - Make it possible to use hibernation with PAGE_POISONING_ZERO enabled
    (Anisse Astier).
 
  - Increas the default timeout of the system suspend/resume watchdog and make it
    depend on EXPERT (Chen Yu).
 
  - Make the operating performance points (OPP) framework avoid using OPPs that
    aren't supported by the platform and fix a build warning in it (Dave Gerlach,
    Arnd Bergmann).
 
  - Fix the ARM cpuidle driver's return value (Christophe Jaillet).
 
  - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more common
    logging style (Joe Perches).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJX8Y32AAoJEILEb/54YlRx8e0P/27zu8Lb6Aks1S2Zx9GEW0qr
 DvrO4kklCHqi3DgHlyFOYetf9cxMrUluojVJofnoSDvgAayWyg7VAd4gtOrMGCXG
 pJVJM73itcOUK+DsAVvoWJY3hk15nX77n2aiXPN2GqaMqennlQusdfzTmjCasqpm
 M84j+JwFYlJcfyMCcF5kGWqS7QBjzxhA0CjytUX1i3pL3NqRALZUEpaHwBD1W+4r
 tcF/jYTy3RsghCbuC6HoPxEF9NMOFGxeAXogmu6NvGu8gy0GqtywRSRrs5wA1a0z
 ZDAJ8krrFbzuFPMdjNIE8wtTeziofS5i9piQx3JlIMH3HpNGN86BRXVfzuHzJj11
 6ZMUI/FJy+fYukIXOEeVLtsLHUnMcMux8Jq1UF6N0InahaR9nbsjmGOmXh72+Scx
 7VJ+29l0oVwX6wkw/DjPP3rb1Swd1i3yY0/3uRoJ174mYTjhRGbrbDkIjPiDeuM5
 2Cx7QunscOjFmaNtPyr8niQ+7YhMEpn8VIbGNaX5ABz0fGftfi8nDHqliSNa391Z
 nK6YoKD0O6R0JHE6GavvJTcuMS9qE+HHHOwymWKxEdE9KYk0JUqen3gj1sSTaAZT
 BIPBsn6XlorqNy3dnqtWTHV7Nf0al9huolWvrL90s6g4Bh2BzTzDVydSgNWTMDUi
 G64nP0q1sJTqdoe30uvk
 =NYkv
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Traditionally, cpufreq is the area with the greatest number of
  changes, but there are fewer of them than last time. There also is
  some activity in the generic power domains and the devfreq frameworks,
  a couple of system suspend and hibernation fixes and some assorted
  changes in other places.

  One new feature is the cpufreq change to allow the scheduler to pass
  hints to the governors' utilization update callbacks and some code
  rework based on that. Another one is the support for domain removal in
  the generic power domains framework. Also it is now possible to use
  hibernation with PAGE_POISONING_ZERO enabled and devfreq supports the
  RockChip DFI controller and the rk3399 DMC.

  The rest of the changes is mostly fixes and cleanups in a number of
  places.

  Specifics:

   - Add a mechanism for passing hints from the scheduler to cpufreq
     governors via their utilization update callbacks and use it to
     introduce "IOwait boosting" into the schedutil governor and
     intel_pstate that will make them boost performance if the enqueued
     task was previously waiting on I/O (Rafael Wysocki).

   - Fix a schedutil governor problem that causes it to overestimate
     utilization if SMT is in use (Steve Muckle).

   - Update defconfigs trying to use the schedutil governor as a module
     which is not possible any more (Javier Martinez Canillas).

   - Update the intel_pstate's pstate_sample tracepoint to take "IOwait
     boosting" into account (Srinivas Pandruvada).

   - Fix a problem in the cpufreq core causing it to mishandle the
     initialization of CPUs registered after the cpufreq driver (Viresh
     Kumar, Rafael Wysocki).

   - Make the cpufreq-dt driver support per-policy governor tunables,
     clean it up and update its Kconfig description (Viresh Kumar).

   - Add support for more ARM platforms to the cpufreq-dt driver
     (Chanwoo Choi, Dave Gerlach, Geert Uytterhoeven).

   - Make the cpufreq CPPC driver report frequencies in KHz to avoid
     user space compatiblility issues (Al Stone, Hoan Tran).

   - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin
     Ian King, Markus Elfring).

   - Constify some local structures in the intel_pstate driver (Julia
     Lawall).

   - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).

   - Add support for PM domain removal to the generic power domains
     (genpd) framework, add new DT helper functions to it and make it
     always enable debugfs support if available (Jon Hunter, Tomeu
     Vizoso).

   - Clean up the generic power domains (genpd) framework and make it
     avoid measuring power-on and power-off latencies during system-wide
     PM transitions (Ulf Hansson).

   - Add support for the RockChip DFI controller and the rk3399 DMC to
     the devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).

   - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski,
     Stephen Rothwell).

   - Fix a minor issue in the exynos-ppmu devfreq driver and fix up
     devfreq Kconfig indentation style (Wei Yongjun, Jisheng Zhang).

   - Fix the system suspend interface to make suspend-to-idle work if
     platform suspend operations have not been registered (Sudeep
     Holla).

   - Make it possible to use hibernation with PAGE_POISONING_ZERO
     enabled (Anisse Astier).

   - Increas the default timeout of the system suspend/resume watchdog
     and make it depend on EXPERT (Chen Yu).

   - Make the operating performance points (OPP) framework avoid using
     OPPs that aren't supported by the platform and fix a build warning
     in it (Dave Gerlach, Arnd Bergmann).

   - Fix the ARM cpuidle driver's return value (Christophe Jaillet).

   - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more
     common logging style (Joe Perches)"

* tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
  PM / OPP: Don't support OPP if it provides supported-hw but platform does not
  cpufreq: st: add missing \n to end of dev_err message
  cpufreq: kirkwood: add missing \n to end of dev_err messages
  PM / Domains: Rename pm_genpd_sync_poweron|poweroff()
  PM / Domains: Don't measure latency of ->power_on|off() during system PM
  PM / Domains: Remove redundant system PM callbacks
  PM / Domains: Simplify detaching a device from its genpd
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  PM / OPP: avoid maybe-uninitialized warning
  PM / Domains: Allow holes in genpd_data.domains array
  cpufreq: CPPC: Avoid overflow when calculating desired_perf
  cpufreq: ti: Use generic platdev driver
  cpufreq: intel_pstate: Add io_boost trace
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
  cpufreq: schedutil: Add iowait boosting
  cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
  PM / Domains: Add support for removing nested PM domains by provider
  PM / Domains: Add support for removing PM domains
  ...
2016-10-03 09:33:40 -07:00
Linus Torvalds
7af8a0f808 arm64 updates for 4.9:
- Support for execute-only page permissions
 - Support for hibernate and DEBUG_PAGEALLOC
 - Support for heterogeneous systems with mismatches cache line sizes
 - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
 - arm64 PMU perf updates, including cpumasks for heterogeneous systems
 - Set UTS_MACHINE for building rpm packages
 - Yet another head.S tidy-up
 - Some cleanups and refactoring, particularly in the NUMA code
 - Lots of random, non-critical fixes across the board
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJX7k31AAoJELescNyEwWM0XX0H/iOaWCfKlWOhvBsStGUCsLrK
 XryTzQT2KjdnLKf3jwP+1ateCuBR5ROurYxoDCX5/7mD63c5KiI338Vbv61a1lE1
 AAwjt1stmQVUg/j+kqnuQwB/0DYg+2C8se3D3q5Iyn7zc19cDZJEGcBHNrvLMufc
 XgHrgHgl/rzBDDlHJXleknDFge/MfhU5/Q1vJMRRb4JYrpAtmIokzCO75CYMRcCT
 ND2QbmppKtsyuFPGUTVbAFzJlP6dGKb3eruYta7/ct5d0pJQxav3u98D2yWGfjdM
 YaYq1EmX5Pol7rWumqLtk0+mA9yCFcKLLc+PrJu20Vx0UkvOq8G8Xt70sHNvZU8=
 =gdPM
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "It's a bit all over the place this time with no "killer feature" to
  speak of.  Support for mismatched cache line sizes should help people
  seeing whacky JIT failures on some SoCs, and the big.LITTLE perf
  updates have been a long time coming, but a lot of the changes here
  are cleanups.

  We stray outside arch/arm64 in a few areas: the arch/arm/ arch_timer
  workaround is acked by Russell, the DT/OF bits are acked by Rob, the
  arch_timer clocksource changes acked by Marc, CPU hotplug by tglx and
  jump_label by Peter (all CC'd).

  Summary:

   - Support for execute-only page permissions
   - Support for hibernate and DEBUG_PAGEALLOC
   - Support for heterogeneous systems with mismatches cache line sizes
   - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
   - arm64 PMU perf updates, including cpumasks for heterogeneous systems
   - Set UTS_MACHINE for building rpm packages
   - Yet another head.S tidy-up
   - Some cleanups and refactoring, particularly in the NUMA code
   - Lots of random, non-critical fixes across the board"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (100 commits)
  arm64: tlbflush.h: add __tlbi() macro
  arm64: Kconfig: remove SMP dependence for NUMA
  arm64: Kconfig: select OF/ACPI_NUMA under NUMA config
  arm64: fix dump_backtrace/unwind_frame with NULL tsk
  arm/arm64: arch_timer: Use archdata to indicate vdso suitability
  arm64: arch_timer: Work around QorIQ Erratum A-008585
  arm64: arch_timer: Add device tree binding for A-008585 erratum
  arm64: Correctly bounds check virt_addr_valid
  arm64: migrate exception table users off module.h and onto extable.h
  arm64: pmu: Hoist pmu platform device name
  arm64: pmu: Probe default hw/cache counters
  arm64: pmu: add fallback probe table
  MAINTAINERS: Update ARM PMU PROFILING AND DEBUGGING entry
  arm64: Improve kprobes test for atomic sequence
  arm64/kvm: use alternative auto-nop
  arm64: use alternative auto-nop
  arm64: alternative: add auto-nop infrastructure
  arm64: lse: convert lse alternatives NOP padding to use __nops
  arm64: barriers: introduce nops and __nops macros for NOP sequences
  arm64: sysreg: replace open-coded mrs_s/msr_s with {read,write}_sysreg_s
  ...
2016-10-03 08:58:35 -07:00
David S. Miller
b50afd203a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes.  Nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:20:41 -04:00
Rafael J. Wysocki
993eb0aeae Merge branches 'pm-devfreq' and 'pm-sleep'
* pm-devfreq:
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  PM / devfreq: rockchip: add devfreq driver for rk3399 dmc
  Documentation: bindings: add dt documentation for rk3399 dmc
  PM / devfreq: event: support rockchip dfi controller
  Documentation: bindings: add dt documentation for dfi controller
  PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
  PM / devfreq: fix Kconfig indent style
  PM / devfreq: Add COMPILE_TEST for build coverage
  PM / devfreq: exynos-ppmu: remove unneeded of_node_put()

* pm-sleep:
  PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
  PM / sleep: enable suspend-to-idle even without registered suspend_ops
  PM / sleep: Increase default DPM watchdog timeout to 120
2016-10-02 01:43:45 +02:00
Rafael J. Wysocki
7005f6dc69 Merge branch 'pm-cpufreq'
* pm-cpufreq: (24 commits)
  cpufreq: st: add missing \n to end of dev_err message
  cpufreq: kirkwood: add missing \n to end of dev_err messages
  cpufreq: CPPC: Avoid overflow when calculating desired_perf
  cpufreq: ti: Use generic platdev driver
  cpufreq: intel_pstate: Add io_boost trace
  cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
  cpufreq: schedutil: Add iowait boosting
  cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
  cpufreq: CPPC: Force reporting values in KHz to fix user space interface
  cpufreq: create link to policy only for registered CPUs
  intel_pstate: constify local structures
  cpufreq: dt: Support governor tunables per policy
  cpufreq: dt: Update kconfig description
  cpufreq: dt: Remove unused code
  MAINTAINERS: Add Documentation/cpu-freq/
  cpufreq: dt: Add support for r8a7792
  cpufreq / sched: ignore SMT when determining max cpu capacity
  cpufreq: Drop unnecessary check from cpufreq_policy_alloc()
  ARM: multi_v7_defconfig: Don't attempt to enable schedutil governor as module
  ARM: exynos_defconfig: Don't attempt to enable schedutil governor as module
  ...
2016-10-02 01:42:45 +02:00
Rafael J. Wysocki
b6e2511782 Merge branch 'pm-cpufreq-sched' into pm-cpufreq 2016-10-02 01:42:33 +02:00
Eric W. Biederman
d29216842a mnt: Add a per mount namespace limit on the number of mounts
CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-30 12:46:48 -05:00
Thomas Gleixner
d7e25c66c9 Merge branch 'x86/urgent' into x86/asm
Get the cr4 fixes so we can apply the final cleanup
2016-09-30 12:38:28 +02:00
Frederic Weisbecker
447976ef4f sched/irqtime: Consolidate irqtime flushing code
The code performing irqtime nsecs stats flushing to kcpustat is roughly
the same for hardirq and softirq. So lets consolidate that common code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:41 +02:00
Frederic Weisbecker
19d23dbfeb sched/irqtime: Consolidate accounting synchronization with u64_stats API
The irqtime accounting currently implement its own ad hoc implementation
of u64_stats API. Lets rather consolidate it with the appropriate
library.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:40 +02:00
Frederic Weisbecker
2810f611f9 sched/irqtime: Remove needless IRQs disablement on kcpustat update
The callers of the functions performing irqtime kcpustat updates have
IRQS disabled, no need to disable them again.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:39 +02:00
Frederic Weisbecker
f9094a6575 sched/irqtime: No need for preempt-safe accessors
We can safely use the preempt-unsafe accessors for irqtime when we
flush its counters to kcpustat as IRQs are disabled at this time.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:38 +02:00
Peter Zijlstra
b60205c7c5 sched/fair: Fix min_vruntime tracking
While going through enqueue/dequeue to review the movement of
set_curr_task() I noticed that the (2nd) update_min_vruntime() call in
dequeue_entity() is suspect.

It turns out, its actually wrong because it will consider
cfs_rq->curr, which could be the entry we just normalized. This mixes
different vruntime forms and leads to fail.

The purpose of the second update_min_vruntime() is to move
min_vruntime forward if the entity we just removed is the one that was
holding it back; _except_ for the DEQUEUE_SAVE case, because then we
know its a temporary removal and it will come back.

However, since we do put_prev_task() _after_ dequeue(), cfs_rq->curr
will still be set (and per the above, can be tranformed into a
different unit), so update_min_vruntime() should also consider
curr->on_rq. This also fixes another corner case where the enqueue
(which also does update_curr()->update_min_vruntime()) happens on the
rq->lock break in schedule(), between dequeue and put_prev_task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 1e87623178 ("sched: Fix ->min_vruntime calculation in dequeue_entity()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:29 +02:00
Peter Zijlstra
9148a3a10e sched/debug: Add SCHED_WARN_ON()
Provide SCHED_WARN_ON as wrapper for WARN_ON_ONCE() to avoid
CONFIG_SCHED_DEBUG wrappery.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:29 +02:00
Peter Zijlstra
49bd21efe7 sched/core: Fix set_user_nice()
Almost all scheduler functions update state with the following
pattern:

	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE);
	if (running)
		put_prev_task(rq, p);

	/* update state */

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE);
	if (running)
		set_curr_task(rq, p);

set_user_nice() however misses the running part, cure this.

This was found by asserting we never enqueue 'current'.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:28 +02:00
Peter Zijlstra
b2bf6c314e sched/fair: Introduce set_curr_task() helper
Now that the ia64 only set_curr_task() symbol is gone, provide a
helper just like put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:28 +02:00
Peter Zijlstra
a458ae2ea6 sched/core, ia64: Rename set_curr_task()
Rename the ia64 only set_curr_task() function to free up the name.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:27 +02:00
Vincent Guittot
a399d23307 sched/core: Fix incorrect utilization accounting when switching to fair class
When a task switches to fair scheduling class, the period between now
and the last update of its utilization is accounted as running time
whatever happened during this period. This incorrect accounting applies
to the task and also to the task group branch.

When changing the property of a running task like its list of allowed
CPUs or its scheduling class, we follow the sequence:

 - dequeue task
 - put task
 - change the property
 - set task as current task
 - enqueue task

The end of the sequence doesn't follow the normal sequence (as per
__schedule()) which is:

 - enqueue a task
 - then set the task as current task.

This incorrectordering is the root cause of incorrect utilization accounting.
Update the sequence to follow the right one:

 - dequeue task
 - put task
 - change the property
 - enqueue task
 - set task as current task

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: linaro-kernel@lists.linaro.org
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1473666472-13749-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:27 +02:00
Peter Zijlstra
1b568f0aab sched/core: Optimize SCHED_SMT
Avoid pointless SCHED_SMT code when running on !SMT hardware.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:26 +02:00
Peter Zijlstra
10e2f1acd0 sched/core: Rewrite and improve select_idle_siblings()
select_idle_siblings() is a known pain point for a number of
workloads; it either does too much or not enough and sometimes just
does plain wrong.

This rewrite attempts to address a number of issues (but sadly not
all).

The current code does an unconditional sched_domain iteration; with
the intent of finding an idle core (on SMT hardware). The problems
which this patch tries to address are:

 - its pointless to look for idle cores if the machine is real busy;
   at which point you're just wasting cycles.

 - it's behaviour is inconsistent between SMT and !SMT hardware in
   that !SMT hardware ends up doing a scan for any idle CPU in the LLC
   domain, while SMT hardware does a scan for idle cores and if that
   fails, falls back to a scan for idle threads on the 'target' core.

The new code replaces the sched_domain scan with 3 explicit scans:

 1) search for an idle core in the LLC
 2) search for an idle CPU in the LLC
 3) search for an idle thread in the 'target' core

where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
heuristics to skip the step.

Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
siblings of the CPU going idle. Similarly, we clear
sd_llc_shared->has_idle_cores when we fail to find an idle core.

Step 2) tracks the average cost of the scan and compares this to the
average idle time guestimate for the CPU doing the wakeup. There is a
significant fudge factor involved to deal with the variability of the
averages. Esp. hackbench was sensitive to this.

Step 3) is unconditional; we assume (also per step 1) that scanning
all SMT siblings in a core is 'cheap'.

With this; SMT systems gain step 2, which cures a few benchmarks --
notably one from Facebook.

One 'feature' of the sched_domain iteration, which we preserve in the
new code, is that it would start scanning from the 'target' CPU,
instead of scanning the cpumask in cpu id order. This avoids multiple
CPUs in the LLC scanning for idle to gang up and find the same CPU
quite as much. The down side is that tasks can end up hopping across
the LLC for no apparent reason.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:09 +02:00
Ingo Molnar
0b429e18c2 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:46 +02:00
Peter Zijlstra
0e369d7575 sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
location into the much more natural sched_domain_shared location.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:07 +02:00
Peter Zijlstra
24fc7edb92 sched/core: Introduce 'struct sched_domain_shared'
Since struct sched_domain is strictly per cpu; introduce a structure
that is shared between all 'identical' sched_domains.

Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
for shared cache state; if another use comes up later we can easily
relax this.

While the sched_group's are normally shared between CPUs, these are
not natural to use when we need some shared state on a domain level --
since that would require the domain to have a parent, which is not a
given.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:06 +02:00
Peter Zijlstra
16f3ef4680 sched/core: Restructure destroy_sched_domain()
There is no point in doing a call_rcu() for each domain, only do a
callback for the root sched domain and clean up the entire set in one
go.

Also make the entire call chain be called destroy_sched_domain*() to
remove confusion with the free_sched_domains() call, which does an
entirely different thing.

Both cpu_attach_domain() callers of destroy_sched_domain() can live
without the call_rcu() because at those points the sched_domain hasn't
been published yet.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:06 +02:00
Peter Zijlstra
f39180efe5 sched/core: Remove unused @cpu argument from destroy_sched_domain*()
Small cleanup; nothing uses the @cpu argument so make it go away.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:05 +02:00
Oleg Nesterov
0176beaffb sched/wait: Introduce init_wait_entry()
The partial initialization of wait_queue_t in prepare_to_wait_event() looks
ugly. This was done to shrink .text, but we can simply add the new helper
which does the full initialization and shrink the compiled code a bit more.

And. This way prepare_to_wait_event() can have more users. In particular we
are ready to remove the signal_pending_state() checks from wait_bit_action_f
helpers and change __wait_on_bit_lock() to use prepare_to_wait_event().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160906140055.GA6167@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:03 +02:00
Oleg Nesterov
eaf9ef5224 sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
__wait_on_bit_lock() doesn't need abort_exclusive_wait() too. Right
now it can't use prepare_to_wait_event() (see the next change), but
it can do the additional finish_wait() if action() fails.

abort_exclusive_wait() no longer has callers, remove it.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160906140053.GA6164@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:03 +02:00
Oleg Nesterov
b1ea06a90f sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
___wait_event() doesn't really need abort_exclusive_wait(), we can simply
change prepare_to_wait_event() to remove the waiter from q->task_list if
it was interrupted.

This simplifies the code/logic, and this way prepare_to_wait_event() can
have more users, see the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160908164815.GA18801@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
--
 include/linux/wait.h |    7 +------
 kernel/sched/wait.c  |   35 +++++++++++++++++++++++++----------
 2 files changed, 26 insertions(+), 16 deletions(-)
2016-09-30 10:53:44 +02:00
Oleg Nesterov
38a3e1fc1d sched/wait: Fix abort_exclusive_wait(), it should pass TASK_NORMAL to wake_up()
Otherwise this logic only works if mode is "compatible" with another
exclusive waiter.

If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
abort_exclusive_wait() won't wait an uninterruptible waiter.

The main user is __wait_on_bit_lock() and currently it is fine but only
because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
lock_page_interruptible() yet.

Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
Yes, this means that (say) wake_up_interruptible() can wake up the non-
interruptible waiter(s), but I think this is fine. And in fact I think
that abort_exclusive_wait() must die, see the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:53:19 +02:00
Dietmar Eggemann
ab522e33f9 sched/fair: Fix fixed point arithmetic width for shares and effective load
Since commit:

  2159197d66 ("sched/core: Enable increased load resolution on 64-bit kernels")

we now have two different fixed point units for load:

- 'shares' in calc_cfs_shares() has 20 bit fixed point unit on 64-bit
  kernels. Therefore use scale_load() on MIN_SHARES.

- 'wl' in effective_load() has 10 bit fixed point unit. Therefore use
  scale_load_down() on tg->shares which has 20 bit fixed point unit on
  64-bit kernels.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471874441-24701-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:53:19 +02:00
Tim Chen
8f37961cf2 sched/core, x86/topology: Fix NUMA in package topology bug
Current code can call set_cpu_sibling_map() and invoke sched_set_topology()
more than once (e.g. on CPU hot plug).  When this happens after
sched_init_smp() has been called, we lose the NUMA topology extension to
sched_domain_topology in sched_init_numa().  This results in incorrect
topology when the sched domain is rebuilt.

This patch fixes the bug and issues warning if we call sched_set_topology()
after sched_init_smp().

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1474485552-141429-2-git-send-email-srinivas.pandruvada@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:53:18 +02:00
Ingo Molnar
536e0e81e0 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:44:27 +02:00
Eric Dumazet
4cd13c21b2 softirq: Let ksoftirqd do its job
A while back, Paolo and Hannes sent an RFC patch adding threaded-able
napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)

The problem seems to be that softirqs are very aggressive and are often
handled by the current process, even if we are under stress and that
ksoftirqd was scheduled, so that innocent threads would have more chance
to make progress.

This patch makes sure that if ksoftirq is running, we let it
perform the softirq work.

Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/

Tested:

 - NIC receiving traffic handled by CPU 0
 - UDP receiver running on CPU 0, using a single UDP socket.
 - Incoming flood of UDP packets targeting the UDP socket.

Before the patch, the UDP receiver could almost never get CPU cycles and
could only receive ~2,000 packets per second.

After the patch, CPU cycles are split 50/50 between user application and
ksoftirqd/0, and we can effectively read ~900,000 packets per second,
a huge improvement in DOS situation. (Note that more packets are now
dropped by the NIC itself, since the BH handlers get less CPU cycles to
drain RX ring buffer)

Since the load runs in well identified threads context, an admin can
more easily tune process scheduling parameters if needed.

Reported-by: Paolo Abeni <pabeni@redhat.com>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472665349.14381.356.camel@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:43:36 +02:00
Colin Ian King
d282b9c0ac tracing/syscalls: fix multiline in error message text
pr_info message spans two lines and the literal string is missing
a white space between words. Add the white space.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-09-29 10:25:23 +02:00
Josef Bacik
484611357c bpf: allow access into map value arrays
Suppose you have a map array value that is something like this

struct foo {
	unsigned iter;
	int array[SOME_CONSTANT];
};

You can easily insert this into an array, but you cannot modify the contents of
foo->array[] after the fact.  This is because we have no way to verify we won't
go off the end of the array at verification time.  This patch provides a start
for this work.  We accomplish this by keeping track of a minimum and maximum
value a register could be while we're checking the code.  Then at the time we
try to do an access into a MAP_VALUE we verify that the maximum offset into that
region is a valid access into that memory region.  So in practice, code such as
this

unsigned index = 0;

if (foo->iter >= SOME_CONSTANT)
	foo->iter = index;
else
	index = foo->iter++;
foo->array[index] = bar;

would be allowed, as we can verify that index will always be between 0 and
SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an extra
check to make sure the index isn't less than 0, or do something like index %=
SOME_CONSTANT.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-29 01:35:35 -04:00
Shaohua Li
b761fe226b bpf: clean up put_cpu_var usage
put_cpu_var takes the percpu data, not the data returned from
get_cpu_var.

This doesn't change the behavior.

Cc: Tejun Heo <tj@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 22:09:17 -04:00
Arnd Bergmann
9dcfcda576 compat: remove compat_printk()
After 7e8e385aaf ("x86/compat: Remove sys32_vm86_warning"), this
function has become unused, so we can remove it as well.

Link: http://lkml.kernel.org/r/20160617142903.3070388-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-09-27 21:20:53 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Linus Torvalds
8ab293e3a1 Merge branch 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Three late fixes for cgroup: Two cpuset ones, one trivial and the
  other pretty obscure, and a cgroup core fix for a bug which impacts
  cgroup v2 namespace users"

* 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix invalid controller enable rejections with cgroup namespace
  cpuset: fix non static symbol warning
  cpuset: handle race between CPU hotplug and cpuset_hotplug_work
2016-09-27 16:43:11 -07:00
Alexey Dobriyan
9b80a184ea fs/file: more unsigned file descriptors
Propagate unsignedness for grand total of 149 bytes:

	$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
	add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
	function                                     old     new   delta
	set_close_on_exec                             99      98      -1
	put_files_struct                             201     200      -1
	get_close_on_exec                             59      58      -1
	do_prlimit                                   498     497      -1
	do_execveat_common.isra                     1662    1661      -1
	__close_fd                                   178     173      -5
	do_dup2                                      219     204     -15
	seq_show                                     685     660     -25
	__alloc_fd                                   384     357     -27
	dup_fd                                       718     646     -72

It mostly comes from converting "unsigned int" to "long" for bit operations.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:47:38 -04:00
Miklos Szeredi
2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Miklos Szeredi
e0e0be8a83 libfs: support RENAME_NOREPLACE in simple_rename()
This is trivial to do:

 - add flags argument to simple_rename()
 - check if flags doesn't have any other than RENAME_NOREPLACE
 - assign simple_rename() to .rename2 instead of .rename

Filesystems converted:

hugetlbfs, ramfs, bpf.

Debugfs uses simple_rename() to implement debugfs_rename(), which is for
debugfs instances to rename files internally, not for userspace filesystem
access.  For this case pass zero flags to simple_rename().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
2016-09-27 11:03:57 +02:00
Mickaël Salaün
1955351da4 bpf: Set register type according to is_valid_access()
This prevent future potential pointer leaks when an unprivileged eBPF
program will read a pointer value from its context. Even if
is_valid_access() returns a pointer type, the eBPF verifier replace it
with UNKNOWN_VALUE. The register value that contains a kernel address is
then allowed to leak. Moreover, this fix allows unprivileged eBPF
programs to use functions with (legitimate) pointer arguments.

Not an issue currently since reg_type is only set for PTR_TO_PACKET or
PTR_TO_PACKET_END in XDP and TC programs that can only be loaded as
privileged. For now, the only unprivileged eBPF program allowed is for
socket filtering and all the types from its context are UNKNOWN_VALUE.
However, this fix is important for future unprivileged eBPF programs
which could use pointers in their context.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 03:51:34 -04:00
Linus Torvalds
4c04b4b534 Al Viro has been looking at the tracefs code, and has pointed out
some issues. This contains one fix by me and one by Al. I'm sure that
 he'll come up with more but for now I tested these patches and they
 don't appear to have any negative impact on tracing.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJX6FvrAAoJEKKk/i67LK/8EuIH/Arf6vJidYsmbe57WQp8PU3I
 bldem6ePj6zgZ2ZqPlSGCs1J2DcK4Bh3lPVxdx7rRKVWSd/Zoj+i83hvObusR8M7
 Qs1G92bJTvvVO3aPfiN0GvKGdKfGn45L+j0BcBauiTRKqnj3PkhOhIP2/ks0ewSk
 qeq7R3xxo/FDs26AHS69Hm0PIIw7btyhXNX4GB3Il7IIA5/nUknw3C+bjVj86tYX
 R4iElcHEhplgoSjKuLgNIRZGUnEFtsm/fnohYXpHacLTUKNXnTDY230x/OKc1yyB
 1vOfHS/y5s3XSJ1lcgSjYeNc51lK8NiDASaptZSUnOookKSAooUTFELNzpbc0sg=
 =+Fr3
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracefs fixes from Steven Rostedt:
 "Al Viro has been looking at the tracefs code, and has pointed out some
  issues.  This contains one fix by me and one by Al.  I'm sure that
  he'll come up with more but for now I tested these patches and they
  don't appear to have any negative impact on tracing"

* tag 'trace-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  fix memory leaks in tracing_buffers_splice_read()
  tracing: Move mutex to protect against resetting of seq data
2016-09-25 18:40:13 -07:00
Wei Yongjun
b8129a1f6a genirq: Make function __irq_do_set_handler() static
Fixes the following sparse warning:

kernel/irq/chip.c:786:1: warning:
 symbol '__irq_do_set_handler' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Link: http://lkml.kernel.org/r/1474817799-18676-1-git-send-email-weiyj.lk@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-25 16:46:52 -04:00
Al Viro
1ae2293dd6 fix memory leaks in tracing_buffers_splice_read()
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-25 13:30:13 -04:00
Steven Rostedt (Red Hat)
1245800c0f tracing: Move mutex to protect against resetting of seq data
The iter->seq can be reset outside the protection of the mutex. So can
reading of user data. Move the mutex up to the beginning of the function.

Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Cc: stable@vger.kernel.org # 2.6.30+
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-25 10:27:08 -04:00
Linus Torvalds
9c0e28a7be Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Three fixlets for perf:

   - add a missing NULL pointer check in the intel BTS driver

   - make BTS an exclusive PMU because BTS can only handle one event at
     a time

   - ensure that exclusive events are limited to one PMU so that several
     exclusive events can be scheduled on different PMU instances"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Limit matching exclusive events to one PMU
  perf/x86/intel/bts: Make it an exclusive PMU
  perf/x86/intel/bts: Make sure debug store is valid
2016-09-24 12:44:28 -07:00
Linus Torvalds
4b8b0ff60f Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "Three fixes for irq core and irq chip drivers:

   - Do not set the irq type if type is NONE.  Fixes a boot regression
     on various SoCs

   - Use the proper cpu for setting up the GIC target list.  Discovered
     by the cpumask debugging code.

   - A rather large fix for the MIPS-GIC so per cpu local interrupts
     work again.  This was discovered late because the code falls back
     to slower timers which use normal device interrupts"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mips-gic: Fix local interrupts
  irqchip/gicv3: Silence noisy DEBUG_PER_CPU_MAPS warning
  genirq: Skip chained interrupt trigger setup if type is IRQ_TYPE_NONE
2016-09-24 12:30:12 -07:00
Tejun Heo
9157056da8 cgroup: fix invalid controller enable rejections with cgroup namespace
On the v2 hierarchy, "cgroup.subtree_control" rejects controller
enables if the cgroup has processes in it.  The enforcement of this
logic assumes that the cgroup wouldn't have any css_sets associated
with it if there are no tasks in the cgroup, which is no longer true
since a79a908fd2 ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the
creating task to use it as the root css_set of the namespace.  This
extra reference stays as long as the namespace is around and makes
"cgroup.subtree_control" think that the namespace root cgroup is not
empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of
each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate
that the returned value may be higher than the number of tasks, which
has always been true due to temporary references and doesn't break
anything.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Cc: Aditya Kali <adityakali@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541
2016-09-23 16:55:49 -04:00
Masami Hiramatsu
a0d0c6216a tracing: Call traceoff trigger after event is recorded
Call traceoff trigger after the event is recorded.
Since current traceoff trigger is called before recording
the event, we can not know what event stopped tracing.

Typical usecase of traceoff/traceon trigger is tracing
function calls and trace events between a pair of events.
For example, trace function calls between syscall entry/exit.
In that case, it is useful if we can see the return code
of the target syscall.

Link: http://lkml.kernel.org/r/147335074530.12462.4526186083406015005.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-23 09:47:59 -04:00
David S. Miller
d6989d4bbe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-23 06:46:57 -04:00
Ingo Molnar
739f1bcd04 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-23 07:20:33 +02:00
Eric W. Biederman
7872559664 Merge branch 'nsfs-ioctls' into HEAD
From: Andrey Vagin <avagin@openvz.org>

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system.  Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble.  I think this may actually a case for creating ioctls
> for these two cases.  Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since  Linux  4.X,  the  following  ioctl(2)  calls are supported for
namespace file descriptors.  The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning user names‐
      pace.

NS_GET_PARENT
      Returns  a  file descriptor that refers to a parent namespace.
      This ioctl(2) can be used for pid  and  user  namespaces.  For
      user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
      meaning.

In addition to generic ioctl(2) errors, the following  specific  ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is outside of the current namespace
      scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
  outside of the init namespace, so we can return EPERM in this case too.
  > The fewer special cases the easier the code is to get
  > correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
  grabs a reference.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
2016-09-22 20:00:36 -05:00
Andrey Vagin
a7306ed8d9 nsfs: add ioctl to get a parent namespace
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships.

In a future we will use this interface to dump and restore nested
namespaces.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 19:59:41 -05:00
Andrey Vagin
bcac25a58b kernel: add a helper to get an owning user namespace for a namespace
Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 19:59:39 -05:00
Rob Herring
ce836c2974 kvmconfig: add virtio-gpu to config fragment
virtio-gpu is used for VMs, so add it to the kvm config.

Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvmarm@lists.cs.columbia.edu
Cc: kvm@vger.kernel.org
[expanded "frag" to "fragment" in summary]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:13 +02:00
Rob Herring
bd6c92221d config: move x86 kvm_guest.config to a common location
kvm_guest.config is useful for KVM guests on other arches, and nothing
in it appears to be x86 specific, so just move the whole file. Kbuild
will find it in either location.

Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvmarm@lists.cs.columbia.edu
Cc: kvm@vger.kernel.org
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:12 +02:00
Eric W. Biederman
df75e7748b userns: When the per user per user namespace limit is reached return ENOSPC
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong.  I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC.  It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:25:56 -05:00
Peter Zijlstra
d32cdbfb0b locking/lglock: Remove lglock implementation
It is now unused, remove it before someone else thinks its a good idea
to use this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:25:56 +02:00
Oleg Nesterov
e625397041 stop_machine: Remove stop_cpus_lock and lg_double_lock/unlock()
stop_two_cpus() and stop_cpus() use stop_cpus_lock to avoid the deadlock,
we need to ensure that the stopper functions can't be queued "backwards"
from one another. This doesn't look nice; if we use lglock then we do not
really need stopper->lock, cpu_stop_queue_work() could use lg_local_lock()
under local_irq_save().

OTOH it would be even better to avoid lglock in stop_machine.c and remove
lg_double_lock(). This patch adds "bool stop_cpus_in_progress" set/cleared
by queue_stop_cpus_work(), and changes cpu_stop_queue_two_works() to busy
wait until it is cleared.

queue_stop_cpus_work() sets stop_cpus_in_progress = T lockless, but after
it queues a work on CPU1 it must be visible to stop_two_cpus(CPU1, CPU2)
which checks it under the same lock. And since stop_two_cpus() holds the
2nd lock too, queue_stop_cpus_work() can not clear stop_cpus_in_progress
if it is also going to queue a work on CPU2, it needs to take that 2nd
lock to do this.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151121181148.GA433@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:25:55 +02:00
Pan Xinhui
b193049375 locking/pv-qspinlock: Use cmpxchg_release() in __pv_queued_spin_unlock()
cmpxchg_release() is more lighweight than cmpxchg() on some archs(e.g.
PPC), moreover, in __pv_queued_spin_unlock() we only needs a RELEASE in
the fast path(pairing with *_try_lock() or *_lock()). And the slow path
has smp_store_release too. So it's safe to use cmpxchg_release here.

Suggested-by:  Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: benh@kernel.crashing.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: mpe@ellerman.id.au
Cc: paulmck@linux.vnet.ibm.com
Cc: paulus@samba.org
Cc: virtualization@lists.linux-foundation.org
Cc: waiman.long@hpe.com
Link: http://lkml.kernel.org/r/1474277037-15200-2-git-send-email-xinhui.pan@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:25:51 +02:00
Ingo Molnar
7cf0f1426a Merge branch 'locking/urgent' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:21:48 +02:00
Peter Zijlstra
a18a579e5f sched/debug: Hide printk() by default
Dietmar accidentally added an unconditional sched domain printk. Hide
it behind the normal sched_debug flag.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: cd92bfd3b8 ("sched/core: Store maximum per-CPU capacity in root domain")
[ Fixed !SCHED_DEBUG build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:20:25 +02:00
Srivatsa Vaddagiri
8bf46a39be sched/fair: Fix SCHED_HRTICK bug leading to late preemption of tasks
SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime).

It makes use of a per-CPU hrtimer resource and hence arming that
hrtimer should be based on total SCHED_FAIR tasks a CPU has across its
various cfs_rqs, rather than being based on number of tasks in a
particular cfs_rq (as implemented currently).

As a result, with current code, its possible for a running task (which
is the sole task in its cfs_rq) to be preempted much after its
ideal_runtime has elapsed, resulting in increased latency for tasks in
other cfs_rq on same CPU.

Fix this by arming sched hrtimer based on total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.

Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1474075731-11550-1-git-send-email-joonwoop@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:20:18 +02:00
Alexander Shishkin
3bf6215a1b perf/core: Limit matching exclusive events to one PMU
An "exclusive" PMU is the one that can only have one event scheduled in
at any given time. There may be more than one of such PMUs in a system,
though, like Intel PT and BTS. It should be allowed to have one event
for either of those inside the same context (there may be other constraints
that may prevent this, but those would be hardware-specific). However,
the exclusivity code is written so that only one event from any of the
"exclusive" PMUs is allowed in a context.

Fix this by making the exclusive event filter explicitly match two events'
PMUs.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160920154811.3255-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:56:09 +02:00
Peter Zijlstra
35a773a079 sched/core: Avoid _cond_resched() for PREEMPT=y
On fully preemptible kernels _cond_resched() is pointless, so avoid
emitting any code for it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:53:46 +02:00
Peter Zijlstra
9af6528ee9 sched/core: Optimize __schedule()
Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD
context switch, we can avoid the TASK_DEAD special case currently in
__schedule() because that avoids the extra preempt_disable() from
schedule().

In order to facilitate this, create a do_task_dead() helper which we
place in the scheduler code, such that it can access __schedule().

Also add some __noreturn annotations to the functions, there's no
coming back from do_exit().

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Cheng Chao <cs.os.kernel@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: chris@chris-wilson.co.uk
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:53:45 +02:00
Cheng Chao
bf89a30472 stop_machine: Avoid a sleep and wakeup in stop_one_cpu()
In case @cpu == smp_proccessor_id(), we can avoid a sleep+wakeup
cycle by doing a preemption.

Callers such as sched_exec() can benefit from this change.

Signed-off-by: Cheng Chao <cs.os.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: chris@chris-wilson.co.uk
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1473818510-6779-1-git-send-email-cs.os.kernel@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:53:45 +02:00
Cheng Chao
0b8473570c sched/core: Remove unnecessary initialization in sched_init()
init_idle() is called immediately after:

  current->sched_class = &fair_sched_class;

init_idle() sets:

  current->sched_class = &idle_sched_class;

First assignment is superfluous.

Signed-off-by: Cheng Chao <cs.os.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1473819536-7398-1-git-send-email-cs.os.kernel@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:53:44 +02:00
Ingo Molnar
50797851b4 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:49:40 +02:00
Peter Zijlstra
8db549491c smp: Allocate smp_call_on_cpu() workqueue on stack too
The SMP IPI struct descriptor is allocated on the stack except for the
workqueue and lockdep complains:

  INFO: trying to register non-static key.
  the code is fine but needs lockdep annotation.
  turning off the locking correctness validator.
  CPU: 0 PID: 110 Comm: kworker/0:1 Not tainted 4.8.0-rc5+ #14
  Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
  Workqueue: events smp_call_on_cpu_callback
  ...
  Call Trace:
    dump_stack
    register_lock_class
    ? __lock_acquire
    __lock_acquire
    ? __lock_acquire
    lock_acquire
    ? process_one_work
    process_one_work
    ? process_one_work
    worker_thread
    ? process_one_work
    ? process_one_work
    kthread
    ? kthread_create_on_node
    ret_from_fork

So allocate it on the stack too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Test and write commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160911084323.jhtnpb4b37t5tlno@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:49:10 +02:00
Con Kolivas
4fa5cd5245 sched/core: Do not use smp_processor_id() with preempt enabled in smpboot_thread_fn()
We should not be using smp_processor_id() with preempt enabled.

Bug identified and fix provided by Alfred Chen.

Reported-by: Alfred Chen <cchalpha@gmail.com>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Alfred Chen <cchalpha@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/2042051.3vvUWIM0vs@hex
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 12:28:00 +02:00
Jakub Kicinski
6b17387307 bpf: recognize 64bit immediate loads as consts
When running as parser interpret BPF_LD | BPF_IMM | BPF_DW
instructions as loading CONST_IMM with the value stored
in imm.  The verifier will continue not recognizing those
due to concerns about search space/program complexity
increase.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:02 -04:00
Jakub Kicinski
13a27dfc66 bpf: enable non-core use of the verfier
Advanced JIT compilers and translators may want to use
eBPF verifier as a base for parsers or to perform custom
checks and validations.

Add ability for external users to invoke the verifier
and provide callbacks to be invoked for every intruction
checked.  For now only add most basic callback for
per-instruction pre-interpretation checks is added.  More
advanced users may also like to have per-instruction post
callback and state comparison callback.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:02 -04:00
Jakub Kicinski
58e2af8b3a bpf: expose internal verfier structures
Move verifier's internal structures to a header file and
prefix their names with bpf_ to avoid potential namespace
conflicts.  Those structures will soon be used by external
analyzers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:02 -04:00
Jakub Kicinski
3df126f35f bpf: don't (ab)use instructions to store state
Storing state in reserved fields of instructions makes
it impossible to run verifier on programs already
marked as read-only. Allocate and use an array of
per-instruction state instead.

While touching the error path rename and move existing
jump target.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:02 -04:00
Daniel Borkmann
36bbef52c7 bpf: direct packet write and access for helpers for clsact progs
This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.

For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both, read and
write, because they would otherwise only be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.

At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a220 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely and has read access for the most parts with,
for example, drop return codes, then such write action can be delegated
to a tail called program for mitigating this cost of potential uncloning
to a late point in time where it would have been paid similarly with the
bpf_skb_store_bytes() as well. Advantage of direct write is that the
writes are inlined whereas the helper cannot make any length assumptions
and thus needs to generate a call to memcpy() also for small sizes, as well
as cost of helper call itself with sanity checks are avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb->data), so more complex programs using rewrites
can benefit from switching to direct read plus write.

For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. bpf_csum_update() helper accommodates
for the fact that we need to fixup checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning the programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:32:11 -04:00
Daniel Borkmann
b399cf64e3 bpf, verifier: enforce larger zero range for pkt on overloading stack buffs
Current contract for the following two helper argument types is:

  * ARG_CONST_STACK_SIZE: passed argument pair must be (ptr, >0).
  * ARG_CONST_STACK_SIZE_OR_ZERO: passed argument pair can be either
    (NULL, 0) or (ptr, >0).

With 6841de8b0d ("bpf: allow helpers access the packet directly"), we can
pass also raw packet data to helpers, so depending on the argument type
being PTR_TO_PACKET, we now either assert memory via check_packet_access()
or check_stack_boundary(). As a result, the tests in check_packet_access()
currently allow more than intended with regards to reg->imm.

Back in 969bf05eb3 ("bpf: direct packet access"), check_packet_access()
was fine to ignore size argument since in check_mem_access() size was
bpf_size_to_bytes() derived and prior to the call to check_packet_access()
guaranteed to be larger than zero.

However, for the above two argument types, it currently means, we can have
a <= 0 size and thus breaking current guarantees for helpers. Enforce a
check for size <= 0 and bail out if so.

check_stack_boundary() doesn't have such an issue since it already tests
for access_size <= 0 and bails out, resp. access_size == 0 in case of NULL
pointer passed when allowed.

Fixes: 6841de8b0d ("bpf: allow helpers access the packet directly")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:32:11 -04:00
Thomas Gleixner
464b5847e6 Merge branch 'irq/urgent' into irq/core
Merge urgent fixes so pending patches for 4.9 can be applied.
2016-09-20 23:20:32 +02:00
Ingo Molnar
b2c16e1efd Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-20 08:29:21 +02:00
Johannes Weiner
d979a39d72 cgroup: duplicate cgroup reference when cloning sockets
When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Sebastian Andrzej Siewior
30e92153b4 padata: Convert to hotplug state machine
Install the callbacks via the state machine. CPU-hotplug multinstance support
is used with the nocalls() version. Maybe parts of padata_alloc() could be
moved into the online callback so that we could invoke ->startup callback for
instance and drop get_online_cpus().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160906170457.32393-14-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19 21:44:30 +02:00
Marc Zyngier
1984e07591 genirq: Skip chained interrupt trigger setup if type is IRQ_TYPE_NONE
There is no point in trying to configure the trigger of a chained
interrupt if no trigger information has been configured. At best
this is ignored, and at the worse this confuses the underlying
irqchip (which is likely not to handle such a thing), and
unnecessarily alarms the user.

Only apply the configuration if type is not IRQ_TYPE_NONE.

Fixes: 1e12c4a939 ("genirq: Correctly configure the trigger on chained interrupts")
Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/CAMuHMdVW1eTn20=EtYcJ8hkVwohaSuH_yQXrY2MGBEvZ8fpFOg@mail.gmail.com
Link: http://lkml.kernel.org/r/1474274967-15984-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-19 11:31:36 +02:00
Wei Yongjun
8a15b81741 cpuset: fix non static symbol warning
Fixes the following sparse warning:

kernel/cpuset.c:2088:6: warning:
 symbol 'cpuset_fork' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-09-16 11:31:17 -04:00
Andy Lutomirski
ac496bf48d fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
vmalloc() is a bit slow, and pounding vmalloc()/vfree() will eventually
force a global TLB flush.

To reduce pressure on them, if CONFIG_VMAP_STACK=y, cache two thread
stacks per CPU.  This will let us quickly allocate a hopefully
cache-hot, TLB-hot stack under heavy forking workloads (shell script style).

On my silly pthread_create() benchmark, it saves about 2 µs per
pthread_create()+join() with CONFIG_VMAP_STACK=y.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 09:18:54 +02:00
Andy Lutomirski
68f24b08ee sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
We currently keep every task's stack around until the task_struct
itself is freed.  This means that we keep the stack allocation alive
for longer than necessary and that, under load, we free stacks in
big batches whenever RCU drops the last task reference.  Neither of
these is good for reuse of cache-hot memory, and freeing in batches
prevents us from usefully caching small numbers of vmalloced stacks.

On architectures that have thread_info on the stack, we can't easily
change this, but on architectures that set THREAD_INFO_IN_TASK, we
can free it as soon as the task is dead.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 09:18:54 +02:00
Oleg Nesterov
23196f2e5f kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
get_task_struct(tsk) no longer pins tsk->stack so all users of
to_live_kthread() should do try_get_task_stack/put_task_stack to protect
"struct kthread" which lives on kthread's stack.

TODO: Kill to_live_kthread(), perhaps we can even kill "struct kthread" too,
and rework kthread_stop(), it can use task_work_add() to sync with the exiting
kernel thread.

Message-Id: <20160629180357.GA7178@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cb9b16bbc19d4aea4507ab0552e4644c1211d130.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 09:18:53 +02:00
Ingo Molnar
2d8fbcd13e Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Expedited grace-period changes, most notably avoiding having
   user threads drive expedited grace periods, using a workqueue
   instead.

 - Miscellaneous fixes, including a performance fix for lists
   that was sent with the lists modifications (second URL below).

 - CPU hotplug updates, most notably providing exact CPU-online
   tracking for RCU.  This will in turn allow removal of the
   checks supporting RCU's prior heuristic that was based on the
   assumption that CPUs would take no longer than one jiffy to
   come online.

 - Torture-test updates.

 - Documentation updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 09:08:43 +02:00
Thomas Gleixner
0a30d69195 Merge branch 'irq/for-block' into irq/core
Add the new irq spreading infrastructure.
2016-09-15 20:54:40 +02:00
Andy Lutomirski
c65eacbe29 sched/core: Allow putting thread_info into task_struct
If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT,
then thread_info is defined as a single 'u32 flags' and is the first
entry of task_struct.  thread_info::task is removed (it serves no
purpose if thread_info is embedded in task_struct), and
thread_info::cpu gets its own slot in task_struct.

This is heavily based on a patch written by Linus.

Originally-from: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:25:13 +02:00
Ingo Molnar
d4b80afbba Merge branch 'linus' into x86/asm, to pick up recent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:24:53 +02:00
Thomas Gleixner
44082fd670 genirq/affinity: Remove old irq spread infrastructure
No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: axboe@fb.com
Cc: keith.busch@intel.com
Cc: agordeev@redhat.com
Cc: linux-block@vger.kernel.org
Link: http://lkml.kernel.org/r/1473862739-15032-5-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-14 22:11:09 +02:00
Thomas Gleixner
e75eafb9b0 genirq/msi: Switch to new irq spreading infrastructure
Switch MSI over to the new spreading code. If a pci device contains a valid
pointer to a cpumask, then this mask is used for spreading otherwise the
online cpu mask is used. This allows a driver to restrict the spread to a
subset of CPUs, e.g. cpus on a particular node.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: axboe@fb.com
Cc: keith.busch@intel.com
Cc: agordeev@redhat.com
Cc: linux-block@vger.kernel.org
Link: http://lkml.kernel.org/r/1473862739-15032-4-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-14 22:11:09 +02:00
Thomas Gleixner
34c3d9819f genirq/affinity: Provide smarter irq spreading infrastructure
The current irq spreading infrastructure is just looking at a cpumask and
tries to spread the interrupts over the mask. Thats suboptimal as it does
not take numa nodes into account.

Change the logic so the interrupts are spread across numa nodes and inside
the nodes. If there are more cpus than vectors per node, then we set the
affinity to several cpus. If HT siblings are available we take that into
account and try to set all siblings to a single vector.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: axboe@fb.com
Cc: keith.busch@intel.com
Cc: agordeev@redhat.com
Cc: linux-block@vger.kernel.org
Link: http://lkml.kernel.org/r/1473862739-15032-3-git-send-email-hch@lst.de
2016-09-14 22:11:08 +02:00
Thomas Gleixner
28f4b04143 genirq/msi: Add cpumask allocation to alloc_msi_entry
For irq spreading want to store affinity masks in the msi_entry. Add the
infrastructure for it.

We allocate an array of cpumasks with an array size of the number of used
vectors in the entry, so we can hand in the information per linux interrupt
later.

As we hand in the number of used vectors, we assign them right
away. Convert all the call sites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: axboe@fb.com
Cc: keith.busch@intel.com
Cc: agordeev@redhat.com
Cc: linux-block@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/1473862739-15032-2-git-send-email-hch@lst.de
2016-09-14 22:11:08 +02:00
Paul E. McKenney
d74b62bc32 Merge branches 'doc.2016.08.22c', 'exp.2016.08.22c', 'fixes.2016.09.14a', 'hotplug.2016.08.22c' and 'torture.2016.08.22c' into HEAD
doc.2016.08.22c: Documentation updates
exp.2016.08.22c: Expedited grace-period updates
fixes.2016.09.14a: Miscellaneous fixes
hotplug.2016.08.22c: CPU-hotplug changes
torture.2016.08.22c: Torture-test changes
2016-09-14 12:58:49 -07:00
Dmitry Safonov
6846351052 x86/signal: Add SA_{X32,IA32}_ABI sa_flags
Introduce new flags that defines which ABI to use on creating sigframe.
Those flags kernel will set according to sigaction syscall ABI,
which set handler for the signal being delivered.

So that will drop the dependency on TIF_IA32/TIF_X32 flags on signal deliver.
Those flags will be used only under CONFIG_COMPAT.

Similar way ARM uses sa_flags to differ in which mode deliver signal
for 26-bit applications (look at SA_THIRYTWO).

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: 0x7f454c46@gmail.com
Cc: oleg@redhat.com
Cc: linux-mm@kvack.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-7-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-14 21:28:11 +02:00
Thomas Gleixner
16217dc79d First drop of irqchip updates for 4.9
- ACPI IORT core code
 - IORT support for the GICv3 ITS
 - A few of GIC cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX2VxlAAoJECPQ0LrRPXpD7tMP/i696FakX2xYknlVNaBy5vww
 eH1TLVpItkcfUCHbOyNzJcLcVVRnC4JuaiCphtgDQ0Zm4HJsl8LRFoKF9iBF/Wmt
 CvL5YLfkl6ziMZyPa23a9ApH5gKTY2QzirDAl+noF+N6tSsmz5JCXArW0YFawZJs
 o1tmh/XX0xb9cqB5f/jISxTNF7rPw/Dc1sDY3/p7DUch2TDjuTLOQljnJ2EFb8Mh
 QltBs9EbklYCKaSBVIHXmhAOBCaW8Nwm2BNscgNEQAH1EtjBjGK/aqmFqGzxBWms
 wXr8GHSNhnVsgpOzalG6yJzhtWcj4KNf1utZaNc0L8dT1bVc0yJEEUkxOEbi4pIO
 sst+BWo2FAe/cDyWWxwiSBLaO7M5SCTvsBN25AYMfZw9AU6pw6SMCdbm6BCWSPmh
 YUW7khrXObtNTl+0lroi/mlmIE3sq+UVpQWDzfwsvfWb1kgqIif2Fd6TqU+Jodl5
 pK/UzMHP3YQnAZjcn6Yz4sEWkAGgK8RrIPje1Th0mxi20pyx8VbAucwfSCUi5hza
 9mbaaidnLRvDVX38TNs/LzOejwfW3JKDJUPy9jLg3l07Fgron+jfLtCZOEm9XfJH
 nj/MN3g+yil9OeosJ+6NfzZKonJrfccpkVKkZrNT+ReeM5YFQylgX0OHRQT9eUDo
 z3P5/VhsoTcyqynxMcRp
 =Xhg0
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Merge the first drop of irqchip updates for 4.9 from Marc Zyngier:

- ACPI IORT core code
- IORT support for the GICv3 ITS
- A few of GIC cleanups
2016-09-14 20:53:26 +02:00