Commit Graph

1362 Commits

Author SHA1 Message Date
Linus Torvalds
22a093b2fb Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Debug info and other statistics fixes and related enhancements"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix numa balancing stats in /proc/pid/sched
  sched/numa: Show numa_group ID in /proc/sched_debug task listings
  sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
  sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
  sched/stat: Simplify the sched_info accounting dependency
2015-07-04 08:56:53 -07:00
Srikar Dronamraju
6b55c9654f sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
Currently print_cfs_rq() is declared in include/linux/sched.h.
However it's not used outside kernel/sched. Hence move the
declaration to kernel/sched/sched.h

Also some functions are only available for CONFIG_SCHED_DEBUG=y.
Hence move the declarations to within the #ifdef.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435252903-1081-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-04 10:04:31 +02:00
Naveen N. Rao
f6db834799 sched/stat: Simplify the sched_info accounting dependency
Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task
sched_info, which results in ugly #if clauses.

Simplify the code by introducing a synthethic CONFIG_SCHED_INFO
switch, selected by both.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: a.p.zijlstra@chello.nl
Cc: ricklind@us.ibm.com
Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-04 10:04:30 +02:00
Linus Torvalds
e22619a29f Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "The main change in this kernel is Casey's generalized LSM stacking
  work, which removes the hard-coding of Capabilities and Yama stacking,
  allowing multiple arbitrary "small" LSMs to be stacked with a default
  monolithic module (e.g.  SELinux, Smack, AppArmor).

  See
        https://lwn.net/Articles/636056/

  This will allow smaller, simpler LSMs to be incorporated into the
  mainline kernel and arbitrarily stacked by users.  Also, this is a
  useful cleanup of the LSM code in its own right"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
  tpm, tpm_crb: fix le64_to_cpu conversions in crb_acpi_add()
  vTPM: set virtual device before passing to ibmvtpm_reset_crq
  tpm_ibmvtpm: remove unneccessary message level.
  ima: update builtin policies
  ima: extend "mask" policy matching support
  ima: add support for new "euid" policy condition
  ima: fix ima_show_template_data_ascii()
  Smack: freeing an error pointer in smk_write_revoke_subj()
  selinux: fix setting of security labels on NFS
  selinux: Remove unused permission definitions
  selinux: enable genfscon labeling for sysfs and pstore files
  selinux: enable per-file labeling for debugfs files.
  selinux: update netlink socket classes
  signals: don't abuse __flush_signals() in selinux_bprm_committed_creds()
  selinux: Print 'sclass' as string when unrecognized netlink message occurs
  Smack: allow multiple labels in onlycap
  Smack: fix seq operations in smackfs
  ima: pass iint to ima_add_violation()
  ima: wrap event related data to the new ima_event_data structure
  integrity: add validity checks for 'path' parameter
  ...
2015-06-27 13:26:03 -07:00
Linus Torvalds
bbe179f88d Merge branch 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - threadgroup_lock got reorganized so that its users can pick the
   actual locking mechanism to use.  Its only user - cgroups - is
   updated to use a percpu_rwsem instead of per-process rwsem.

   This makes things a bit lighter on hot paths and allows cgroups to
   perform and fail multi-task (a process) migrations atomically.
   Multi-task migrations are used in several places including the
   unified hierarchy.

 - Delegation rule and documentation added to unified hierarchy.  This
   will likely be the last interface update from the cgroup core side
   for unified hierarchy before lifting the devel mask.

 - Some groundwork for the pids controller which is scheduled to be
   merged in the coming devel cycle.

* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: add delegation section to unified hierarchy documentation
  cgroup: require write perm on common ancestor when moving processes on the default hierarchy
  cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
  kernfs: make kernfs_get_inode() public
  MAINTAINERS: add a cgroup core co-maintainer
  cgroup: fix uninitialised iterator in for_each_subsys_which
  cgroup: replace explicit ss_mask checking with for_each_subsys_which
  cgroup: use bitmask to filter for_each_subsys
  cgroup: add seq_file forward declaration for struct cftype
  cgroup: simplify threadgroup locking
  sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
  sched, cgroup: reorganize threadgroup locking
  cgroup: switch to unsigned long for bitmasks
  cgroup: reorganize include/linux/cgroup.h
  cgroup: separate out include/linux/cgroup-defs.h
  cgroup: fix some comment typos
2015-06-26 19:50:04 -07:00
Josh Triplett
3033f14ab7 clone: support passing tls argument via C rather than pt_regs magic
clone has some of the quirkiest syscall handling in the kernel, with a
pile of special cases, historical curiosities, and architecture-specific
calling conventions.  In particular, clone with CLONE_SETTLS accepts a
parameter "tls" that the C entry point completely ignores and some
assembly entry points overwrite; instead, the low-level arch-specific
code pulls the tls parameter out of the arch-specific register captured
as part of pt_regs on entry to the kernel.  That's a massive hack, and
it makes the arch-specific code only work when called via the specific
existing syscall entry points; because of this hack, any new clone-like
system call would have to accept an identical tls argument in exactly
the same arch-specific position, rather than providing a unified system
call entry point across architectures.

The first patch allows architectures to handle the tls argument via
normal C parameter passing, if they opt in by selecting
HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
into this.

These two patches came out of the clone4 series, which isn't ready for
this merge window, but these first two cleanup patches were entirely
uncontroversial and have acks.  I'd like to go ahead and submit these
two so that other architectures can begin building on top of this and
opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
send these through the next merge window (along with v3 of clone4) if
anyone would prefer that.

This patch (of 2):

clone with CLONE_SETTLS accepts an argument to set the thread-local
storage area for the new thread.  sys_clone declares an int argument
tls_val in the appropriate point in the argument list (based on the
various CLONE_BACKWARDS variants), but doesn't actually use or pass along
that argument.  Instead, sys_clone calls do_fork, which calls
copy_process, which calls the arch-specific copy_thread, and copy_thread
pulls the corresponding syscall argument out of the pt_regs captured at
kernel entry (knowing what argument of clone that architecture passes tls
in).

Apart from being awful and inscrutable, that also only works because only
one code path into copy_thread can pass the CLONE_SETTLS flag, and that
code path comes from sys_clone with its architecture-specific
argument-passing order.  This prevents introducing a new version of the
clone system call without propagating the same architecture-specific
position of the tls argument.

However, there's no reason to pull the argument out of pt_regs when
sys_clone could just pass it down via C function call arguments.

Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
and a new copy_thread_tls that accepts the tls parameter as an additional
unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
argument to an unsigned long (which does not change the ABI), and pass
that down to copy_thread_tls.

Architectures that don't opt into copy_thread_tls will continue to ignore
the C argument to sys_clone in favor of the pt_regs captured at kernel
entry, and thus will be unable to introduce new versions of the clone
syscall.

Patch co-authored by Josh Triplett and Thiago Macieira.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:38 -07:00
Linus Torvalds
3a95398f54 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ updates from Thomas Gleixner:
 "A few updates to the nohz infrastructure:

   - recursion protection for context tracking

   - make the TIF_NOHZ inheritance smarter

   - isolate cpus which belong to the NOHZ full set"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  nohz: Set isolcpus when nohz_full is set
  nohz: Add tick_nohz_full_add_cpus_to() API
  context_tracking: Inherit TIF_NOHZ through forks instead of context switches
  context_tracking: Protect against recursion
2015-06-22 19:20:04 -07:00
Linus Torvalds
43224b96af Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather largish update for everything time and timer related:

   - Cache footprint optimizations for both hrtimers and timer wheel

   - Lower the NOHZ impact on systems which have NOHZ or timer migration
     disabled at runtime.

   - Optimize run time overhead of hrtimer interrupt by making the clock
     offset updates smarter

   - hrtimer cleanups and removal of restrictions to tackle some
     problems in sched/perf

   - Some more leap second tweaks

   - Another round of changes addressing the 2038 problem

   - First step to change the internals of clock event devices by
     introducing the necessary infrastructure

   - Allow constant folding for usecs/msecs_to_jiffies()

   - The usual pile of clockevent/clocksource driver updates

  The hrtimer changes contain updates to sched, perf and x86 as they
  depend on them plus changes all over the tree to cleanup API changes
  and redundant code, which got copied all over the place.  The y2038
  changes touch s390 to remove the last non 2038 safe code related to
  boot/persistant clock"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  clocksource: Increase dependencies of timer-stm32 to limit build wreckage
  timer: Minimize nohz off overhead
  timer: Reduce timer migration overhead if disabled
  timer: Stats: Simplify the flags handling
  timer: Replace timer base by a cpu index
  timer: Use hlist for the timer wheel hash buckets
  timer: Remove FIFO "guarantee"
  timers: Sanitize catchup_timer_jiffies() usage
  hrtimer: Allow hrtimer::function() to free the timer
  seqcount: Introduce raw_write_seqcount_barrier()
  seqcount: Rename write_seqcount_barrier()
  hrtimer: Fix hrtimer_is_queued() hole
  hrtimer: Remove HRTIMER_STATE_MIGRATE
  selftest: Timers: Avoid signal deadlock in leap-a-day
  timekeeping: Copy the shadow-timekeeper over the real timekeeper last
  clockevents: Check state instead of mode in suspend/resume path
  selftests: timers: Add leap-second timer edge testing to leap-a-day.c
  ntp: Do leapsecond adjustment in adjtimex read path
  time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
  ntp: Introduce and use SECS_PER_DAY macro instead of 86400
  ...
2015-06-22 18:57:44 -07:00
Linus Torvalds
23b7776290 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - lockless wakeup support for futexes and IPC message queues
     (Davidlohr Bueso, Peter Zijlstra)

   - Replace spinlocks with atomics in thread_group_cputimer(), to
     improve scalability (Jason Low)

   - NUMA balancing improvements (Rik van Riel)

   - SCHED_DEADLINE improvements (Wanpeng Li)

   - clean up and reorganize preemption helpers (Frederic Weisbecker)

   - decouple page fault disabling machinery from the preemption
     counter, to improve debuggability and robustness (David
     Hildenbrand)

   - SCHED_DEADLINE documentation updates (Luca Abeni)

   - topology CPU masks cleanups (Bartosz Golaszewski)

   - /proc/sched_debug improvements (Srikar Dronamraju)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
  sched/deadline: Remove needless parameter in dl_runtime_exceeded()
  sched: Remove superfluous resetting of the p->dl_throttled flag
  sched/deadline: Drop duplicate init_sched_dl_class() declaration
  sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
  sched/deadline: Make init_sched_dl_class() __init
  sched/deadline: Optimize pull_dl_task()
  sched/preempt: Add static_key() to preempt_notifiers
  sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
  sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
  sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
  sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
  sched/debug: Properly format runnable tasks in /proc/sched_debug
  sched/numa: Only consider less busy nodes as numa balancing destinations
  Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
  sched/fair: Prevent throttling in early pick_next_task_fair()
  preempt: Reorganize the notrace definitions a bit
  preempt: Use preempt_schedule_context() as the official tracing preemption point
  sched: Make preempt_schedule_context() function-tracing safe
  x86: Remove cpu_sibling_mask() and cpu_core_mask()
  x86: Replace cpu_**_mask() with topology_**_cpumask()
  ...
2015-06-22 15:52:04 -07:00
Linus Torvalds
c58267e9fa Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes mostly consist of work on x86 PMU drivers:

   - x86 Intel PT (hardware CPU tracer) improvements (Alexander
     Shishkin)

   - x86 Intel CQM (cache quality monitoring) improvements (Thomas
     Gleixner)

   - x86 Intel PEBSv3 support (Peter Zijlstra)

   - x86 Intel PEBS interrupt batching support for lower overhead
     sampling (Zheng Yan, Kan Liang)

   - x86 PMU scheduler fixes and improvements (Peter Zijlstra)

  There's too many tooling improvements to list them all - here are a
  few select highlights:

  'perf bench':

      - Introduce new 'perf bench futex' benchmark: 'wake-parallel', to
        measure parallel waker threads generating contention for kernel
        locks (hb->lock). (Davidlohr Bueso)

  'perf top', 'perf report':

      - Allow disabling/enabling events dynamicaly in 'perf top':
        a 'perf top' session can instantly become a 'perf report'
        one, i.e. going from dynamic analysis to a static one,
        returning to a dynamic one is possible, to toogle the
        modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo)

      - Make Ctrl-C stop processing on TUI, allowing interrupting the load of big
        perf.data files (Namhyung Kim)

  'perf probe': (Masami Hiramatsu)

      - Support glob wildcards for function name
      - Support $params special probe argument: Collect all function arguments
      - Make --line checks validate C-style function name.
      - Add --no-inlines option to avoid searching inline functions
      - Greatly speed up 'perf probe --list' by caching debuginfo.
      - Improve --filter support for 'perf probe', allowing using its arguments
        on other commands, as --add, --del, etc.

  'perf sched':

      - Add option in 'perf sched' to merge like comms to lat output (Josef Bacik)

  Plus tons of infrastructure work - in particular preparation for
  upcoming threaded perf report support, but also lots of other work -
  and fixes and other improvements.  See (much) more details in the
  shortlog and in the git log"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits)
  perf tools: Configurable per thread proc map processing time out
  perf tools: Add time out to force stop proc map processing
  perf report: Fix sort__sym_cmp to also compare end of symbol
  perf hists browser: React to unassigned hotkey pressing
  perf top: Tell the user how to unfreeze events after pressing 'f'
  perf hists browser: Honour the help line provided by builtin-{top,report}.c
  perf hists browser: Do not exit when 'f' is pressed in 'report' mode
  perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events
  perf annotate: Rename source_line_percent to source_line_samples
  perf annotate: Display total number of samples with --show-total-period
  perf tools: Ensure thread-stack is flushed
  perf top: Allow disabling/enabling events dynamicly
  perf evlist: Add toggle_enable() method
  perf trace: Fix race condition at the end of started workloads
  perf probe: Speed up perf probe --list by caching debuginfo
  perf probe: Show usage even if the last event is skipped
  perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable
  perf tools: Fix a problem when opening old perf.data with different byte order
  perf tools: Ignore .config-detected in .gitignore
  perf probe: Fix to return error if no probe is added
  ...
2015-06-22 15:19:21 -07:00
Linus Torvalds
1bf7067c6e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes are:

   - 'qspinlock' support, enabled on x86: queued spinlocks - these are
     now the spinlock variant used by x86 as they outperform ticket
     spinlocks in every category.  (Waiman Long)

   - 'pvqspinlock' support on x86: paravirtualized variant of queued
     spinlocks.  (Waiman Long, Peter Zijlstra)

   - 'qrwlock' support, enabled on x86: queued rwlocks.  Similar to
     queued spinlocks, they are now the variant used by x86:

       CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
       CONFIG_QUEUED_SPINLOCKS=y
       CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
       CONFIG_QUEUED_RWLOCKS=y

   - various lockdep fixlets

   - various locking primitives cleanups, further WRITE_ONCE()
     propagation"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  locking/lockdep: Remove hard coded array size dependency
  locking/qrwlock: Don't contend with readers when setting _QW_WAITING
  lockdep: Do not break user-visible string
  locking/arch: Rename set_mb() to smp_store_mb()
  locking/arch: Add WRITE_ONCE() to set_mb()
  rtmutex: Warn if trylock is called from hard/softirq context
  arch: Remove __ARCH_HAVE_CMPXCHG
  locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
  locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
  locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
  locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
  locking/pvqspinlock, x86: Enable PV qspinlock for Xen
  locking/pvqspinlock, x86: Enable PV qspinlock for KVM
  locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
  locking/pvqspinlock: Implement simple paravirt support for the qspinlock
  locking/qspinlock: Revert to test-and-set on hypervisors
  locking/qspinlock: Use a simple write to grab the lock
  locking/qspinlock: Optimize for smaller NR_CPUS
  locking/qspinlock: Extract out code snippets for the next patch
  locking/qspinlock: Add pending bit
  ...
2015-06-22 14:54:22 -07:00
Thomas Gleixner
bc7a34b8b9 timer: Reduce timer migration overhead if disabled
Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:28 +02:00
Oleg Nesterov
9e7c8f8c62 signals: don't abuse __flush_signals() in selinux_bprm_committed_creds()
selinux_bprm_committed_creds()->__flush_signals() is not right, we
shouldn't clear TIF_SIGPENDING unconditionally. There can be other
reasons for signal_pending(): freezing(), JOBCTL_PENDING_MASK, and
potentially more.

Also change this code to check fatal_signal_pending() rather than
SIGNAL_GROUP_EXIT, it looks a bit better.

Now we can kill __flush_signals() before it finds another buggy user.

Note: this code looks racy, we can flush a signal which was sent after
the task SID has been updated.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-06-04 16:22:16 -04:00
Tejun Heo
d59cfc09c3 sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
The cgroup side of threadgroup locking uses signal_struct->group_rwsem
to synchronize against threadgroup changes.  This per-process rwsem
adds small overhead to thread creation, exit and exec paths, forces
cgroup code paths to do lock-verify-unlock-retry dance in a couple
places and makes it impossible to atomically perform operations across
multiple processes.

This patch replaces signal_struct->group_rwsem with a global
percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
side and contained in cgroups proper.  This patch converts one-to-one.

This does make writer side heavier and lower the granularity; however,
cgroup process migration is a fairly cold path, we do want to optimize
thread operations over it and cgroup migration operations don't take
enough time for the lower granularity to matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2015-05-26 20:35:00 -04:00
Tejun Heo
7d7efec368 sched, cgroup: reorganize threadgroup locking
threadgroup_change_begin/end() are used to mark the beginning and end
of threadgroup modifying operations to allow code paths which require
a threadgroup to stay stable across blocking operations to synchronize
against those sections using threadgroup_lock/unlock().

It's currently implemented as a general mechanism in sched.h using
per-signal_struct rwsem; however, this never grew non-cgroup use cases
and becomes noop if !CONFIG_CGROUPS.  It turns out that cgroups is
gonna be better served with a different sycnrhonization scheme and is
a bit silly to keep cgroups specific details as a general mechanism.

What's general here is identifying the places where threadgroups are
modified.  This patch restructures threadgroup locking so that
threadgroup_change_begin/end() become a place where subsystems which
need to sycnhronize against threadgroup changes can hook into.

cgroup_threadgroup_change_begin/end() which operate on the
per-signal_struct rwsem are created and threadgroup_lock/unlock() are
moved to cgroup.c and made static.

This is pure reorganization which doesn't cause any functional
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2015-05-26 20:35:00 -04:00
Peter Zijlstra
80ed87c8a9 sched/wait: Introduce TASK_NOLOAD and TASK_IDLE
Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
all idle kthreads contribute to the loadavg is somewhat silly.

Now mostly this works OK, because kthreads have all their signals
masked. However there's a few sites where this is causing problems and
TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.

This patch adds TASK_NOLOAD which, when combined with
TASK_UNINTERRUPTIBLE avoids the loadavg accounting.

As most of imagined usage sites are loops where a thread wants to
idle, waiting for work, a helper TASK_IDLE is introduced.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 08:39:18 +02:00
David Hildenbrand
8bcbde5480 sched/preempt, mm/fault: Count pagefault_disable() levels in pagefault_disabled
Until now, pagefault_disable()/pagefault_enabled() used the preempt
count to track whether in an environment with pagefaults disabled (can
be queried via in_atomic()).

This patch introduces a separate counter in task_struct to count the
level of pagefault_disable() calls. We'll keep manipulating the preempt
count to retain compatibility to existing pagefault handlers.

It is now possible to verify whether in a pagefault_disable() envionment
by calling pagefault_disabled(). In contrast to in_atomic() it will not
be influenced by preempt_enable()/preempt_disable().

This patch is based on a patch from Ingo Molnar.

Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David.Laight@ACULAB.COM
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: benh@kernel.crashing.org
Cc: bigeasy@linutronix.de
Cc: borntraeger@de.ibm.com
Cc: daniel.vetter@intel.com
Cc: heiko.carstens@de.ibm.com
Cc: herbert@gondor.apana.org.au
Cc: hocko@suse.cz
Cc: hughd@google.com
Cc: mst@redhat.com
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: schwidefsky@de.ibm.com
Cc: yang.shi@windriver.com
Link: http://lkml.kernel.org/r/1431359540-32227-2-git-send-email-dahi@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 08:39:13 +02:00
Frederic Weisbecker
92cf211874 sched/preempt: Merge preempt_mask.h into preempt.h
preempt_mask.h defines all the preempt_count semantics and related
symbols: preempt, softirq, hardirq, nmi, preempt active, need resched,
etc...

preempt.h defines the accessors and mutators of preempt_count.

But there is a messy dependency game around those two header files:

	* preempt_mask.h includes preempt.h in order to access preempt_count()

	* preempt_mask.h defines all preempt_count semantic and symbols
	  except PREEMPT_NEED_RESCHED that is needed by asm/preempt.h
	  Thus we need to define it from preempt.h, right before including
	  asm/preempt.h, instead of defining it to preempt_mask.h with the
	  other preempt_count symbols. Therefore the preempt_count semantics
	  happen to be spread out.

	* We plan to introduce preempt_active_[enter,exit]() to consolidate
	  preempt_schedule*() code. But we'll need to access both preempt_count
	  mutators (preempt_count_add()) and preempt_count symbols
	  (PREEMPT_ACTIVE, PREEMPT_OFFSET). The usual place to define preempt
	  operations is in preempt.h but then we'll need symbols in
	  preempt_mask.h which already includes preempt.h. So we end up with
	  a ressource circle dependency.

Lets merge preempt_mask.h into preempt.h to solve these dependency issues.
This way we gather semantic symbols and operation definition of
preempt_count in a single file.

This is a dumb copy-paste merge. Further merge re-arrangments are
performed in a subsequent patch to ease review.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431441711-29753-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 08:39:11 +02:00
Peter Zijlstra
b92b8b35a2 locking/arch: Rename set_mb() to smp_store_mb()
Since set_mb() is really about an smp_mb() -- not a IO/DMA barrier
like mb() rename it to match the recent smp_load_acquire() and
smp_store_release().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 08:32:00 +02:00
Al Viro
89076bc319 get rid of assorted nameidata-related debris
pointless forward declarations, stale comments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-15 01:10:37 -04:00
Nikolay Borisov
8c8a457a60 sched: Remove redundant #ifdef
Two adjacent members in task_struct were guarded
by the same #define, so we can merge the two blocks.

Signed-off-by: Nikolay Borisov <n.borisov@siteground.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431603061-29408-1-git-send-email-kernel@kyup.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-14 20:04:43 +02:00
NeilBrown
756daf263e VFS: replace {, total_}link_count in task_struct with pointer to nameidata
task_struct currently contains two ad-hoc members for use by the VFS:
link_count and total_link_count.  These are only interesting to fs/namei.c,
so exposing them explicitly is poor layering.  Incidentally, link_count
isn't used anymore, so it can just die.

This patches replaces those with a single pointer to 'struct nameidata'.
This structure represents the current filename lookup of which
there can only be one per process, and is a natural place to
store total_link_count.

This will allow the current "nameidata" argument to all
follow_link operations to be removed as current->nameidata
can be used instead in the _very_ few instances that care about
it at all.

As there are occasional circumstances where pathname lookup can
recurse, such as through kern_path_locked, we always save and old
current->nameidata (if there is one) when setting a new value, and
make sure any active link_counts are preserved.

follow_mount and follow_automount now get a 'struct nameidata *'
rather than 'int flags' so that they can directly access
total_link_count, rather than going through 'current'.

Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:20:14 -04:00
Jason Low
920ce39f6c sched, timer: Fix documentation for 'struct thread_group_cputimer'
Fix the docbook build bug reported by Fengguang Wu.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hp.com
Cc: aswin@hp.com
Cc: bp@alien8.de
Cc: dave@stgolabs.net
Cc: fweisbec@gmail.com
Cc: mgorman@suse.de
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Cc: scott.norton@hp.com
Cc: torvalds@linux-foundation.org
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/1431120710.5136.12.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-10 12:45:27 +02:00
Peter Zijlstra
ff303e66c2 perf: Fix software migrate events
Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it
was borken:

 > The problem is that the task isn't actually scheduled while its being
 > migrated (obviously), and if its not scheduled, the counters aren't
 > scheduled either, so there's no observing of the fact.
 >
 > A further problem with migrations is that many migrations happen from
 > softirq context, which is nested inside the 'random' task context of
 > whoemever happens to run at that time, similarly for the wakeup
 > migrations triggered from (soft)irq context. All those end up being
 > accounted in the task that's currently running, eg. your 'ls'.

The below cures this by marking a task as migrated and accounting it
on the subsequent sched_in().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:25:38 +02:00
Peter Zijlstra
7675104990 sched: Implement lockless wake-queues
This is useful for locking primitives that can effect multiple
wakeups per operation and want to avoid lock internal lock contention
by delaying the wakeups until we've released the lock internal locks.

Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.

Properties of a wake_q are:
- Lockless, as queue head must reside on the stack.
- Being a queue, maintains wakeup order passed by the callers. This can
  be important for otherwise, in scenarios where highly contended locks
  could affect any reliance on lock fairness.
- A queued task cannot be added again until it is woken up.

This patch adds the needed infrastructure into the scheduler code
and uses the new wake_list to delay the futex wakeups until
after we've released the hash bucket locks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[tweaks, adjustments, comments, etc.]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: George Spelvin <linux@horizon.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:20:45 +02:00
Jason Low
7110744516 sched, timer: Use the atomic task_cputime in thread_group_cputimer
Recent optimizations were made to thread_group_cputimer to improve its
scalability by keeping track of cputime stats without a lock. However,
the values were open coded to the structure, causing them to be at
a different abstraction level from the regular task_cputime structure.
Furthermore, any subsequent similar optimizations would not be able to
share the new code, since they are specific to thread_group_cputimer.

This patch adds the new task_cputime_atomic data structure (introduced in
the previous patch in the series) to thread_group_cputimer for keeping
track of the cputime atomically, which also helps generalize the code.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1430251224-5764-6-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:17:46 +02:00
Jason Low
971e8a9854 sched, timer: Provide an atomic 'struct task_cputime' data structure
This patch adds an atomic variant of the 'struct task_cputime' data structure,
which can be used to store and update task_cputime statistics without
needing to do locking.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1430251224-5764-5-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:17:45 +02:00
Jason Low
1018016c70 sched, timer: Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability
While running a database workload, we found a scalability issue with itimers.

Much of the problem was caused by the thread_group_cputimer spinlock.
Each time we account for group system/user time, we need to obtain a
thread_group_cputimer's spinlock to update the timers. On larger systems
(such as a 16 socket machine), this caused more than 30% of total time
spent trying to obtain this kernel lock to update these group timer stats.

This patch converts the timers to 64-bit atomic variables and use
atomic add to update them without a lock. With this patch, the percent
of total time spent updating thread group cputimer timers was reduced
from 30% down to less than 1%.

Note: On 32-bit systems using the generic 64-bit atomics, this causes
sample_group_cputimer() to take locks 3 times instead of just 1 time.
However, we tested this patch on a 32-bit system ARM system using the
generic atomics and did not find the overhead to be much of an issue.
An explanation for why this isn't an issue is that 32-bit systems usually
have small numbers of CPUs, and cacheline contention from extra spinlocks
called periodically is not really apparent on smaller systems.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:15:31 +02:00
Jason Low
316c1608d1 sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()
ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:11:32 +02:00
Palmer Dabbelt
e7cc417311 signals, ptrace, sched: Fix a misaligned load inside ptrace_attach()
The misaligned load exception arises when running ptrace_attach() on
the RISC-V (which hasn't been upstreamed yet).  The problem is that
wait_on_bit() takes a void* but then proceeds to call test_bit(),
which takes a long*.  This allows an int-aligned pointer to be passed
to test_bit(), which promptly fails.  This will manifest on any other
asm-generic port where unaligned loads trap, where sizeof(long) >
sizeof(int), and where task_struct.jobctl ends up not being
long-aligned.

This patch changes task_struct.jobctl to be a long, which ensures it
has the correct alignment.

Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bobby.prani@gmail.com
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: richard@nod.at
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/1430453997-32459-2-git-send-email-palmer@dabbelt.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:06:57 +02:00
Palmer Dabbelt
b76808e680 signals, sched: Change all uses of JOBCTL_* from 'int' to 'long'
c56fb6564dcd ("Fix a misaligned load inside ptrace_attach()") makes
jobctl an "unsigned long".  It makes sense to have the masks applied
to it match that type.  This is currently just a cosmetic change, but
it will prevent the mask from being unexpectedly truncated if we ever
end up with masks with more bits.

One instance of "signr" is an int, but I left this alone because the
mask ensures that it will never overflow.

Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bobby.prani@gmail.com
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: richard@nod.at
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/1430453997-32459-4-git-send-email-palmer@dabbelt.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:04:36 +02:00
Peter Zijlstra
3289bdb429 sched: Move the loadavg code to a more obvious location
I could not find the loadavg code.. turns out it was hidden in a file
called proc.c. It further got mingled up with the cruft per rq load
indexes (which we really want to get rid of).

Move the per rq load indexes into the fair.c load-balance code (that's
the only thing that uses them) and rename proc.c to loadavg.c so we
can find it again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Did minor cleanups to the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:04:12 +02:00
Frederic Weisbecker
fafe870f31 context_tracking: Inherit TIF_NOHZ through forks instead of context switches
TIF_NOHZ is used by context_tracking to force syscall slow-path
on every task in order to track userspace roundtrips. As such,
it must be set on all running tasks.

It's currently explicitly inherited through context switches.
There is no need to do it in this fast-path though. The flag
could simply be set once for all on all tasks, whether they are
running or not.

Lets do this by setting the flag for the init task on early boot,
and let it propagate through fork inheritance.

While at it, mark context_tracking_cpu_set() as init code, we
only need it at early boot time.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Dave Jones <davej@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430928266-24888-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-07 12:02:51 +02:00
Paolo Bonzini
73459e2a1a x86: pvclock: Really remove the sched notifier for cross-cpu migrations
This reverts commits 0a4e6be9ca
and 80f7fdb1c7.

The task migration notifier was originally introduced in order to support
the pvclock vsyscall with non-synchronized TSC, but KVM only supports it
with synchronized TSC.  Hence, on KVM the race condition is only needed
due to a bad implementation on the host side, and even then it's so rare
that it's mostly theoretical.

As far as KVM is concerned it's possible to fix the host, avoiding the
additional complexity in the vDSO and the (re)introduction of the task
migration notifier.

Xen, on the other hand, hasn't yet implemented vsyscall support at
all, so we do not care about its plans for non-synchronized TSC.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-27 15:49:30 +02:00
Linus Torvalds
fa2e5c073a Merge branch 'exec_domain_rip_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc
Pull exec domain removal from Richard Weinberger:
 "This series removes execution domain support from Linux.

  The idea behind exec domains was to support different ABIs.  The
  feature was never complete nor stable.  Let's rip it out and make the
  kernel signal handling code less complicated"

* 'exec_domain_rip_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc: (27 commits)
  arm64: Removed unused variable
  sparc: Fix execution domain removal
  Remove rest of exec domains.
  arch: Remove exec_domain from remaining archs
  arc: Remove signal translation and exec_domain
  xtensa: Remove signal translation and exec_domain
  xtensa: Autogenerate offsets in struct thread_info
  x86: Remove signal translation and exec_domain
  unicore32: Remove signal translation and exec_domain
  um: Remove signal translation and exec_domain
  tile: Remove signal translation and exec_domain
  sparc: Remove signal translation and exec_domain
  sh: Remove signal translation and exec_domain
  s390: Remove signal translation and exec_domain
  mn10300: Remove signal translation and exec_domain
  microblaze: Remove signal translation and exec_domain
  m68k: Remove signal translation and exec_domain
  m32r: Remove signal translation and exec_domain
  m32r: Autogenerate offsets in struct thread_info
  frv: Remove signal translation and exec_domain
  ...
2015-04-15 13:53:55 -07:00
Linus Torvalds
4fd48b45ff Merge branch 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing too interesting.  Rik made cpuset cooperate better with
  isolcpus and there are several other cleanup patches"

* 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset, isolcpus: document relationship between cpusets & isolcpus
  cpusets, isolcpus: exclude isolcpus from load balancing in cpusets
  sched, isolcpu: make cpu_isolated_map visible outside scheduler
  cpuset: initialize cpuset a bit early
  cgroup: Use kvfree in pidlist_free()
  cgroup: call cgroup_subsys->bind on cgroup subsys initialization
2015-04-13 16:47:11 -07:00
Linus Torvalds
49d2953c72 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Major changes:

   - Reworked CPU capacity code, for better SMP load balancing on
     systems with assymetric CPUs. (Vincent Guittot, Morten Rasmussen)

   - Reworked RT task SMP balancing to be push based instead of pull
     based, to reduce latencies on large CPU count systems. (Steven
     Rostedt)

   - SCHED_DEADLINE support updates and fixes. (Juri Lelli)

   - SCHED_DEADLINE task migration support during CPU hotplug. (Wanpeng Li)

   - x86 mwait-idle optimizations and fixes. (Mike Galbraith, Len Brown)

   - sched/numa improvements. (Rik van Riel)

   - various cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  sched/core: Drop debugging leftover trace_printk call
  sched/deadline: Support DL task migration during CPU hotplug
  sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()
  sched/deadline: Always enqueue on previous rq when dl_task_timer() fires
  sched/core: Remove unused argument from init_[rt|dl]_rq()
  sched/deadline: Fix rt runtime corruption when dl fails its global constraints
  sched/deadline: Avoid a superfluous check
  sched: Improve load balancing in the presence of idle CPUs
  sched: Optimize freq invariant accounting
  sched: Move CFS tasks to CPUs with higher capacity
  sched: Add SD_PREFER_SIBLING for SMT level
  sched: Remove unused struct sched_group_capacity::capacity_orig
  sched: Replace capacity_factor by usage
  sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage
  sched: Add struct rq::cpu_capacity_orig
  sched: Make scale_rt invariant with frequency
  sched: Make sched entity usage tracking scale-invariant
  sched: Remove frequency scaling from cpu_capacity
  sched: Track group sched_entity usage contributions
  sched: Add sched_avg::utilization_avg_contrib
  ...
2015-04-13 10:47:34 -07:00
Linus Torvalds
9003601310 The most interesting bit here is irqfd/ioeventfd support for ARM and ARM64.
ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling
 vhost, too), page aging
 
 s390: interrupt handling rework, allowing to inject all local interrupts
 via new ioctl and to get/set the full local irq state for migration
 and introspection.  New ioctls to access memory by virtual address,
 and to get/set the guest storage keys.  SIMD support.
 
 MIPS: FPU and MIPS SIMD Architecture (MSA) support.  Includes some patches
 from Ralf Baechle's MIPS tree.
 
 x86: bugfixes (notably for pvclock, the others are small) and cleanups.
 Another small latency improvement for the TSC deadline timer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJVJ9vmAAoJEL/70l94x66DoMEH/R3rh8IMf4jTiWRkcqohOMPX
 k1+NaSY/lCKayaSgggJ2hcQenMbQoXEOdslvaA/H0oC+VfJGK+lmU6E63eMyyhjQ
 Y+Px6L85NENIzDzaVu/TIWWuhil5PvIRr3VO8cvntExRoCjuekTUmNdOgCvN2ObW
 wswN2qRdPIeEj2kkulbnye+9IV4G0Ne9bvsmUdOdfSSdi6ZcV43JcvrpOZT++mKj
 RrKB+3gTMZYGJXMMLBwMkdl8mK1ozriD+q0mbomT04LUyGlPwYLl4pVRDBqyksD7
 KsSSybaK2E4i5R80WEljgDMkNqrCgNfg6VZe4n9Y+CfAAOToNnkMJaFEi+yuqbs=
 =yu2b
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First batch of KVM changes for 4.1

  The most interesting bit here is irqfd/ioeventfd support for ARM and
  ARM64.

  Summary:

  ARM/ARM64:
     fixes for live migration, irqfd and ioeventfd support (enabling
     vhost, too), page aging

  s390:
     interrupt handling rework, allowing to inject all local interrupts
     via new ioctl and to get/set the full local irq state for migration
     and introspection.  New ioctls to access memory by virtual address,
     and to get/set the guest storage keys.  SIMD support.

  MIPS:
     FPU and MIPS SIMD Architecture (MSA) support.  Includes some
     patches from Ralf Baechle's MIPS tree.

  x86:
     bugfixes (notably for pvclock, the others are small) and cleanups.
     Another small latency improvement for the TSC deadline timer"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
  KVM: use slowpath for cross page cached accesses
  kvm: mmu: lazy collapse small sptes into large sptes
  KVM: x86: Clear CR2 on VCPU reset
  KVM: x86: DR0-DR3 are not clear on reset
  KVM: x86: BSP in MSR_IA32_APICBASE is writable
  KVM: x86: simplify kvm_apic_map
  KVM: x86: avoid logical_map when it is invalid
  KVM: x86: fix mixed APIC mode broadcast
  KVM: x86: use MDA for interrupt matching
  kvm/ppc/mpic: drop unused IRQ_testbit
  KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
  KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
  KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
  KVM: vmx: pass error code with internal error #2
  x86: vdso: fix pvclock races with task migration
  KVM: remove kvm_read_hva and kvm_read_hva_atomic
  KVM: x86: optimize delivery of TSC deadline timer interrupt
  KVM: x86: extract blocking logic from __vcpu_run
  kvm: x86: fix x86 eflags fixed bit
  KVM: s390: migrate vcpu interrupt state
  ...
2015-04-13 09:47:01 -07:00
Richard Weinberger
9058f3b326 Remove rest of exec domains.
It is gone from all archs, now we can remove
the final bits.

Signed-off-by: Richard Weinberger <richard@nod.at>
2015-04-12 21:03:31 +02:00
Vincent Guittot
36ee28e45d sched: Add sched_avg::utilization_avg_contrib
Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running time of the tasks on a runqueue. The latter is
named utilization_load_avg.

This patch is based on the usage metric that was proposed in the 1st
versions of the per-entity load tracking patchset by Paul Turner
<pjt@google.com> but that has be removed afterwards. This version differs from
the original one in the sense that it's not linked to task_group.

The rq's utilization_load_avg will be used to check if a rq is overloaded or
not instead of trying to compute how many tasks a group of CPUs can handle.

Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:35:57 +01:00
Mel Gorman
074c238177 mm: numa: slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Marcelo Tosatti
0a4e6be9ca x86: kvm: Revert "remove sched notifier for cross-cpu migrations"
The following point:

    2. per-CPU pvclock time info is updated if the
       underlying CPU changes.

Is not true anymore since "KVM: x86: update pvclock area conditionally,
on cpu migration".

Add task migration notification back.

Problem noticed by Andy Lutomirski.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
CC: stable@kernel.org # 3.11+
2015-03-23 20:22:48 -03:00
Rik van Riel
3fa0818b3c sched, isolcpu: make cpu_isolated_map visible outside scheduler
Needed by the next patch. Also makes cpu_isolated_map present
when compiled without SMP and/or with CONFIG_NR_CPUS=1, like
the other cpu masks.

At some point we may want to clean things up so cpumasks do
not exist in UP kernels. Maybe something for the CONFIG_TINY
crowd.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: cgroups@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-03-19 14:28:19 -04:00
Linus Torvalds
e2defd0271 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Thiscontains misc fixes: preempt_schedule_common() and io_schedule()
  recursion fixes, sched/dl fixes, a completion_done() revert, two
  sched/rt fixes and a comment update patch"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt: Avoid obvious configuration fail
  sched/autogroup: Fix failure to set cpu.rt_runtime_us
  sched/dl: Do update_rq_clock() in yield_task_dl()
  sched: Prevent recursion in io_schedule()
  sched/completion: Serialize completion_done() with complete()
  sched: Fix preempt_schedule_common() triggering tracing recursion
  sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()
  sched: Make dl_task_time() use task_rq_lock()
  sched: Clarify ordering between task_rq_lock() and move_queued_task()
2015-02-21 10:40:02 -08:00
NeilBrown
9cff8adeaa sched: Prevent recursion in io_schedule()
io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.

Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.

This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().

Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.

So:
 - use ->in_iowait to detect recursion.  Set it earlier, and restore
   it to the old value.
 - move the call to "raw_rq" after the call to blk_flush_plug().
   As this is some sort of per-cpu thing, we want some chance that
   we are on the right CPU
 - When io_schedule() is called recurively, use blk_schedule_flush_plug()
   which cannot further recurse.
 - as this makes io_schedule() a lot more complex and as io_schedule()
   must match io_schedule_timeout(), but all the changes in io_schedule_timeout()
   and make io_schedule a simple wrapper for that.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18 14:27:44 +01:00
Andrey Ryabinin
0b24becc81 kasan: add kernel address sanitizer infrastructure
Kernel Address sanitizer (KASan) is a dynamic memory error detector.  It
provides fast and comprehensive solution for finding use-after-free and
out-of-bounds bugs.

KASAN uses compile-time instrumentation for checking every memory access,
therefore GCC > v4.9.2 required.  v4.9.2 almost works, but has issues with
putting symbol aliases into the wrong section, which breaks kasan
instrumentation of globals.

This patch only adds infrastructure for kernel address sanitizer.  It's
not available for use yet.  The idea and some code was borrowed from [1].

Basic idea:

The main idea of KASAN is to use shadow memory to record whether each byte
of memory is safe to access or not, and use compiler's instrumentation to
check the shadow memory on each memory access.

Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
memory and uses direct mapping with a scale and offset to translate a
memory address to its corresponding shadow address.

Here is function to translate address to corresponding shadow address:

     unsigned long kasan_mem_to_shadow(unsigned long addr)
     {
                return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
     }

where KASAN_SHADOW_SCALE_SHIFT = 3.

So for every 8 bytes there is one corresponding byte of shadow memory.
The following encoding used for each shadow byte: 0 means that all 8 bytes
of the corresponding memory region are valid for access; k (1 <= k <= 7)
means that the first k bytes are valid for access, and other (8 - k) bytes
are not; Any negative value indicates that the entire 8-bytes are
inaccessible.  Different negative values used to distinguish between
different kinds of inaccessible memory (redzones, freed memory) (see
mm/kasan/kasan.h).

To be able to detect accesses to bad memory we need a special compiler.
Such compiler inserts a specific function calls (__asan_load*(addr),
__asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16.

These functions check whether memory region is valid to access or not by
checking corresponding shadow memory.  If access is not valid an error
printed.

Historical background of the address sanitizer from Dmitry Vyukov:

	"We've developed the set of tools, AddressSanitizer (Asan),
	ThreadSanitizer and MemorySanitizer, for user space. We actively use
	them for testing inside of Google (continuous testing, fuzzing,
	running prod services). To date the tools have found more than 10'000
	scary bugs in Chromium, Google internal codebase and various
	open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and
	lots of others): [2] [3] [4].
	The tools are part of both gcc and clang compilers.

	We have not yet done massive testing under the Kernel AddressSanitizer
	(it's kind of chicken and egg problem, you need it to be upstream to
	start applying it extensively). To date it has found about 50 bugs.
	Bugs that we've found in upstream kernel are listed in [5].
	We've also found ~20 bugs in out internal version of the kernel. Also
	people from Samsung and Oracle have found some.

	[...]

	As others noted, the main feature of AddressSanitizer is its
	performance due to inline compiler instrumentation and simple linear
	shadow memory. User-space Asan has ~2x slowdown on computational
	programs and ~2x memory consumption increase. Taking into account that
	kernel usually consumes only small fraction of CPU and memory when
	running real user-space programs, I would expect that kernel Asan will
	have ~10-30% slowdown and similar memory consumption increase (when we
	finish all tuning).

	I agree that Asan can well replace kmemcheck. We have plans to start
	working on Kernel MemorySanitizer that finds uses of unitialized
	memory. Asan+Msan will provide feature-parity with kmemcheck. As
	others noted, Asan will unlikely replace debug slab and pagealloc that
	can be enabled at runtime. Asan uses compiler instrumentation, so even
	if it is disabled, it still incurs visible overheads.

	Asan technology is easily portable to other architectures. Compiler
	instrumentation is fully portable. Runtime has some arch-dependent
	parts like shadow mapping and atomic operation interception. They are
	relatively easy to port."

Comparison with other debugging features:
========================================

KMEMCHECK:

  - KASan can do almost everything that kmemcheck can.  KASan uses
    compile-time instrumentation, which makes it significantly faster than
    kmemcheck.  The only advantage of kmemcheck over KASan is detection of
    uninitialized memory reads.

    Some brief performance testing showed that kasan could be
    x500-x600 times faster than kmemcheck:

$ netperf -l 30
		MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET
		Recv   Send    Send
		Socket Socket  Message  Elapsed
		Size   Size    Size     Time     Throughput
		bytes  bytes   bytes    secs.    10^6bits/sec

no debug:	87380  16384  16384    30.00    41624.72

kasan inline:	87380  16384  16384    30.00    12870.54

kasan outline:	87380  16384  16384    30.00    10586.39

kmemcheck: 	87380  16384  16384    30.03      20.23

  - Also kmemcheck couldn't work on several CPUs.  It always sets
    number of CPUs to 1.  KASan doesn't have such limitation.

DEBUG_PAGEALLOC:
	- KASan is slower than DEBUG_PAGEALLOC, but KASan works on sub-page
	  granularity level, so it able to find more bugs.

SLUB_DEBUG (poisoning, redzones):
	- SLUB_DEBUG has lower overhead than KASan.

	- SLUB_DEBUG in most cases are not able to detect bad reads,
	  KASan able to detect both reads and writes.

	- In some cases (e.g. redzone overwritten) SLUB_DEBUG detect
	  bugs only on allocation/freeing of object. KASan catch
	  bugs right before it will happen, so we always know exact
	  place of first bad read/write.

[1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
[2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs
[3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs
[4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs
[5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies

Based on work by Andrey Konovalov.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:40 -08:00
Cyril Bur
545a2bf742 kernel/sched/clock.c: add another clock for use with the soft lockup watchdog
When the hypervisor pauses a virtualised kernel the kernel will observe a
jump in timebase, this can cause spurious messages from the softlockup
detector.

Whilst these messages are harmless, they are accompanied with a stack
trace which causes undue concern and more problematically the stack trace
in the guest has nothing to do with the observed problem and can only be
misleading.

Futhermore, on POWER8 this is completely avoidable with the introduction
of the Virtual Time Base (VTB) register.

This patch (of 2):

This permits the use of arch specific clocks for which virtualised kernels
can use their notion of 'running' time, not the elpased wall time which
will include host execution time.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Jones <drjones@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: chai wen <chaiw.fnst@cn.fujitsu.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Ben Zhang <benzh@chromium.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:13 -08:00
Andy Lutomirski
f56141e3e2 all arches, signal: move restart_block to struct task_struct
If an attacker can cause a controlled kernel stack overflow, overwriting
the restart block is a very juicy exploit target.  This is because the
restart_block is held in the same memory allocation as the kernel stack.

Moving the restart block to struct task_struct prevents this exploit by
making the restart_block harder to locate.

Note that there are other fields in thread_info that are also easy
targets, at least on some architectures.

It's also a decent simplification, since the restart code is more or less
identical on all architectures.

[james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:12 -08:00
David Drysdale
51f39a1f0c syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts).  The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.

Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.

Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns.  The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
 - https://lkml.org/lkml/2006/12/27/123 is an example of someone
   realizing that fexecve() is likely to fail in a chroot environment.
 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
   documenting the /proc requirement of fexecve(3) in its manpage, to
   "prevent other people from wasting their time".
 - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
   problem where a process that did setuid() could not fexecve()
   because it no longer had access to /proc/self/fd; this has since
   been fixed.

This patch (of 4):

Add a new execveat(2) system call.  execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers.  This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).

Based on patches by Meredydd Luff.

Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
Vladimir Davydov
6f185c290e memcg: turn memcg_kmem_skip_account into a bit field
It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
the task_struct.

Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to
set/clear the flag inline.  Regarding the overwhelming comment to the
helpers, which is removed by this patch too, we already have a compact yet
accurate explanation in memcg_schedule_cache_create, no need in yet
another one.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00