Commit Graph

19148 Commits

Author SHA1 Message Date
Davidlohr Bueso
a122949100 locktorture: Cannot hold read and write lock
... trigger an error if so.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-30 00:10:02 -07:00
Davidlohr Bueso
219f800f99 locktorture: Fix __acquire annotation for spinlock irq
It's quite easy to get mixed up with the names -- 'torture_spinlock_irq'
is not actually a valid spinlock name.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-30 00:10:02 -07:00
Davidlohr Bueso
e34191fad8 locktorture: Support rwlocks
Add a "rw_lock" torture test to stress kernel rwlocks and their irq
variant. Reader critical regions are 5x longer than writers. As such
a similar ratio of lock acquisitions is seen in the statistics. In the
case of massive contention, both hold the lock for 1/10 of a second.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-30 00:10:00 -07:00
Rafael J. Wysocki
905563ff47 Merge back earlier 'pm-sleep' material for v3.18. 2014-09-29 15:33:26 +02:00
Dan Williams
7bced39751 net_dma: simple removal
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.

This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.

Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():

    https://lkml.org/lkml/2014/9/3/177

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-28 07:05:16 -07:00
Linus Torvalds
6111da3432 Merge branch 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This is quite late but these need to be backported anyway.

  This is the fix for a long-standing cpuset bug which existed from
  2009.  cpuset makes use of PF_SPREAD_{PAGE|SLAB} flags to modify the
  task's memory allocation behavior according to the settings of the
  cpuset it belongs to; unfortunately, when those flags have to be
  changed, cpuset did so directly even while the target task is running,
  which is obviously racy as task->flags may be modified by the task
  itself at any time.  This obscure bug manifested as corrupt
  PF_USED_MATH flag leading to a weird crash.

  The bug is fixed by moving the flag to task->atomic_flags.  The first
  two are preparatory ones to help define atomic_flags accessors and the
  third one is the actual fix"

* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
  sched: add macros to define bitops for task atomic flags
  sched: fix confusing PFA_NO_NEW_PRIVS constant
2014-09-27 16:45:33 -07:00
Alexei Starovoitov
3c731eba48 bpf: mini eBPF library, test stubs and verifier testsuite
1.
the library includes a trivial set of BPF syscall wrappers:
int bpf_create_map(int key_size, int value_size, int max_entries);
int bpf_update_elem(int fd, void *key, void *value);
int bpf_lookup_elem(int fd, void *key, void *value);
int bpf_delete_elem(int fd, void *key);
int bpf_get_next_key(int fd, void *key, void *next_key);
int bpf_prog_load(enum bpf_prog_type prog_type,
		  const struct sock_filter_int *insns, int insn_len,
		  const char *license);
bpf_prog_load() stores verifier log into global bpf_log_buf[] array

and BPF_*() macros to build instructions

2.
test stubs configure eBPF infra with 'unspec' map and program types.
These are fake types used by user space testsuite only.

3.
verifier tests valid and invalid programs and expects predefined
error log messages from kernel.
40 tests so far.

$ sudo ./test_verifier
 #0 add+sub+mul OK
 #1 unreachable OK
 #2 unreachable2 OK
 #3 out of range jump OK
 #4 out of range jump2 OK
 #5 test1 ld_imm64 OK
 ...

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
17a5267067 bpf: verifier (add verifier core)
This patch adds verifier core which simulates execution of every insn and
records the state of registers and program stack. Every branch instruction seen
during simulation is pushed into state stack. When verifier reaches BPF_EXIT,
it pops the state from the stack and continues until it reaches BPF_EXIT again.
For program:
1: bpf_mov r1, xxx
2: if (r1 == 0) goto 5
3: bpf_mov r0, 1
4: goto 6
5: bpf_mov r0, 2
6: bpf_exit
The verifier will walk insns: 1, 2, 3, 4, 6
then it will pop the state recorded at insn#2 and will continue: 5, 6
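
The walk described above can be sketched in user space as follows. This is only an illustration under stated assumptions: the toy_* names and the toy instruction encoding are hypothetical, and the sketch pushes only the branch target, whereas the real verifier pushes the full register and stack state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy encoding; the real verifier operates on eBPF insns. */
enum toy_op { TOY_MOV, TOY_JEQ, TOY_GOTO, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int target;              /* 1-based jump target for TOY_JEQ / TOY_GOTO */
};

#define TOY_MAX 16

/*
 * Walk every path of a loop-free program starting at insn 1.
 * Each visited insn number is appended to trace[]; the number of
 * entries written is returned.
 */
static size_t toy_walk(const struct toy_insn *prog, int len,
                       int *trace, size_t cap)
{
    int stack[TOY_MAX];      /* pending branch targets */
    int sp = 0;
    size_t n = 0;
    int pc = 1;

    assert(len > 0 && len <= TOY_MAX);
    for (;;) {
        assert(pc >= 1 && pc <= len && n < cap);
        trace[n++] = pc;
        const struct toy_insn *insn = &prog[pc - 1];

        switch (insn->op) {
        case TOY_JEQ:
            assert(sp < TOY_MAX);
            stack[sp++] = insn->target;  /* explore the taken branch later */
            pc++;                        /* continue with the fall-through */
            break;
        case TOY_GOTO:
            pc = insn->target;
            break;
        case TOY_EXIT:
            if (sp == 0)
                return n;                /* no pending branches: done */
            pc = stack[--sp];            /* resume at a recorded branch */
            break;
        default:
            pc++;
            break;
        }
    }
}
```

Feeding the six-insn program from the message into toy_walk() visits 1, 2, 3, 4, 6 and then 5, 6.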

This way it walks all possible paths through the program and checks all
possible values of registers. While doing so, it checks for:
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention
- instruction encoding is not using reserved fields

Kernel subsystem configures the verifier with two callbacks:

- bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
  that provides information to the verifier which fields of 'ctx'
  are accessible (remember 'ctx' is the first argument to eBPF program)

- const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
  returns argument constraints of kernel helper functions that eBPF program
  may call, so that verifier can check that R1-R5 types match the prototype

More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
475fb78fbf bpf: verifier (add branch/goto checks)
check that control flow graph of eBPF program is a directed acyclic graph

check_cfg() does:
- detect loops
- detect unreachable instructions
- check that program terminates with BPF_EXIT insn
- check that all branches are within program boundary
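
One way to picture these four checks is a depth-first search with three-colour marking. The sketch below is an assumption-laden illustration, not the code in kernel/bpf/verifier.c: the cfg_* names, the toy instruction encoding and the error codes are hypothetical, and falling off the end of the program is treated as "no exit".

```c
#include <assert.h>

/* Hypothetical toy encoding of control flow only. */
enum cfg_op { CFG_OTHER, CFG_JCOND, CFG_JA, CFG_EXIT };

struct cfg_insn {
    enum cfg_op op;
    int off;                 /* jump offset relative to the next insn */
};

enum cfg_err { CFG_OK, CFG_LOOP, CFG_UNREACHABLE, CFG_BAD_JUMP, CFG_NO_EXIT };
enum { CFG_WHITE, CFG_GREY, CFG_BLACK };

#define CFG_MAX 64

static enum cfg_err cfg_visit(const struct cfg_insn *prog, int len, int i,
                              unsigned char *color)
{
    int succ[2];
    int nsucc = 0;

    if (color[i] == CFG_GREY)
        return CFG_LOOP;             /* back-edge to an insn on this path */
    if (color[i] == CFG_BLACK)
        return CFG_OK;               /* already fully explored */
    color[i] = CFG_GREY;

    if (prog[i].op != CFG_EXIT) {
        if (prog[i].op != CFG_JA) {  /* fall-through edge */
            if (i + 1 >= len)
                return CFG_NO_EXIT;
            succ[nsucc++] = i + 1;
        }
        if (prog[i].op != CFG_OTHER) {  /* jump edge */
            int t = i + 1 + prog[i].off;
            if (t < 0 || t >= len)
                return CFG_BAD_JUMP;
            succ[nsucc++] = t;
        }
    }
    for (int k = 0; k < nsucc; k++) {
        enum cfg_err err = cfg_visit(prog, len, succ[k], color);
        if (err != CFG_OK)
            return err;
    }
    color[i] = CFG_BLACK;
    return CFG_OK;
}

/* Returns CFG_OK only if the program is a DAG with every insn reachable. */
static enum cfg_err cfg_check(const struct cfg_insn *prog, int len)
{
    unsigned char color[CFG_MAX] = {0};

    assert(len > 0 && len <= CFG_MAX);
    enum cfg_err err = cfg_visit(prog, len, 0, color);
    if (err != CFG_OK)
        return err;
    for (int i = 0; i < len; i++)
        if (color[i] == CFG_WHITE)
            return CFG_UNREACHABLE;
    return CFG_OK;
}
```

A grey insn reached again means a back-edge (a loop); an insn still white after the search was never reached.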

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
0246e64d9a bpf: handle pseudo BPF_LD_IMM64 insn
eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions
to refer to process-local map_fd. Scan the program for such instructions and
if FDs are valid, convert them to 'struct bpf_map' pointers which will be used
by verifier to check access to maps in bpf_map_lookup/update() calls.
If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping
BPF_PSEUDO_MAP_FD flag.

Note that eBPF interpreter is generic and knows nothing about pseudo insns.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
cbd3570086 bpf: verifier (add ability to receive verification log)
add optional attributes for BPF_PROG_LOAD syscall:
union bpf_attr {
    struct {
	...
	__u32         log_level; /* verbosity level of eBPF verifier */
	__u32         log_size;  /* size of user buffer */
	__aligned_u64 log_buf;   /* user supplied 'char *buffer' */
    };
};

when log_level > 0 the verifier will return its verification log in the user
supplied buffer 'log_buf' which can be used by program author to analyze why
verifier rejected given program.

'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
provides several examples of these messages, like the program:

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
  BPF_EXIT_INSN(),

will be rejected with the following multi-line message in log_buf:

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 0
  4: (85) call 1
  5: (15) if r0 == 0x0 goto pc+1
   R0=map_ptr R10=fp
  6: (7a) *(u64 *)(r0 +4) = 0
  misaligned access off 4 size 8

The format of the output can change at any time as verifier evolves.
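
The final line of the log above ("misaligned access off 4 size 8") suggests a check that the access offset is a multiple of the access size. A minimal sketch of such a check, assuming exactly that rule (the name toy_is_misaligned is hypothetical and the actual kernel logic may differ):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed rule: an access of 'size' bytes at offset 'off' is aligned
 * only when 'off' is a multiple of 'size'. Hypothetical helper, not
 * a kernel function.
 */
static bool toy_is_misaligned(int off, int size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    return (off % size) != 0;
}
```

Under this rule the 8-byte store at r10 - 8 in insn 0 passes, while the 8-byte store at r0 + 4 in insn 6 is rejected.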

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
51580e798c bpf: verifier (add docs)
This patch adds all of the eBPF verifier documentation and an empty bpf_check().

The end goal for the verifier is to statically check safety of the program.

Verifier will catch:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention

More details in Documentation/networking/filter.txt

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
0a542a86d7 bpf: handle pseudo BPF_CALL insn
In native eBPF programs, userspace uses pseudo BPF_CALL instructions
which encode one of 'enum bpf_func_id' inside the insn->imm field.
The verifier checks that the program passes correct function arguments to the given func_id.
If all checks pass, the kernel needs to fix up the BPF_CALL->imm fields by
replacing func_id with the in-kernel function pointer.
The eBPF interpreter just calls the function.

In-kernel eBPF users continue to use generic BPF_CALL.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
09756af468 bpf: expand BPF syscall with program load/unload
eBPF programs are similar to kernel modules. They are loaded by the user
process and automatically unloaded when process exits. Each eBPF program is
a safe run-to-completion set of instructions. eBPF verifier statically
determines that the program terminates and is safe to execute.

The following syscall wrapper can be used to load the program:
int bpf_prog_load(enum bpf_prog_type prog_type,
                  const struct bpf_insn *insns, int insn_cnt,
                  const char *license)
{
    union bpf_attr attr = {
        .prog_type = prog_type,
        .insns = ptr_to_u64(insns),
        .insn_cnt = insn_cnt,
        .license = ptr_to_u64(license),
    };

    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}
where 'insns' is an array of eBPF instructions and 'license' is a string
that must be GPL compatible to call helper functions marked gpl_only

Upon successful load the syscall returns prog_fd.
Use close(prog_fd) to unload the program.

User space tests and examples follow in the later patches

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
db20fd2b01 bpf: add lookup/update/delete/iterate methods to BPF maps
'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps are accessed from user space via BPF syscall, which has commands:

- create a map with given type and attributes
  fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
  returns fd or negative error

- lookup key in a given map referenced by fd
  err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key, attr->value
  returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
  err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key, attr->value
  returns zero or negative error

- find and delete element by key in a given map
  err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key

- iterate map elements (based on input key return next_key)
  err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key, attr->next_key

- close(fd) deletes the map

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
749730ce42 bpf: enable bpf syscall on x64 and i386
done as separate commit to ease conflict resolution

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
99c55f7d47 bpf: introduce BPF syscall and maps
BPF syscall is a multiplexor for a range of different operations on eBPF.
This patch introduces syscall with single command to create a map.
Next patch adds commands to access maps.

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

Userspace example:
/* this syscall wrapper creates a map with given type and attributes
 * and returns map_fd on success.
 * use close(map_fd) to delete the map
 */
int bpf_create_map(enum bpf_map_type map_type, int key_size,
                   int value_size, int max_entries)
{
    union bpf_attr attr = {
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

'union bpf_attr' is backwards compatible with future extensions.

More details in Documentation/networking/filter.txt and in manpage

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Vincent Sanders
75c349062a ARM: 8153/1: Enable gcov support on the ARM architecture
Enable gcov support for ARM based on original patches by David
Singleton and George G. Davis

Riku - updated the patch to the current mainline kernel. The patch
was submitted in 2010 and 2012 - for symmetry, now in 2014 too.

https://lwn.net/Articles/390419/
http://marc.info/?l=linux-arm-kernel&m=133823081813044

v2: remove arch/arm/kernel from gcov disabled files

Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Signed-off-by: Vincent Sanders <vincent.sanders@collabora.co.uk>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2014-09-26 14:39:57 +01:00
Guenter Roeck
b63adb9795 kernel: add support for kernel restart handler call chain
Various drivers implement architecture and/or device specific means to
restart (reset) the system.  Various mechanisms have been implemented to
support those schemes.  The best known mechanism is arm_pm_restart, which
is a function pointer to be set either from platform specific code or from
drivers.  Another mechanism is to use hardware watchdogs to issue a reset;
this mechanism is used if there is no other method available to reset a
board or system.  Two examples are alim7101_wdt, which currently uses the
reboot notifier to trigger a reset, and moxart_wdt, which registers the
arm_pm_restart function.

The existing mechanisms have a number of drawbacks.  Typically only one
scheme to restart the system is supported (at least if arm_pm_restart is
used).  At least in theory there can be multiple means to restart the
system, some of which may be less desirable (for example one mechanism may
only reset the CPU, while another may reset the entire system).  Using
arm_pm_restart can also be racy if the function pointer is set from a
driver, as the driver may be in the process of being unloaded when
arm_pm_restart is called.  Using the reboot notifier is always racy, as it
is unknown if and when other functions using the reboot notifier have
completed execution by the time the watchdog fires.

Introduce a system restart handler call chain to solve the described
problems.  This call chain is expected to be executed from the
architecture specific machine_restart() function.  Drivers providing
system restart functionality (such as the watchdog drivers mentioned
above) are expected to register with this call chain.  By using the
priority field in the notifier block, callers can control restart handler
execution sequence and thus ensure that the restart handler with the
optimal restart capabilities for a given system is called first.
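
The priority ordering described above can be illustrated with a small user-space model of a call chain. This is a sketch under assumptions, not the kernel's notifier implementation: all toy_* names are hypothetical, and a handler returning nonzero is taken to mean "stop the chain".

```c
#include <assert.h>
#include <stddef.h>

struct toy_handler {
    int priority;                /* higher priority runs first */
    int (*call)(void *ctx);      /* nonzero return stops the chain */
    void *ctx;
    struct toy_handler *next;
};

/* Insert keeping the list sorted by descending priority. */
static void toy_register(struct toy_handler **head, struct toy_handler *h)
{
    while (*head && (*head)->priority >= h->priority)
        head = &(*head)->next;
    h->next = *head;
    *head = h;
}

/* Run handlers in order; returns how many were called. */
static int toy_call_chain(struct toy_handler *head)
{
    int calls = 0;

    for (; head; head = head->next) {
        calls++;
        if (head->call(head->ctx))
            break;
    }
    return calls;
}

/* Example handler: records its id, then reports whether it "handled" it. */
struct toy_log { int ids[8]; int n; };
struct toy_ctx { int id; int handled; struct toy_log *log; };

static int toy_restart(void *p)
{
    struct toy_ctx *c = p;

    assert(c->log->n < 8);
    c->log->ids[c->log->n++] = c->id;
    return c->handled;
}
```

Registering handlers with priorities 0, 128 and 255 in any order yields a chain that tries the priority-255 handler first and falls back to lower priorities only if it returns zero.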

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Heiko Stuebner <heiko@sntech.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jonas Jensen <jonas.jensen@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Tomasz Figa <t.figa@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-09-26 00:00:06 -07:00
Zefan Li
e756c7b698 Revert "cgroup: remove redundant variable in cgroup_mount()"
This reverts commit 0c7bf3e8ca.

If there are child cgroups in the cgroupfs and then we umount it,
the superblock will be destroyed but the cgroup_root will be kept
around. When we mount it again, cgroup_mount() will find this
cgroup_root and allocate a new sb for it.

So with this commit we will be trapped in a dead loop in the case
described above, because kernfs_pin_sb() keeps returning NULL.

Currently I don't see how we can avoid using both pinned_sb and
new_sb, so just revert it.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-26 00:16:23 -04:00
Linus Torvalds
f4cb707e7a ACPI and power management fixes for 3.17-rc7

Merge tag 'pm+acpi-3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management fixes from Rafael Wysocki:
 "These are regression fixes (ACPI hotplug, cpufreq, hibernation, ACPI
  LPSS driver), fixes for stuff that never worked correctly (ACPI GPIO
  support in some cases and a wrong sign of an error code in the ACPI
  core in one place), and one blacklist item for ACPI backlight
  handling.

  Specifics:

   - Revert of a recent hibernation core commit that introduced a NULL
     pointer dereference during resume for at least one user (Rafael J
     Wysocki).

   - Fix for the ACPI LPSS (Low-Power Subsystem) driver to disable
     asynchronous PM callback execution for LPSS devices during system
     suspend/resume (introduced in 3.16) which turns out to break
     ordering expectations on some systems.  From Fu Zhonghui.

   - cpufreq core fix related to the handling of sysfs nodes during
     system suspend/resume that has been broken for intel_pstate since
     3.15 from Lan Tianyu.

   - Restore the generation of "online" uevents for ACPI container
     devices that was removed in 3.14, but some user space utilities
     turn out to need them (Rafael J Wysocki).

   - The cpufreq core fails to release a lock in an error code path
     after changes made in 3.14.  Fix from Prarit Bhargava.

   - ACPICA and ACPI/GPIO fixes to make the handling of ACPI GPIO
     operation regions (which means AML using GPIOs) work correctly in
     all cases from Bob Moore and Srinivas Pandruvada.

   - Fix for a wrong sign of the ACPI core's create_modalias() return
     value in case of an error from Mika Westerberg.

   - ACPI backlight blacklist entry for ThinkPad X201s from Aaron Lu"

* tag 'pm+acpi-3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"
  gpio / ACPI: Use pin index and bit length
  ACPICA: Update to GPIO region handler interface.
  ACPI / platform / LPSS: disable async suspend/resume of LPSS devices
  cpufreq: release policy->rwsem on error
  cpufreq: fix cpufreq suspend/resume for intel_pstate
  ACPI / scan: Correct error return value of create_modalias()
  ACPI / video: disable native backlight for ThinkPad X201s
  ACPI / hotplug: Generate online uevents for ACPI containers
2014-09-25 15:25:52 -07:00
NeilBrown
cbbce82209 SCHED: add some "wait..on_bit...timeout()" interfaces.
In commit c1221321b7
   sched: Allow wait_on_bit_action() functions to support a timeout

I suggested that a "wait_on_bit_timeout()" interface would not meet my
need.  This isn't true - I was just over-engineering.

Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization.  If some other
use is ever found, it can be generalized or added later.

So this patch renames "private" to "timeout", meaning "stop waiting
when 'jiffies' reaches or passes 'timeout'",
and adds two of the many possible wait..bit..timeout() interfaces:

wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example.  Others can be added as needed.
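
The "reaches or passes" test has to survive the tick counter wrapping around. A minimal user-space sketch, written in the style of the kernel's time_after_eq() comparison; the toy_* names, the fake clock and the -1 return value are assumptions for illustration only:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wrap-safe "now has reached or passed timeout". */
static bool toy_timed_out(unsigned long now, unsigned long timeout)
{
    return (long)(now - timeout) >= 0;
}

/*
 * Poll until 'bit' clears in *word or the deadline passes.
 * Returns 0 if the bit cleared, -1 on timeout. *fake_jiffies is a
 * stand-in clock that advances by one tick per poll.
 */
static int toy_wait_on_bit_timeout(const unsigned long *word, int bit,
                                   unsigned long *fake_jiffies,
                                   unsigned long timeout)
{
    assert(bit >= 0 && bit < (int)(sizeof(*word) * CHAR_BIT));
    while (*word & (1UL << bit)) {
        if (toy_timed_out(*fake_jiffies, timeout))
            return -1;
        (*fake_jiffies)++;       /* stand-in for sleeping one tick */
    }
    return 0;
}
```

Comparing the signed difference rather than the raw values keeps the result correct when the deadline lies just past the point where the counter wraps to zero.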

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:23:57 -04:00
Zefan Li
2ad654bc5e cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
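
The difference between a plain read-modify-write on a shared flags word and an atomic bit operation can be sketched with C11 atomics. This is a user-space analogy only: the toy_* names and bit numbers are hypothetical, and the kernel uses its own bitops rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers for illustration. */
enum { TOY_SPREAD_PAGE = 1, TOY_SPREAD_SLAB = 2 };

struct toy_task {
    atomic_ulong atomic_flags;   /* may be modified by other threads */
};

/* Each update is one indivisible read-modify-write, so concurrent
 * updates of different bits cannot overwrite each other. */
static void toy_set_flag(struct toy_task *t, int bit)
{
    assert(bit >= 0 && bit < 32);
    atomic_fetch_or(&t->atomic_flags, 1UL << bit);
}

static void toy_clear_flag(struct toy_task *t, int bit)
{
    assert(bit >= 0 && bit < 32);
    atomic_fetch_and(&t->atomic_flags, ~(1UL << bit));
}

static bool toy_test_flag(struct toy_task *t, int bit)
{
    assert(bit >= 0 && bit < 32);
    return (atomic_load(&t->atomic_flags) & (1UL << bit)) != 0;
}
```

With a non-atomic `flags |= bit`, a concurrent `flags &= ~other` on the same word can lose one of the two updates, which matches the corruption described in the report.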

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Fixes: 950592f7b9 ("cpusets: update tasks' page/slab spread flags in time")
Cc: <stable@vger.kernel.org> # 2.6.31+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-24 22:16:06 -04:00
Tomeu Vizoso
7990da71eb PM / QoS: Add PM_QOS_MEMORY_BANDWIDTH class
Also adds a class type PM_QOS_SUM that aggregates the values by summing them.

It can be used by memory controllers to calculate the optimum clock frequency
based on the bandwidth needs of the different memory clients.
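
Summing, as opposed to taking the minimum or maximum of the outstanding requests, can be sketched as below. The toy_* names are hypothetical and the sketch ignores the locking and list handling a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>

enum toy_qos_type { TOY_QOS_MIN, TOY_QOS_MAX, TOY_QOS_SUM };

/* Combine n requests according to the class type; an empty request
 * list yields default_value. */
static int toy_qos_aggregate(enum toy_qos_type type, const int *req,
                             size_t n, int default_value)
{
    if (n == 0)
        return default_value;
    assert(req != NULL);

    int value = req[0];
    for (size_t i = 1; i < n; i++) {
        switch (type) {
        case TOY_QOS_MIN:
            if (req[i] < value)
                value = req[i];
            break;
        case TOY_QOS_MAX:
            if (req[i] > value)
                value = req[i];
            break;
        case TOY_QOS_SUM:
            value += req[i];     /* total bandwidth wanted by all clients */
            break;
        }
    }
    return value;
}
```

For bandwidth, the sum is the natural aggregate: three clients asking for 100, 250 and 50 units together need 400, not 250.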

Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-25 01:18:33 +02:00
Rafael J. Wysocki
5c4dd348af Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"
Revert commit 6efde38f07 (PM / Hibernate: Iterate over set bits
instead of PFNs in swsusp_free()) that introduced a NULL pointer
dereference during system resume from hibernation:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
PGD b39c2067 PUD b39c1067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: <irrelevant list of modules>
CPU: 1 PID: 4898 Comm: s2disk Tainted: G         C     3.17-rc5-amd64 #1 Debian 3.17~rc5-1~exp1
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
task: ffff88023155ea40 ti: ffff8800b3b14000 task.ti: ffff8800b3b14000
RIP: 0010:[<ffffffff810a8cc1>]  [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
RSP: 0018:ffff8800b3b17ea8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800b39bab00 RCX: 0000000000000001
RDX: ffff8800b39bab10 RSI: ffff8800b39bab00 RDI: 0000000000000000
RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8800b39bab10 R11: 0000000000000246 R12: ffffea0000000000
R13: ffff880232f485a0 R14: ffff88023ac27cd8 R15: ffff880232927590
FS:  00007f406d83b700(0000) GS:ffff88023bc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000b3a62000 CR4: 00000000000007e0
Stack:
 ffff8800b39bab00 0000000000000010 ffff880232927590 ffffffff810acb4a
 ffff8800b39bab00 ffffffff811a955a ffff8800b39bab10 0000000000000000
 ffff88023155f098 ffffffff81a6b8c0 ffff88023155ea40 0000000000000007
Call Trace:
 [<ffffffff810acb4a>] ? snapshot_release+0x2a/0xb0
 [<ffffffff811a955a>] ? __fput+0xca/0x1d0
 [<ffffffff81080627>] ? task_work_run+0x97/0xd0
 [<ffffffff81012d89>] ? do_notify_resume+0x69/0xa0
 [<ffffffff8151452a>] ? int_signal+0x12/0x17
Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 54 48 8b 05 ba 62 9c 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d a1 62 9c 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40
RIP  [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
 RSP <ffff8800b3b17ea8>
CR2: 0000000000000000
---[ end trace f02be86a1ec0cccb ]---

due to forbidden_pages_map being NULL in swsusp_free().

Fixes: 6efde38f07 "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"
Reported-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-25 00:59:54 +02:00
David S. Miller
4daaab4f0c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-09-24 16:48:32 -04:00
Tejun Heo
2aad2a86f6 percpu_ref: add PERCPU_REF_INIT_* flags
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
* PERCPU_REF_INIT_DEAD		: start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-09-24 13:31:50 -04:00
Tejun Heo
d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit will be reverted and patches implementing a proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Jiri Olsa
802c8a61d4 Revert "perf: Do not allow optimized switch for non-cloned events"
This reverts commit 1f9a7268c6.

With the fix of the initial state for the cloned event we now correctly
handle the error described in:

  1f9a7268c6 perf: Do not allow optimized switch for non-cloned events

so we can revert it.

I made an automated test for this, but it's not suitable for the automated
perf tests framework. It needs to be customized for each machine (the
more CPUs, the higher the numbers for GROUPS/WORKERS/BYTES) and it could
take a long time to hit the issue.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140910143535.GD2409@krava.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:13 +02:00
Jiri Olsa
1929def9e6 perf: Fix child event initial state setup
Currently we initialize the child event based on the original
parent state. This is wrong, because the original parent event
(and its state) is not related to current fork and also could
be already gone.

We need to initialize the child state based on the immediate
parent event state.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1410520708-19275-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:12 +02:00
Jiri Olsa
dc633982ff perf: Do not POLLHUP event if it has children
Currently we return POLLHUP in event polling if the monitored
process is done, but we didn't consider possible children,
that might be still running and producing data.

Before returning POLLHUP, make sure that:

   1) the monitored task has exited and that
   2) we don't have any children to monitor

Also add a parent wakeup when the child event is gone.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410520708-19275-1-git-send-email-jolsa@kernel.org
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:48:11 +02:00
Kirill Tkhai
8aa6f0ebf4 sched/rt: Use resched_curr() in task_tick_rt()
Some time ago PREEMPT_NEED_RESCHED was implemented,
so the rescheduling technique is a little more complicated now.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183642.11015.66039.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:12 +02:00
Kirill Tkhai
f1e3a0932f sched: Use rq->rd in sched_setaffinity() under RCU read lock
Probability of use-after-free isn't zero in this place.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:11 +02:00
Kirill Tkhai
16303ab2fe sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
Nothing is locked there, so the label's name only confuses the reader.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140922183630.11015.59500.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:10 +02:00
Kirill Tkhai
66339c31bc sched: Use dl_bw_of() under RCU read lock
dl_bw_of() dereferences rq->rd which has to have RCU read lock held.
Probability of use-after-free isn't zero here.

Also add lockdep assert into dl_bw_cpus().

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:09 +02:00
Kirill Tkhai
7a96c231ca sched/fair: Remove duplicate code from can_migrate_task()
Combine two branches which do the same.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:07 +02:00
Peter Zijlstra
c55f5158f5 sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
Kirill found that there's a subtle race in the
__ARCH_WANT_UNLOCKED_CTXSW code, and instead of fixing it, remove the
entire exception because neither arch that uses it seems to actually
still require it.

Boot tested on mips64el (qemu) only.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kirill Tkhai <tkhai@yandex.ru>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Qais Yousef <qais.yousef@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: oleg@redhat.com
Cc: linux@roeck-us.net
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Link: http://lkml.kernel.org/r/20140923150641.GH3312@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:05 +02:00
Oleg Nesterov
5bd96ab6fe sched: print_rq(): Don't use tasklist_lock
read_lock_irqsave(tasklist_lock) in print_rq() looks strange. We do
not need to disable irqs, and they are already disabled by the caller.

And afaics this lock buys nothing, we can rely on rcu_read_lock().
In this case it makes sense to also move rcu_read_lock/unlock from
the caller to print_rq().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921193341.GA28628@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:04 +02:00
Oleg Nesterov
3472eaa1f1 sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
1. read_lock(tasklist_lock) does not need to disable irqs.

2. ->mm != NULL is a common mistake, use PF_KTHREAD.

3. The second ->mm check can be simply removed.

4. task_rq_lock() looks better than raw_spin_lock(&p->pi_lock) +
   __task_rq_lock().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921193338.GA28621@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:03 +02:00
Oleg Nesterov
8651c65844 sched: Fix the task-group check in tg_has_rt_tasks()
tg_has_rt_tasks() wants to find an RT task in this task_group, but
task_rq(p)->rt.tg wrongly checks the root rt_rq.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: http://lkml.kernel.org/r/20140921193336.GA28618@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:00 +02:00
Nicolas Pitre
83a0a96a5f sched/fair: Leverage the idle state info when choosing the "idlest" cpu
The code in find_idlest_cpu() looks for the CPU with the smallest load.
However, if multiple CPUs are idle, the first idle CPU is selected
irrespective of the depth of its idle state.

Among the idle CPUs we should pick the one with the shallowest idle
state, or the latest to have gone idle if all idle CPUs are in the same
state.  The latter applies even when cpuidle is configured out.
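
The selection rule described above can be sketched in plain C. This is a
minimal userspace illustration, not the kernel code: struct cpu_snapshot,
its fields and pick_idlest_cpu() are hypothetical names, and the exit
latency is assumed to stand in for the depth of the idle state.

```c
#include <stddef.h>

/* Hypothetical per-CPU snapshot; field names are illustrative only. */
struct cpu_snapshot {
	int idle;                      /* non-zero if the CPU is idle */
	unsigned int exit_latency;     /* assumed proxy for idle state depth */
	unsigned long long idle_stamp; /* time at which the CPU went idle */
	unsigned long load;            /* load, used only for busy CPUs */
};

/*
 * Prefer the idle CPU in the shallowest idle state; break ties by
 * picking the one that went idle most recently. If no CPU is idle,
 * fall back to the least loaded CPU. Returns -1 when n is 0.
 */
int pick_idlest_cpu(const struct cpu_snapshot *cpus, size_t n)
{
	int shallowest_idle = -1, least_loaded = -1;
	unsigned int min_exit_latency = 0;
	unsigned long long latest_idle_stamp = 0;
	unsigned long min_load = 0;

	for (size_t i = 0; i < n; i++) {
		const struct cpu_snapshot *c = &cpus[i];

		if (c->idle) {
			if (shallowest_idle == -1 ||
			    c->exit_latency < min_exit_latency) {
				/* First idle CPU seen, or a shallower state. */
				min_exit_latency = c->exit_latency;
				latest_idle_stamp = c->idle_stamp;
				shallowest_idle = (int)i;
			} else if (c->exit_latency == min_exit_latency &&
				   c->idle_stamp > latest_idle_stamp) {
				/* Same depth: prefer the most recently idled CPU. */
				latest_idle_stamp = c->idle_stamp;
				shallowest_idle = (int)i;
			}
		} else if (least_loaded == -1 || c->load < min_load) {
			min_load = c->load;
			least_loaded = (int)i;
		}
	}

	return shallowest_idle != -1 ? shallowest_idle : least_loaded;
}
```

When cpuidle is configured out, every idle CPU would report the same
latency in this model, so only the timestamp comparison decides.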

This patch doesn't cover the following issues:

- The idle exit latency of a CPU might be larger than the time needed
  to migrate the waking task to an already running CPU with sufficient
  capacity, and therefore performance would benefit from task packing
  in such case (in most cases task packing is about power saving).

- Some idle states have a non-negligible and non-abortable entry latency
  which needs to run to completion before the exit latency can start.
  A concurrent patch series is making this info available to the cpuidle
  core.  Once available, the entry latency with the idle timestamp could
  determine when the exit latency may be effective.

Those issues will be handled in due course.  In the meantime, what
is implemented here should improve things already compared to the current
state of affairs.

Based on an initial patch from Daniel Lezcano.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:46:59 +02:00
Daniel Lezcano
442bf3aaf5 sched: Let the scheduler see CPU idle states
When the cpu enters idle, it stores the cpuidle state pointer in its
struct rq instance which in turn could be used to make a better decision
when balancing tasks.

As soon as the cpu exits its idle state, the struct rq reference is
cleared.

There are a couple of situations where the idle state pointer could be changed
while it is being consulted:

1. For x86/acpi with dynamic c-states, when a laptop switches from battery
   to AC that could result in removing the deeper idle state. The acpi driver
   triggers:
	'acpi_processor_cst_has_changed'
		'cpuidle_pause_and_lock'
			'cpuidle_uninstall_idle_handler'
				'kick_all_cpus_sync'.

All cpus will exit their idle state and the pointed object will be set to
NULL.

2. The cpuidle driver is unloaded. Logically that could happen but not
in practice because the drivers are always compiled in and 95% of them are
not coded to unregister themselves.  In any case, the unloading code must
call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
leading to 'kick_all_cpus_sync' as mentioned above.

A race can happen if we use the pointer and then one of these two scenarios
occurs at the same moment.

In order to be safe, the idle state pointer stored in the rq must be
used inside a rcu_read_lock section where we are protected with the
'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
idle_get_state() and idle_put_state() accessors should be used to that
effect.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:46:58 +02:00
Juri Lelli
91ec6778ec sched/deadline: Fix inter- exclusive cpusets migrations
Users can perform clustered scheduling using the cpuset facility.
After an exclusive cpuset is created, task migrations happen only
between CPUs belonging to the same cpuset. Inter- cpuset migrations
can only happen when the user requires so, moving a task between
different cpusets. This behaviour is broken in SCHED_DEADLINE, as
currently spurious inter- cpuset migration may happen without user
intervention.

This patch fixes the problem (and shuffles the code a bit to improve
clarity).

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: raistlin@linux.it
Cc: michael@amarulasolutions.com
Cc: fchecconi@gmail.com
Cc: daniel.wagner@bmw-carit.de
Cc: vincent@legout.info
Cc: luca.abeni@unitn.it
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:46:57 +02:00
Juri Lelli
a5e7be3b28 sched/deadline: Clear dl_entity params when setscheduling to different class
When a task is using SCHED_DEADLINE and the user setschedules it to a
different class its sched_dl_entity static parameters are not cleaned
up. This causes a bug if the user sets it back to SCHED_DEADLINE with
the same parameters again.  The problem resides in the check we
perform at the very beginning of dl_overflow():

	if (new_bw == p->dl.dl_bw)
		return 0;

This condition is met in the case depicted above, so the function
returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not
added to it). After this, admission control is broken.

This patch fixes the thing, properly clearing static parameters for a
task that ceases to use SCHED_DEADLINE.
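
The failure mode can be reproduced with a toy model of the bandwidth
bookkeeping. This is only a sketch under simplifying assumptions: the
toy_* names are hypothetical, the overflow test itself is left out, and
only the early-return check quoted above is mirrored.

```c
/* Toy model of the bookkeeping; every name here is hypothetical. */
struct toy_task {
	unsigned long long dl_bw; /* static bandwidth parameter of the task */
	int is_deadline;          /* non-zero while in the deadline class */
};

struct toy_root {
	unsigned long long total_bw; /* sum of admitted bandwidths */
};

/* Switch the task to the deadline class with bandwidth new_bw. */
void toy_admit(struct toy_root *rd, struct toy_task *p,
	       unsigned long long new_bw)
{
	int was_deadline = p->is_deadline;

	p->is_deadline = 1;
	if (new_bw == p->dl_bw)
		return; /* mirrors the early return: total_bw is untouched */

	if (was_deadline)
		rd->total_bw -= p->dl_bw;
	rd->total_bw += new_bw;
	p->dl_bw = new_bw;
}

/*
 * Move the task to another class. With clear_params == 0 the stale
 * dl_bw survives, which models the behaviour before the fix.
 */
void toy_leave(struct toy_root *rd, struct toy_task *p, int clear_params)
{
	if (p->is_deadline)
		rd->total_bw -= p->dl_bw;
	p->is_deadline = 0;
	if (clear_params)
		p->dl_bw = 0;
}
```

In this model, admitting a task with bandwidth 100, calling
toy_leave() with clear_params == 0 and admitting it again with 100
leaves total_bw at 0 although a deadline task is running. With
clear_params == 1 the second admission restores total_bw to 100.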

Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Tested-by: Luca Abeni <luca.abeni@unitn.it>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:46:56 +02:00
Oleg Nesterov
be34f0f3e6 sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
current->state == TASK_DEAD means that the task is doing its
last schedule(), page fault is obviously impossible at this
stage.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921194743.GA30114@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 09:35:05 +02:00
Richard Guy Briggs
9eab339b19 audit: get comm using lock to avoid race in string printing
When task->comm is passed directly to audit_log_untrustedstring() without
getting a copy or using the task_lock, there is a race that could happen that
would output a NUL (\0) in the output string that would effectively truncate
the rest of the report text after the comm= field in the audit, losing fields.

Use get_task_comm() to get a copy while acquiring the task_lock to prevent
this and to prevent the result from being a mixture of old and new values of
comm.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:56 -04:00
Richard Guy Briggs
f874738e8c audit: remove open_arg() function that is never used
open_arg() was added in commit 55669bfa "audit: AUDIT_PERM support"
and never used.  Remove it.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:55 -04:00
Richard Guy Briggs
9ef9151477 audit: correct AUDIT_GET_FEATURE return message type
When an AUDIT_GET_FEATURE message is sent from userspace to the kernel, it
should reply with a message tagged as an AUDIT_GET_FEATURE type with a struct
audit_feature.  The current reply is a message tagged as an AUDIT_GET
type with a struct audit_feature.

This appears to have been a cut-and-paste-eo in commit b0fed40.

Reported-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:55 -04:00
Richard Guy Briggs
54e05eddbe audit: set nlmsg_len for multicast messages.
Report:
	Looking at your example code in
	http://people.redhat.com/rbriggs/audit-multicast-listen/audit-multicast-listen.c,
	it seems that nlmsg_len field in the received messages is supposed to
	contain the length of the header + payload, but it is always set to the
	size of the header only, i.e. 16. The example program works, because
	the printf format specifies the minimum width, not "precision", so it
	simply prints out the payload until the first zero byte. This isn't too
	much of a problem, but precludes the use of recvmmsg, iiuc?

	(gdb) p *(struct nlmsghdr*)nlh
	$14 = {nlmsg_len = 16, nlmsg_type = 1100, nlmsg_flags = 0, nlmsg_seq = 0, nlmsg_pid = 9910}

The only time nlmsg_len would have been updated was at audit_buffer_alloc()
inside audit_log_start() and never updated after.  It should arguably be done
in audit_log_vformat(), but would be more efficient in audit_log_end().
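
A minimal userspace sketch of what the report expects: the length field
covers header plus payload once the message is finalized. The toy_*
types and functions are hypothetical stand-ins (a 16-byte header with
the same field layout as the one printed by gdb above), not the audit
code itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 16-byte netlink-style header. */
struct toy_nlmsghdr {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

#define TOY_HDRLEN ((uint32_t)sizeof(struct toy_nlmsghdr))

struct toy_msg {
	struct toy_nlmsghdr hdr;
	char payload[240];
	size_t used; /* payload bytes written so far */
};

/* At allocation time only the header length is known. */
void toy_msg_start(struct toy_msg *m)
{
	memset(m, 0, sizeof(*m));
	m->hdr.nlmsg_len = TOY_HDRLEN;
}

/* Append text to the payload; returns the number of bytes copied. */
size_t toy_msg_append(struct toy_msg *m, const char *s)
{
	size_t n = strlen(s);
	size_t room = sizeof(m->payload) - m->used;

	if (n > room)
		n = room;
	memcpy(m->payload + m->used, s, n);
	m->used += n;
	return n;
}

/* Finalize once, after all appends: header plus payload. */
void toy_msg_end(struct toy_msg *m)
{
	m->hdr.nlmsg_len = TOY_HDRLEN + (uint32_t)m->used;
}
```

Updating the length once in the finalizing step, rather than on every
append, is the efficiency argument made above.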

Reported-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:54 -04:00
Richard Guy Briggs
219ca39427 audit: use union for audit_field values since they are mutually exclusive
Since only one of val, uid, gid and lsm* are used at any given time, combine
them to reduce the size of the struct audit_field.
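
As a rough illustration of the size saving, the two layouts below can be
compared with sizeof. The structs are hypothetical simplifications (plain
integer and pointer types, invented member names), not the real struct
audit_field.

```c
#include <stdint.h>

/* Hypothetical layout with every value stored side by side. */
struct toy_field_separate {
	uint32_t type;
	uint32_t val;
	uint32_t uid;
	uint32_t gid;
	uint32_t lsm_type;
	char *lsm_str;
	void *lsm_rule;
	uint32_t op;
};

/* Same information, with the mutually exclusive values overlaid. */
struct toy_field_union {
	uint32_t type;
	union {
		uint32_t val;
		uint32_t uid;
		uint32_t gid;
		struct {
			char *str;
			void *rule;
		} lsm;
	} u;
	uint32_t op;
};
```

Since only one member of the union is meaningful for a given type, the
type field acts as the discriminant that tells readers which one to use.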

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:53 -04:00
Burn Alting
e7df61f4d1 audit: invalid op= values for rules
Various audit events dealing with adding, removing and updating rules result in
invalid values set for the op keys which result in embedded spaces in op=
values.

The invalid values are
        op="add rule"       set in kernel/auditfilter.c
        op="remove rule"    set in kernel/auditfilter.c
        op="remove rule"    set in kernel/audit_tree.c
        op="updated rules"  set in kernel/audit_watch.c
        op="remove rule"    set in kernel/audit_watch.c

Replace the space in the above values with an underscore character ('_').
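
The transformation itself is simple. The helper below is a hypothetical
illustration of it; it is shown only to make the rule concrete and is
not meant to describe how the strings are changed in the kernel sources.

```c
/* Replace every space in an op= value with an underscore, in place. */
void toy_op_normalize(char *op)
{
	for (; *op; op++) {
		if (*op == ' ')
			*op = '_';
	}
}
```

For example, a buffer holding "add rule" becomes "add_rule", so a parser
that splits records on spaces sees a single op= value.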

Coded-by: Burn Alting <burn@swtf.dyndns.org>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:53 -04:00
Richard Guy Briggs
01478d7d60 audit: use atomic_t to simplify audit_serial()
Since there is already a primitive to do this operation in the atomic_t, use it
to simplify audit_serial().
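
A userspace analogue using C11 atomics, as a sketch only: the kernel's
atomic_t API is different, and toy_serial and toy_audit_serial() are
hypothetical names.

```c
#include <stdatomic.h>

/* Zero-initialized because of static storage duration. */
static atomic_uint toy_serial;

/*
 * Return a fresh serial number. atomic_fetch_add() returns the old
 * value, so adding 1 yields the value after the increment.
 */
unsigned int toy_audit_serial(void)
{
	return atomic_fetch_add(&toy_serial, 1) + 1;
}
```

Two consecutive calls return consecutive numbers without any explicit
lock around the counter.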

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:52 -04:00
Fabian Frederick
6eed9b2613 kernel/audit.c: use ARRAY_SIZE instead of sizeof/sizeof[0]
Use kernel.h definition.
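
For readers outside the kernel tree, the idiom amounts to the macro
below. TOY_ARRAY_SIZE and toy_features are hypothetical names for this
sketch; the kernel's own ARRAY_SIZE additionally rejects non-array
arguments at build time.

```c
#include <stddef.h>

/* Number of elements in a true array (not a pointer). */
#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *const toy_features[] = { "first", "second" };

size_t toy_feature_count(void)
{
	return TOY_ARRAY_SIZE(toy_features);
}
```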

Cc: Eric Paris <eparis@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:51 -04:00
Richard Guy Briggs
691e6d59d2 audit: reduce scope of audit_log_fcaps
audit_log_fcaps() isn't used outside kernel/audit.c.  Reduce its scope.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:51 -04:00
Richard Guy Briggs
c0a8d9b069 audit: reduce scope of audit_net_id
audit_net_id isn't used outside kernel/audit.c.  Reduce its scope.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-09-23 16:37:50 -04:00
Richard Guy Briggs
b4f0d3755c audit: x86: drop arch from __audit_syscall_entry() interface
Since the arch is found locally in __audit_syscall_entry(), there is no need to
pass it in as a parameter.  Delete it from the parameter list.

x86* was the only arch to call __audit_syscall_entry() directly and did so from
assembly code.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-audit@redhat.com
Signed-off-by: Eric Paris <eparis@redhat.com>

---

As this patch relies on changes in the audit tree, I think it
appropriate to send it through my tree rather than the x86 tree.
2014-09-23 16:21:28 -04:00
Richard Guy Briggs
84db564aad audit: add arch field to seccomp event log
The AUDIT_SECCOMP record looks something like this:

type=SECCOMP msg=audit(1373478171.953:32775): auid=4325 uid=4325 gid=4325 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0 pid=12381 comm="test" sig=31 syscall=231 compat=0 ip=0x39ea8bca89 code=0x0

In order to determine what syscall 231 maps to, we need to have the arch= field right before it.

To see the event, compile this test.c program:

=====
int main(void)
{
        return seccomp_load(seccomp_init(SCMP_ACT_KILL));
}
=====

gcc -g test.c -o test -lseccomp

After running the program, find the record by:  ausearch --start recent -m SECCOMP -i

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2014-09-23 16:21:26 -04:00
Richard Guy Briggs
4a99854c58 audit: __audit_syscall_entry: ignore arch arg and call syscall_get_arch() directly
Since every arch should have syscall_get_arch() defined, stop using the
function argument and just collect this ourselves.  We do not drop the
argument as fixing some code paths (in assembly) to not pass this first
argument is non-trivial.  The argument will be dropped when that is
fixed.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2014-09-23 16:21:26 -04:00
David S. Miller
1f6d80358d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/mips/net/bpf_jit.c
	drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:09:27 -04:00
Linus Torvalds
324c7b62d0 Merge branch 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "One late fix for cgroup.

  I was waiting for another set of fixes for a long-standing obscure
  cpuset bug but am not sure whether they'll be ready before v3.17
  release.  This one is a simple fix for a mutex unlock balance bug in
  an allocation failure path in pidlist_array_load().

  The bug was introduced in v3.14 and the fix is tagged for -stable"

* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix unbalanced locking
2014-09-23 09:06:18 -07:00
Josh Triplett
3cf6b0151b Merge branches 'tiny/bloat-o-meter-no-SyS', 'tiny/more-procless', 'tiny/no-advice', 'tiny/tinyconfig' and 'tiny/x86-boot-compressed-use-yn' into tiny/next 2014-09-22 23:14:40 -07:00
Ingo Molnar
6273143359 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v3.18 RCU changes from Paul E. McKenney:

"
  * Update RCU documentation.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/378.

  * Miscellaneous fixes.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/386.  An additional fix that
    eliminates a documented (but now inconvenient) deadlock between
    RCU hotplug and expedited grace periods was posted at
    https://lkml.org/lkml/2014/8/28/573.

  * Changes related to No-CBs CPUs and NO_HZ_FULL.  These were posted
    to LKML at https://lkml.org/lkml/2014/8/28/412.

  * Torture-test updates.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/546 and at
    https://lkml.org/lkml/2014/9/11/1114.

  * RCU-tasks implementation.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/540.
"

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-23 07:21:42 +02:00
Tomasz Figa
a4a8c2c496 ARM: exynos: Move to generic PM domain DT bindings
This patch moves Exynos PM domain code to use the new generic PM domain
look-up framework introduced in previous patches, thus also allowing
the new code to be compiled with CONFIG_ARCH_EXYNOS.

This patch was originally submitted by Tomasz Figa when he was employed
by Samsung.

Link: http://marc.info/?l=linux-pm&m=139955336002083&w=2
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Kevin Hilman <khilman@linaro.org>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-22 15:57:40 +02:00
Tomasz Figa
aa42240ab2 PM / Domains: Add generic OF-based PM domain look-up
This patch introduces generic code to perform PM domain look-up using
device tree and automatically bind devices to their PM domains.

Generic device tree bindings are introduced to specify PM domains of
devices in their device tree nodes.

Backwards compatibility with legacy Samsung-specific PM domain bindings
is provided, but for now the new code is not compiled when
CONFIG_ARCH_EXYNOS is selected to avoid collision with legacy code.
This will change as soon as the Exynos PM domain code gets converted to
use the generic framework in further patch.

This patch was originally submitted by Tomasz Figa when he was employed
by Samsung.

Link: http://marc.info/?l=linux-pm&m=139955349702152&w=2
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Tested-by: Philipp Zabel <p.zabel@pengutronix.de>
Reviewed-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-22 15:57:40 +02:00
Todd E Brandt
0cadc70282 PM / sleep: new suspend_resume trace event for console resume
This patch adds another suspend_resume trace event for analyze_suspend
to capture. The resume_console call can take several hundred milliseconds
if the printk buffer is full of debug info. The tool will now inform
testers of the wasted time and encourage them to disable it in
production builds.

Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-22 14:53:23 +02:00
Zhihui Zhang
9c58c79a8a sched: Clean up some typos and grammatical errors in code/comments
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1411262676-19928-1-git-send-email-zzhsuny@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-21 09:00:02 +02:00
Zefan Li
0c7bf3e8ca cgroup: remove redundant variable in cgroup_mount()
Both pinned_sb and new_sb indicate if a new superblock is needed,
so we can just remove new_sb.

Note now we must check if kernfs_tryget_sb() returns NULL, because
when it returns NULL, kernfs_mount() may still re-use an existing
superblock, which is just allocated by another concurrent mount.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-20 13:09:35 -04:00
Zefan Li
3e2cd91ab9 cgroup: fix missing unlock in cgroup_release_agent()
The patch 971ff49355: "cgroup: use a per-cgroup work for release
agent" from Sep 18, 2014, leads to the following static checker
warning:

	kernel/cgroup.c:5310 cgroup_release_agent()
	warn: 'mutex:&cgroup_mutex' is sometimes locked here and sometimes unlocked.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-20 12:23:35 -04:00
Linus Torvalds
598a0c7d09 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Two kernel side fixes: a kprobes fix and a perf_remove_from_context()
  fix (which does not yet fix the migration bug which is WIP)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix a race condition in perf_remove_from_context()
  kprobes/x86: Free 'optinsn' cache when range check fails
2014-09-19 10:31:36 -07:00
Zefan Li
a25eb52e81 cgroup: remove CGRP_RELEASABLE flag
We call put_css_set() after setting CGRP_RELEASABLE flag in
cgroup_task_migrate(), but in other places we call it without setting
the flag. I don't see the necessity of this flag.

Moreover once the flag is set, it will never be cleared, unless writing
to the notify_on_release control file, so it can be quite confusing
if we look at the output of debug.releasable.

  # mount -t cgroup -o debug xxx /cgroup
  # mkdir /cgroup/child
  # cat /cgroup/child/debug.releasable
  0   <-- shows 0 though the cgroup is empty
  # echo $$ > /cgroup/child/tasks
  # cat /cgroup/child/debug.releasable
  0
  # echo $$ > /cgroup/tasks && echo $$ > /cgroup/child/tasks
  # cat /cgroup/child/debug.releasable
  1   <-- shows 1 though the cgroup is not empty

This patch removes the flag, and now debug.releasable shows if the
cgroup is empty or not.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-19 09:29:32 -04:00
Zefan Li
4e2ba65068 perf/cgroup: Remove perf_put_cgroup()
Commit 5a17f543ed ("cgroup: improve css_from_dir() into css_tryget_from_dir()")
removed perf_tryget_cgroup(), so let's also remove perf_put_cgroup().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-19 09:17:47 -04:00
Vincent Guittot
bd61c98f9b sched: Test the CPU's capacity in wake_affine()
Currently the task always wakes affine on this_cpu if the latter is idle.
Before waking up the task on this_cpu, we check that this_cpu capacity is not
significantly reduced because of RT tasks or irq activity.

Use cases where the number of irqs and/or the time spent handling irqs is
significant will benefit from this, because the task that is woken up by an
irq or softirq will not use the same CPU as the irq (and softirq) but an idle one.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:28 +02:00
Vincent Guittot
26bc3c50d3 sched: Allow all architectures to set 'capacity_orig'
'capacity_orig' is only changed for systems with an SMT sched_domain level in order
to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an
original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_capacity that can be also used by
non SMT platforms to set capacity_orig.

The __weak implementation of arch_scale_cpu_capacity() is the previous SMT variant,
in order to keep backward compatibility with the use of capacity_orig.

arch_scale_smt_capacity() and default_scale_smt_capacity() have been removed as
they were not used elsewhere than in arch_scale_cpu_capacity().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
[ Added default_scale_cpu_capacity() back. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:27 +02:00
Vincent Guittot
65fdac08c2 sched: Fix avg_load computation
The computation of avg_load and avg_load_per_task should only take into
account the number of CFS tasks. The non-CFS tasks are already taken into
account by decreasing the CPU's capacity and they will be tracked in the
CPU's utilization (group_utilization) of the next patches.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:26 +02:00
Vincent Guittot
05bfb65f52 sched: Remove a wake_affine() condition
In wake_affine() I have tried to understand the meaning of the condition:

 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)

but I failed to find a use case that can take advantage of it and I haven't
found a clear description in the previous commit's log.

Furthermore, the comment of the condition refers to the task_hot function that
was used before being replaced by the current condition:

/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more deeply at the condition below:

 this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have:

 tl_per_task = runnable_load_avg / nr_running
 this_load = max(runnable_load_avg, cpuload[idx])
 target_load =  max(runnable_load_avg', cpuload'[idx])

It implies that runnable_load_avg' == 0 and nr_running <= 1 in order to match the
condition. This implies that runnable_load_avg == 0 too because of the
condition: this_load <= load.

but if this _load is null, 'balanced' is already set and the test is redundant.

If sync is set, it's not as straightforward as above (especially if cgroups
are involved) but the policy should be similar as we have removed a task that's
going to sleep in order to get more accurate load and this_load values.

The current conclusion is that these additional conditions don't give any benefit
so we can remove them.
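
The sync-clear argument above can be checked by brute force over small values. This is a hedged sketch, not kernel code: the helper name is hypothetical, 'load' is modeled as the minimum of the previous CPU's two values (an assumption about how the source load is estimated), and tl_per_task is taken as 0 when nr_running is 0.

```c
#include <assert.h>

static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }
static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }

/*
 * Count the combinations (all load values in [0, limit], nr_running in
 * [0, 3]) for which the removed condition holds although this_load != 0.
 */
static unsigned long removed_condition_counterexamples(unsigned long limit)
{
	unsigned long avg, cpuload, p_avg, p_cpuload, nr, found = 0;

	assert(limit <= 32); /* keep the exhaustive search small */

	for (avg = 0; avg <= limit; avg++)
	for (cpuload = 0; cpuload <= limit; cpuload++)
	for (p_avg = 0; p_avg <= limit; p_avg++)
	for (p_cpuload = 0; p_cpuload <= limit; p_cpuload++)
	for (nr = 0; nr <= 3; nr++) {
		unsigned long this_load = max_ul(avg, cpuload);
		unsigned long target_load = max_ul(p_avg, p_cpuload);
		unsigned long load = min_ul(p_avg, p_cpuload); /* assumed model */
		unsigned long tl_per_task = nr ? avg / nr : 0;

		if (this_load <= load &&
		    this_load + target_load <= tl_per_task &&
		    this_load != 0)
			found++;
	}
	return found;
}
```

In this model the count is zero: whenever the removed condition holds, this_load is already 0, which is the case the remaining test covers.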

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:25 +02:00
Vincent Guittot
afdeee0510 sched: Fix imbalance flag reset
The imbalance flag can stay set even when there is no imbalance.

Let's assume that we have 3 tasks that run on a dual-core / dual-cluster system.
We will have some idle load balancing which is triggered during the tick.
Unfortunately, the tick is also used to queue background work so we can reach
the situation where short work has been queued on a CPU which already runs a
task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
CPU) and will try to pull the waiting task to the idle CPU. The waiting task is
a worker thread that is pinned on a CPU so an imbalance due to a pinned task is
detected and the imbalance flag is set.

Then, we will not be able to clear the flag because we have at most 1 task on
each CPU, but the imbalance flag will trigger useless active load balancing
between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced
state. If all tasks are pinned, we don't consider that as a balanced state and
leave the imbalance flag set.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:24 +02:00
Aaron Tomlin
0d9e26329b sched: Add default-disabled option to BUG() when stack end location is overwritten
Currently in the event of a stack overrun a call to schedule()
does not check for this type of corruption. This corruption is
often silent and can go unnoticed. However once the corrupted
region is examined at a later stage, the outcome is undefined
and often results in a sporadic page fault which cannot be
handled.

This patch checks for a stack overrun and takes appropriate
action. Since the damage is already done, there is no point
in continuing.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: oleg@redhat.com
Cc: riel@redhat.com
Cc: prarit@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: mpe@ellerman.id.au
Cc: tglx@linutronix.de
Cc: rostedt@goodmis.org
Cc: hannes@cmpxchg.org
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lubomir Rintel <lkundrak@v3.sk>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1410527779-8133-4-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:24 +02:00
Aaron Tomlin
a70857e46d sched: Add helper for task stack page overrun checking
This facility is used in a few places so let's introduce
a helper function to improve code readability.
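
A minimal user-space sketch of what such a check amounts to, assuming a stack that grows down toward a magic word stored at its lowest address. The model_* names and the stack layout are hypothetical; only the STACK_END_MAGIC value is taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define STACK_END_MAGIC   0x57AC6E9DUL
#define MODEL_STACK_WORDS 64

/* Hypothetical model: words[0] is the end of the stack and holds the canary. */
struct stack_model {
	unsigned long words[MODEL_STACK_WORDS];
};

static void model_set_stack_end_magic(struct stack_model *s)
{
	s->words[0] = STACK_END_MAGIC;
}

/* The helper: one readable predicate instead of open-coded comparisons. */
static bool model_stack_end_corrupted(const struct stack_model *s)
{
	return s->words[0] != STACK_END_MAGIC;
}

/* Clobber the lowest n words, as a runaway downward-growing stack would. */
static void model_overrun(struct stack_model *s, size_t n)
{
	assert(n <= MODEL_STACK_WORDS);
	memset(s->words, 0xAA, n * sizeof(s->words[0]));
}
```

Callers then test the predicate rather than repeating the magic-value comparison at each site.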

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: oleg@redhat.com
Cc: riel@redhat.com
Cc: prarit@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: mpe@ellerman.id.au
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-3-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:23 +02:00
Aaron Tomlin
d4311ff1a8 init/main.c: Give init_task a canary
Tasks get their end of stack set to STACK_END_MAGIC with the
aim to catch stack overruns. Currently this feature does not
apply to init_task. This patch removes this restriction.

Note that a similar patch was posted by Prarit Bhargava
some time ago but was never merged:

  http://marc.info/?l=linux-kernel&m=127144305403241&w=2

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:22 +02:00
Kirill Tkhai
a15b12ac36 sched: Do not stop cpu in set_cpus_allowed_ptr() if task is not running
If a task is queued but not running on its rq, we can simply migrate
it without the migration thread and without a context switch.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1410519814.3569.7.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:21 +02:00
Kirill Tkhai
1ba93d4272 sched/dl: Simplify pick_dl_task()
1) Nobody calls pick_dl_task() with negative cpu, it's old RT leftover.

2) If p->nr_cpus_allowed is 1, then the affinity has just been changed
  in set_cpus_allowed_ptr(); we'll pick it just earlier than migration
  thread.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529340.3569.27.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:20 +02:00
Kirill Tkhai
f3f1768f89 sched/rt: Remove useless if from cleanup pick_next_task_rt()
_pick_next_task_rt() never returns NULL.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529321.3569.26.camel@tkhai
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:20 +02:00
Kirill Tkhai
f3cd1c4ec0 sched/core: Use put_prev_task() accessor where possible
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529300.3569.25.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:19 +02:00
Kirill Tkhai
a8edd07532 sched/fair: cleanup: Remove useless assignment in select_task_rq_fair()
new_cpu is reassigned below, so we do not need this here.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529276.3569.24.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:18 +02:00
Kirill Tkhai
f139caf2e8 sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule()
schedule(), io_schedule() and schedule_timeout() always return
with TASK_RUNNING state set, so one more setting is unnecessary.

(All places in the patch are visibly good; the only exception is
 kiblnd_scheduler() from:

      drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

 Its schedule() is one line above the standard 3 lines of unified diff context.)

There are no places where set_current_state() is used for its mb().

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Anil Belur <askb23@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: David Airlie <airlied@linux.ie>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry Eremin <dmitry.eremin@intel.com>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Isaac Huang <he.huang@intel.com>
Cc: James E.J. Bottomley <JBottomley@parallels.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Liang Zhen <liang.zhen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masaru Nomura <massa.nomura@gmail.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleg Drokin <green@linuxhacker.ru>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Robert Love <robert.w.love@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: devel@driverdev.osuosl.org
Cc: dm-devel@redhat.com
Cc: dri-devel@lists.freedesktop.org
Cc: fcoe-devel@open-fcoe.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux390@de.ibm.com
Cc: linux-afs@lists.infradead.org
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: qla2xxx-upstream@qlogic.com
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: user-mode-linux-user@lists.sourceforge.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:17 +02:00
Rik van Riel
9c368b5b6e sched, time: Fix lock inversion in thread_group_cputime()
The sig->stats_lock nests inside the tasklist_lock and the
sighand->siglock in __exit_signal and wait_task_zombie.

However, both of those locks can be taken from irq context,
which means we need to use the interrupt safe variant of
read_seqbegin_or_lock. This blocks interrupts when the "lock"
branch is taken (seq is odd), preventing the lock inversion.

On the first (lockless) pass through the loop, irqs are not
blocked.
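
The two-pass pattern can be modeled in single-threaded user-space C as follows. This is only a sketch under stated assumptions: the model_* names are hypothetical, a boolean stands in for the spinlock and another for the local interrupt state, and the "writer" is simulated by bumping the sequence during the first pass.

```c
#include <assert.h>
#include <stdbool.h>

struct model_seqlock {
	unsigned int sequence; /* even: no writer active */
	bool locked;           /* stands in for the spinlock */
	bool irqs_off;         /* stands in for the local interrupt state */
};

static void model_begin_or_lock_irqsave(struct model_seqlock *sl, int *seq)
{
	if (!(*seq & 1)) {
		*seq = (int)sl->sequence; /* lockless pass: sample only */
	} else {
		sl->irqs_off = true;      /* locked pass: block irqs first */
		assert(!sl->locked);
		sl->locked = true;
	}
}

static bool model_need_retry(const struct model_seqlock *sl, int seq)
{
	return !(seq & 1) && sl->sequence != (unsigned int)seq;
}

static void model_done_irqrestore(struct model_seqlock *sl, int seq)
{
	if (seq & 1) {
		sl->locked = false;
		sl->irqs_off = false;
	}
}

/* Read *value; returns the number of passes needed. */
static int model_read(struct model_seqlock *sl, const long *value,
		      bool interfere, long *out, bool *irqs_off_seen)
{
	int seq, nextseq = 0, passes = 0;

	*irqs_off_seen = false;
	do {
		seq = nextseq;
		model_begin_or_lock_irqsave(sl, &seq);
		*out = *value;
		if (sl->irqs_off)
			*irqs_off_seen = true;
		if (interfere && passes == 0)
			sl->sequence += 2; /* a write completed meanwhile */
		passes++;
		nextseq = 1; /* if the lockless pass failed, take the lock */
	} while (model_need_retry(sl, seq));
	model_done_irqrestore(sl, seq);
	return passes;
}
```

In the model, interrupts are only blocked on the second (odd seq) pass, and both the lock and the interrupt state are restored afterwards.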

Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: prarit@redhat.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1410527535-9814-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:17 +02:00
Chuansheng Liu
c6f4459fc3 smp: Add new wake_up_all_idle_cpus() function
Currently kick_all_cpus_sync() can break non-polling idle cpus
through IPI interrupts.

But sometimes we need to break the polling idle cpus immediately
to reselect a suitable C-state; for non-idle cpus, we need
to do nothing when we try to wake them up.

Add a new function, wake_up_all_idle_cpus(), to bring all cpus
out of idle based on function wake_up_if_idle().
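
A hedged user-space model of the loop this describes: every online CPU other than the caller is offered a wakeup, and only idle ones are actually disturbed. The model_* names, the fixed CPU count and the choice to skip the calling CPU are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU state; not a kernel structure. */
struct model_cpu {
	bool online;
	bool idle;
	int wakeups; /* how many times this CPU was kicked out of idle */
};

/* Does nothing for a busy CPU, as the description requires. */
static void model_wake_up_if_idle(struct model_cpu *c)
{
	if (c->idle) {
		c->idle = false;
		c->wakeups++;
	}
}

static void model_wake_up_all_idle_cpus(struct model_cpu cpus[MODEL_NR_CPUS],
					int self)
{
	assert(self >= 0 && self < MODEL_NR_CPUS);

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu == self || !cpus[cpu].online)
			continue;
		model_wake_up_if_idle(&cpus[cpu]);
	}
}
```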

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: daniel.lezcano@linaro.org
Cc: rjw@rjwysocki.net
Cc: linux-pm@vger.kernel.org
Cc: changcheng.liu@intel.com
Cc: xiaoming.wang@intel.com
Cc: souvik.k.chakravarty@intel.com
Cc: luto@amacapital.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1409815075-4180-2-git-send-email-chuansheng.liu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:15 +02:00
Chuansheng Liu
f6be8af1c9 sched: Add new API wake_up_if_idle() to wake up the idle cpu
Implement a new API, wake_up_if_idle(), which is used to
wake up an idle CPU.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: daniel.lezcano@linaro.org
Cc: rjw@rjwysocki.net
Cc: linux-pm@vger.kernel.org
Cc: changcheng.liu@intel.com
Cc: xiaoming.wang@intel.com
Cc: souvik.k.chakravarty@intel.com
Cc: chuansheng.liu@intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409815075-4180-1-git-send-email-chuansheng.liu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:14 +02:00
Rik van Riel
ba7e5a279e sched/numa: Use select_idle_sibling() to select a destination for task_numa_move()
The code in task_numa_compare() will only examine at most one idle CPU per node,
because they all have the same score. However, some idle CPUs are better
candidates than others, due to busy or idle SMT siblings, etc...

The scheduler has logic to find the best CPU within an LLC to place a
task. The NUMA code should probably use it.

This seems to reduce the standard deviation for single instance SPECjbb2005
with a low warehouse count on my 4 node test system.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140904163530.189d410a@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:14 +02:00
Paul E. McKenney
dd56af42bd rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
Currently, the expedited grace-period primitives do get_online_cpus().
This greatly simplifies their implementation, but means that calls
to them holding locks that are acquired by CPU-hotplug notifiers (to
say nothing of calls to these primitives from CPU-hotplug notifiers)
can deadlock.  But this is starting to become inconvenient, as can be
seen here: https://lkml.org/lkml/2014/8/5/754.  The problem in this
case is that some developers need to acquire a mutex from a CPU-hotplug
notifier, but also need to hold it across a synchronize_rcu_expedited().
As noted above, this currently results in deadlock.

This commit avoids the deadlock and retains the simplicity by creating
a try_get_online_cpus(), which returns false if the get_online_cpus()
reference count could not immediately be incremented.  If a call to
try_get_online_cpus() returns true, the expedited primitives operate as
before.  If a call returns false, the expedited primitives fall back to
normal grace-period operations.  This falling back of course results in
increased grace-period latency, but only during times when CPU hotplug
operations are actually in flight.  The effect should therefore be
negligible during normal operation.
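
The try-then-fall-back shape can be sketched in user-space C. This is an illustration under assumptions, not the RCU implementation: the model_* names and the enum are hypothetical, and the model is single-threaded (a real version needs proper synchronization around the reference count).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the hotplug reader/writer state. */
struct hotplug_model {
	bool writer_active; /* a CPU-hotplug operation is in flight */
	int refcount;       /* readers currently holding a reference */
};

/* Never blocks: either takes a reference or reports failure. */
static bool model_try_get_online_cpus(struct hotplug_model *hp)
{
	if (hp->writer_active)
		return false;
	hp->refcount++;
	return true;
}

static void model_put_online_cpus(struct hotplug_model *hp)
{
	assert(hp->refcount > 0);
	hp->refcount--;
}

enum model_gp_kind { MODEL_GP_EXPEDITED, MODEL_GP_NORMAL };

static enum model_gp_kind model_synchronize_expedited(struct hotplug_model *hp)
{
	if (!model_try_get_online_cpus(hp))
		return MODEL_GP_NORMAL; /* slower, but cannot deadlock */

	/* ... the expedited machinery would run here ... */
	model_put_online_cpus(hp);
	return MODEL_GP_EXPEDITED;
}
```

The fallback path never waits on the hotplug side, which is what removes the deadlock in this model.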

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Tested-by: Lan Tianyu <tianyu.lan@intel.com>
2014-09-18 16:22:27 -07:00
Zefan Li
52de4779f2 cpuset: simplify proc_cpuset_show()
Use the ONE macro instead of REG, and we can simplify proc_cpuset_show().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 13:27:23 -04:00
Zefan Li
006f4ac497 cgroup: simplify proc_cgroup_show()
Use the ONE macro instead of REG, and we can simplify proc_cgroup_show().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 13:27:23 -04:00
Zefan Li
971ff49355 cgroup: use a per-cgroup work for release agent
Instead of using a global work to schedule release agent on removable
cgroups, we change to use a per-cgroup work to do this, which makes
the code much simpler.

v2: use a dedicated work instead of reusing css->destroy_work. (Tejun)

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 13:14:22 -04:00
Zefan Li
eb4aec84d6 cgroup: fix unbalanced locking
cgroup_pidlist_start() holds cgrp->pidlist_mutex and then calls
pidlist_array_load(), and cgroup_pidlist_stop() releases the mutex.

It is wrong that we release the mutex in the failure path in
pidlist_array_load(), because cgroup_pidlist_stop() will be called
no matter if cgroup_pidlist_start() returns errno or not.
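
A small model of the start/stop contract described above, with hypothetical model_* names and a counter standing in for the mutex. Because stop always runs, the load step must leave the lock held even when it fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* depth is 1 while held and 0 while free; any other value is a bug. */
struct model_mutex {
	int depth;
};

static void model_lock(struct model_mutex *m)   { m->depth++; }
static void model_unlock(struct model_mutex *m) { m->depth--; }

/* Called with the mutex held; must NOT release it on failure. */
static int model_array_load(struct model_mutex *m, bool fail)
{
	assert(m->depth == 1);
	return fail ? -ENOMEM : 0;
}

static int model_start(struct model_mutex *m, bool fail)
{
	model_lock(m);
	return model_array_load(m, fail);
}

static void model_stop(struct model_mutex *m)
{
	model_unlock(m);
}

/* Driver in the style described: stop runs whether or not start failed. */
static int model_read_once(struct model_mutex *m, bool fail)
{
	int err = model_start(m, fail);

	model_stop(m);
	assert(m->depth == 0); /* balanced on both paths */
	return err;
}
```

Adding an unlock to the failure branch of model_array_load() would drive depth to -1 after model_stop(), which is the imbalance the fix removes.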

Fixes: 4bac00d16a
Cc: <stable@vger.kernel.org> # 3.14+
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
2014-09-18 12:32:52 -04:00
Li Zefan
0c8fc2c121 cgroup: remove bogus comments
We never grab cgroup mutex in fork and exit paths no matter whether
notify_on_release is set or not.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 06:34:16 +09:00
Li Zefan
244bb9a633 cgroup: remove redundant code in cgroup_rmdir()
We no longer clear kn->priv in cgroup_rmdir(), so we don't need
to get an extra refcnt.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 06:34:15 +09:00
Li Zefan
6213daab25 cgroup: remove some useless forward declarations
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 06:34:15 +09:00
Tejun Heo
9253b279f4 Merge branch 'for-3.17-fixes' of ra.kernel.org:/pub/scm/linux/kernel/git/tj/cgroup into for-3.18
Pull to receive a4189487da ("cgroup: delay the clearing of
cgrp->kn->priv") for the scheduled clean up patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 06:29:05 +09:00
Paul E. McKenney
59da22a020 rcutorture: Rename rcutorture_runnable parameter
This commit changes rcutorture_runnable to torture_runnable, which is
consistent with the names of the other parameters and is a bit shorter
as well.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:44 -07:00
Davidlohr Bueso
630952c22b locktorture: Introduce torture context
The number of global variables is getting pretty ugly. Group variables
related to the execution (i.e., not parameters) in a new context structure.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:40 -07:00
Davidlohr Bueso
4a3b427f0b locktorture: Support rwsems
We can easily do so with our new reader lock support. Just an arbitrary
design default: readers have higher (5x) critical region latencies than
writers: 50 ms and 10 ms, respectively.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:36 -07:00
Davidlohr Bueso
4f6332c1dc locktorture: Add infrastructure for torturing read locks
Most of it is based on what we already have for writers. This allows
readers to be very independent (and thus configurable), enabling
future module parameters to control things such as rw distribution.
Furthermore, readers have their own delaying function, allowing us
to test different rw critical region latencies, and stress locking
internals. Similarly, statistics for now only track the number of
lock acquisitions -- as opposed to writers, readers have
no failure detection.

In addition, introduce a new nreaders_stress module parameter. The
default number of readers will be the same as the number of writer threads.
Writer threads are interleaved with readers. Documentation is updated
accordingly.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:07 -07:00
Davidlohr Bueso
d36a7a0d5e torture: Address race in module cleanup
When performing module cleanups by calling torture_cleanup() the
'torture_type' string is nullified. However, callers are not necessarily
done, and might still need to reference the variable. This impacts
both rcutorture and locktorture, causing output such as:

[   94.226618] (null)-torture: Stopping lock_torture_writer task
[   94.226624] (null)-torture: Stopping lock_torture_stats task

Thus delay this operation until the very end of the cleanup process.
The consequence (which shouldn't matter for this kind of program) is,
of course, that we delay the window between rmmod and modprobing,
for instance in module_torture_begin().

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:06 -07:00
Davidlohr Bueso
1e6757a921 locktorture: Make statistics generic
The statistics structure can serve well for both reader and writer
locks, thus simply rename some fields that mention 'write' and leave
the declaration of lwsa.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:05 -07:00
Davidlohr Bueso
f095bfc0ea locktorture: Teach about lock debugging
Regular locks are very different from locks with debugging. For instance,
for mutexes, debugging forces only the slowpaths to be taken. As such, the
locktorture module should take this into account when printing related
information -- specifically when printing user passed parameters, it seems
the right place for such info.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:04 -07:00
Davidlohr Bueso
42ddc75ddd locktorture: Support mutexes
Add a "mutex_lock" torture test. The main difference with the already
existing spinlock tests is that the latency of the critical region
is much larger. We randomly delay for (arbitrarily) either 500 ms or,
otherwise, 25 ms. While this can considerably reduce the amount of
writes compared to non blocking locks, if run long enough it can have
the same torturous effect. Furthermore it is more representative of
mutex hold times and can better stress things like thrashing.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:40:59 -07:00
Davidlohr Bueso
23a8e5c2d2 locktorture: Rename locktorture_runnable parameter
... to just 'torture_runnable'. It follows other variable naming
and is shorter.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 10:15:48 -07:00
Paul E. McKenney
96b4672703 Merge branch 'rcu-tasks.2014.09.10a' into HEAD
rcu-tasks.2014.09.10a: Add RCU-tasks flavor of RCU.
2014-09-16 10:10:44 -07:00
Paul E. McKenney
e98d06dd6c Merge branches 'doc.2014.09.07a', 'fixes.2014.09.10a', 'nocb-nohz.2014.09.16b' and 'torture.2014.09.07a' into HEAD
doc.2014.09.07a: Documentation updates.
fixes.2014.09.10a: Miscellaneous fixes.
nocb-nohz.2014.09.16b: No-CBs CPUs and NO_HZ_FULL updates.
torture.2014.09.07a: Torture-test updates.
2014-09-16 10:08:34 -07:00
Paul E. McKenney
c847f14217 rcu: Avoid misordering in nocb_leader_wait()
The NOCB follower wakeup ordering depends on the store to the tail
pointer happening before the wakeup.  However, because atomic_long_add()
does not return a value, it does not provide ordering guarantees, and
the locking in wake_up() only guarantees that the store will happen
before the unlock, which might be too late.  Even though this is only a
theoretical issue, this commit adds a smp_mb__after_atomic() after the
final atomic_long_add() to provide the needed ordering guarantee.

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney
1772947bd0 rcu: Handle NOCB callbacks from irq-disabled idle code
If an RCU callback is queued on a no-CBs CPU from idle code with irqs
disabled, and if that CPU stays idle forever after, the callback will
never be invoked.  This commit therefore adds a check for this situation
in ____call_rcu_nocb(), invoking the RCU core solely for the purpose
of the ensuing return-to-idle transition.  (If the CPU doesn't return
to idle, the next scheduling-clock interrupt will fix things up.)

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney
39953dfd40 rcu: Avoid misordering in __call_rcu_nocb_enqueue()
The NOCB leader wakeup ordering depends on the store to the header
happening before the check for the leader already being awake.  However,
because atomic_long_add() does not return a value, it does not provide
ordering guarantees, the incorrect comment in wake_nocb_leader()
notwithstanding.  This commit therefore adds a smp_mb__after_atomic()
after the final atomic_long_add() to provide the needed ordering
guarantee.
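
A rough C11 analogue of the ordering problem, offered as a sketch rather than the kernel code: a relaxed read-modify-write whose result is ignored implies no ordering, so a full fence is placed before the subsequent check. The structure and function names are hypothetical, and atomic_thread_fence(memory_order_seq_cst) merely plays the role that smp_mb__after_atomic() plays in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical queue state shared between an enqueuer and a waiter. */
struct model_queue {
	atomic_long count;         /* number of queued callbacks */
	atomic_bool waiter_asleep; /* set by the waiter before it sleeps */
};

/* Returns true if the caller must wake the waiter. */
static bool model_enqueue(struct model_queue *q, long n)
{
	assert(n > 0);

	/* Relaxed add with the result ignored: no ordering is implied. */
	atomic_fetch_add_explicit(&q->count, n, memory_order_relaxed);

	/* Full fence so the add is ordered before the check below. */
	atomic_thread_fence(memory_order_seq_cst);

	return atomic_load_explicit(&q->waiter_asleep, memory_order_relaxed);
}
```

A single-threaded run can only confirm the arithmetic; the benefit of the fence shows up only when a concurrent waiter performs the mirror-image sequence (store its sleeping flag, fence, then read the count).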

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney
663e131090 rcu: Don't track sysidle state if no nohz_full= CPUs
If there are no nohz_full= CPUs, then there is currently no reason to
track sysidle state.  This commit therefore short-circuits this state
tracking if !tick_nohz_full_enabled().

Note that these checks will need to be revisited if nohz_full= state
can ever be changed at runtime.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney
417e8d2655 rcu: Eliminate redundant rcu_sysidle_state variable
Now that we have rcu_state_p, which references rcu_preempt_state for
TREE_PREEMPT_RCU and rcu_sched_state for TREE_RCU, we don't need a
separate rcu_sysidle_state variable.  This commit therefore eliminates
rcu_sysidle_state in favor of rcu_state_p.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Pranith Kumar
22c2f66961 rcu: Check for have_rcu_nocb_mask instead of rcu_nocb_mask
If we configure a kernel with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_NONE=y and
CONFIG_CPUMASK_OFFSTACK=n and do not pass in a rcu_nocbs= boot parameter, the
cpumask rcu_nocb_mask can be garbage instead of NULL.

Hence this commit replaces checks for rcu_nocb_mask == NULL with a check for
have_rcu_nocb_mask.
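
Why a NULL test is unreliable here can be seen in a small model where the mask type is a one-element array, which is the shape assumed for the CONFIG_CPUMASK_OFFSTACK=n case. All model_* names are hypothetical; the point is only that an array never compares equal to NULL, so a separate flag has to record whether the mask was set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_cpumask {
	unsigned long bits;
};

/* Array flavor: the storage always exists, so its address is never NULL. */
typedef struct model_cpumask model_cpumask_var_t[1];

static model_cpumask_var_t model_nocb_mask;
static bool model_have_nocb_mask; /* set only once the mask is initialized */

static bool model_is_null(const struct model_cpumask *m)
{
	return m == NULL;
}

static void model_setup_nocb_mask(unsigned long bits)
{
	model_nocb_mask[0].bits = bits;
	model_have_nocb_mask = true;
}

/* Consult the flag, not the pointer, before trusting the mask contents. */
static bool model_cpu_is_nocb(unsigned int cpu)
{
	assert(cpu < sizeof(unsigned long) * 8);

	if (!model_have_nocb_mask)
		return false;
	return (model_nocb_mask[0].bits >> cpu) & 1UL;
}
```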

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney
35ce7f29a4 rcu: Create rcuo kthreads only for onlined CPUs
RCU currently uses for_each_possible_cpu() to spawn rcuo kthreads,
which can result in more rcuo kthreads than one would expect, for
example, derRichard reported 64 CPUs worth of rcuo kthreads on an
8-CPU image.  This commit therefore creates rcuo kthreads only for
those CPUs that actually come online.

This was reported by derRichard on the OFTC IRC network.

Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney
9386c0b75d rcu: Rationalize kthread spawning
Currently, RCU spawns kthreads from several different early_initcall()
functions.  Although this has served RCU well for quite some time,
as more kthreads are added a more deterministic approach is required.
This commit therefore causes all of RCU's early-boot kthreads to be
spawned from a single early_initcall() function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar
f4aa84ba24 rcu: Return false instead of 0 in rcu_nocb_adopt_orphan_cbs()
Return false instead of 0 in rcu_nocb_adopt_orphan_cbs() as this has
bool as return type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar
4afc7e269b rcu: Use false for return in __call_rcu_nocb()
Return false instead of 0 in __call_rcu_nocb() as this has bool as
return type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar
0a9e1e111b rcu: Use true/false for return in rcu_nocb_adopt_orphan_cbs()
Return true/false in rcu_nocb_adopt_orphan_cbs() instead of 0/1 as
this function has return type of bool.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Pranith Kumar
c271d3a957 rcu: Use true/false for return in __call_rcu_nocb()
Return true/false instead of 0/1 in __call_rcu_nocb() as this returns a
bool type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Pranith Kumar
949cccdbe6 rcu: Check the return value of zalloc_cpumask_var()
This commit checks the return value of the zalloc_cpumask_var() used for
allocating cpumask for rcu_nocb_mask.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Paul E. McKenney
f4579fc57c rcu: Fix attempt to avoid unsolicited offloading of callbacks
Commit b58cc46c5f (rcu: Don't offload callbacks unless specifically
requested) failed to adjust the callback lists of the CPUs that are
known to be no-CBs CPUs only because they are also nohz_full= CPUs.
This failure can result in callbacks that are posted during early boot
getting stranded on nxtlist for CPUs whose no-CBs property becomes
apparent late, and there can also be spurious warnings about offline
CPUs posting callbacks.

This commit fixes these problems by adding an early-boot rcu_init_nohz()
that properly initializes the no-CBs CPUs.

Note that kernels built with CONFIG_RCU_NOCB_CPU_ALL=y or with
CONFIG_RCU_NOCB_CPU=n do not exhibit this bug.  Neither do kernels
booted without the nohz_full= boot parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:07:59 -07:00
Jiri Olsa
c88f209613 perf: Do not check PERF_EVENT_STATE_EXIT on syscall read path
Revert PERF_EVENT_STATE_EXIT check on read syscall path.
It breaks standard way to read counter, which is to open
the counter, wait for the monitored process to die and
read the counter.

Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: David Ahern <dsahern@gmail.com>
Link: http://lkml.kernel.org/r/20140908143107.GG17728@krava.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 10:30:36 +02:00
Davidlohr Bueso
db0e716a15 locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition
rw-semaphore is the only type of lock doing this ugliness of
exporting at the end of the file.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: dave@stgolabs.net
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1410500066-5909-1-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 09:49:01 +02:00
Linus Torvalds
1536340e7c Merge branches 'locking-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex and timer fixes from Thomas Gleixner:
 "A oneliner bugfix for the jinxed futex code:

   - Drop hash bucket lock in the error exit path.  I really could slap
     myself for introducing that bug while fixing all the other horror
     in that code three months ago ...

  and the timer department is not too proud about the following fixes:

   - Deal with a long standing rounding bug in the timeval to jiffies
     conversion.  It's a real issue and this fix fell through the cracks
     for quite some time.

   - Another round of alarmtimer fixes.  Finally this code gets used
     more widely and the subtle issues hidden for quite some time are
     noticed and fixed.  Nothing really exciting, just the itty bitty
     details which bite the serious users here and there"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Unlock hb->lock in futex_wait_requeue_pi() error path

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Lock k_itimer during timer callback
  alarmtimer: Do not signal SIGEV_NONE timers
  alarmtimer: Return relative times in timer_gettime
  jiffies: Fix timeval conversion to jiffies
2014-09-13 14:22:12 -07:00
Frederic Weisbecker
9b01f5bf39 nohz: nohz full depends on irq work self IPI support
The nohz full functionality depends on IRQ work to trigger its own
interrupts. As it's used to restart the tick, we can't rely on the tick
fallback for irq work callbacks, ie: we can't use the tick to restart
the tick itself.

Let's reject the full dynticks initialization if that arch support isn't
available.

As a side effect, this makes sure that nohz kick is never called from
the tick. That otherwise would result in illegal hrtimer self-cancellation
and lockup.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:46:41 +02:00
Frederic Weisbecker
4327b15f64 nohz: Consolidate nohz full init code
Support for CONFIG_NO_HZ_FULL_ALL=y and support for the nohz_full= kernel
parameter each have their own way of doing the same thing: allocating
full dynticks cpumasks, filling them, and initializing some state variables.

Let's consolidate all of that in the same place.

While at it, convert some regular printk messages to warnings when
fundamental allocations fail.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:46:40 +02:00
Frederic Weisbecker
76a33061b9 irq_work: Force raised irq work to run on irq work interrupt
The nohz full kick, which restarts the tick when any resource depends
on it, can't be executed just anywhere given the operation it does on timers.
If it is called from the scheduler or timers code, chances are that
we run into a deadlock.

This is why we run the nohz full kick from an irq work. That way we make
sure that the kick runs on a virgin context.

However if that's the case when irq work runs in its own dedicated
self-ipi, things are different for the big bunch of archs that don't
support the self triggered way. In order to support them, irq works are
also handled by the timer interrupt as fallback.

Now when irq works run on the timer interrupt, the context isn't blank.
More precisely, they can run in the context of the hrtimer that runs the
tick. But the nohz kick cancels and restarts this hrtimer and cancelling
an hrtimer from itself isn't allowed. This is why we run in an endless
loop:

	Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
	CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
	Workqueue: btrfs-endio-write normal_work_helper [btrfs]
	 ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
	 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
	 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
	Call Trace:
	 <NMI>  [<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
	 [<ffffffff8a7ef928>] panic+0xd4/0x207
	 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
	 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
	 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
	 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
	 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20
	 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
	 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
	 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390
	 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
	 [<ffffffff8a008268>] do_nmi+0xb8/0x100
	 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 <<EOE>>  <IRQ>  [<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
	 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
	 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
	 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
	 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
	 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
	 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
	 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
	 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
	 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
	 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70
	 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
	 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
	 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
	 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
	 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
	 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
	 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
	 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80

To fix this we force non-lazy irq works to run on irq work self-IPIs
when available. That ability of the arch to trigger irq work self IPIs
is available with arch_irq_work_has_interrupt().

Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:38:15 +02:00
Frederic Weisbecker
a80e49e2cc nohz: Move nohz full init call to tick init
This way we unbloat a bit main.c and more importantly we initialize
nohz full after init_IRQ(). This dependency will be needed in further
patches because nohz full needs irq work to raise its own IRQ.
Information about the support for this ability on ARM64 is obtained in
init_IRQ(), which initializes the pointer to __smp_call_function.

Since tick_init() is called right after init_IRQ(), this is a good place
to call tick_nohz_init() and prepare for that dependency.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:34:44 +02:00
Steven Rostedt (Red Hat)
3ddee63a09 ftrace: Only disable ftrace_enabled to test buffer in selftest
The ftrace_enabled variable is set to zero in the self tests to keep
delayed functions from being traced and messing with the checks. This
only needs to be done while the checks are being performed. Otherwise,
if ftrace_enabled is off when the test calls back into the utility being
tested, errors can occur and the tests can fail with false positives.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-12 20:48:49 -04:00
Steven Rostedt (Red Hat)
84bde62ca4 ftrace: Add sanity check when unregistering last ftrace_ops
When the last ftrace_ops is unregistered, all the function records should
have a zeroed flags value. Make sure that is the case when the last ftrace_ops
is unregistered.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-12 20:48:43 -04:00
Richard Larocque
474e941bed alarmtimer: Lock k_itimer during timer callback
Locks the k_itimer's it_lock member when handling the alarm timer's
expiry callback.

The regular posix timers defined in posix-timers.c have this lock held
during timeout processing because their callbacks are routed through
posix_timer_fn().  The alarm timers follow a different path, so they
ought to grab the lock somewhere else.

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:12 -07:00
Richard Larocque
265b81d23a alarmtimer: Do not signal SIGEV_NONE timers
Avoids sending a signal to alarm timers created with sigev_notify set to
SIGEV_NONE by checking for that special case in the timeout callback.

The regular posix timers avoid sending signals to SIGEV_NONE timers by
not scheduling any callbacks for them in the first place.  Although it
would be possible to do something similar for alarm timers, it's simpler
to handle this as a special case in the timeout.

Prior to this patch, the alarm timer would ignore the sigev_notify value
and try to deliver signals to the process anyway.  Even worse, the
sanity check for the value of sigev_signo is skipped when SIGEV_NONE was
specified, so the signal number could be bogus.  If sigev_signo was an
uninitialized value (as it often would be if SIGEV_NONE is used), then
it's hard to predict which signal will be sent.

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:12 -07:00
Richard Larocque
e86fea7649 alarmtimer: Return relative times in timer_gettime
Returns the time remaining for an alarm timer, rather than the time at
which it is scheduled to expire.  If the timer has already expired or it
is not currently scheduled, the it_value's members are set to zero.
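
As a rough illustration of the rule described above, the following sketch
computes the value to report. It is not the kernel's implementation:
remaining_ns() is a hypothetical helper, times are plain nanosecond counts,
and the 'armed' flag stands in for "currently scheduled".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not the kernel's code: times are plain nanosecond
 * counts and 'armed' stands in for "the timer is currently scheduled". */
int64_t remaining_ns(int64_t expires_ns, int64_t now_ns, bool armed)
{
	if (!armed || expires_ns <= now_ns)
		return 0;	/* expired or not scheduled: report zero */
	return expires_ns - now_ns;	/* time left, not the absolute expiry */
}
```

With an expiry of 1500 and a current time of 1000, the sketch reports 500
rather than 1500, which is the difference between the new and old behavior.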

This new behavior matches that of the other posix-timers and the POSIX
specifications.

This is a change in user-visible behavior, and may break existing
applications.  Hopefully, few users rely on the old incorrect behavior.

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
[jstultz: minor style tweak]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:11 -07:00
Andrew Hunter
d78c9300c5 jiffies: Fix timeval conversion to jiffies
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly when repeatedly stopping/starting an itimer:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.)  Doing
this repeatedly would cause unbounded growth in val.  So fix the math.

Here's what was wrong with the conversion: we essentially computed
(eliding seconds)

jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies.  This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.
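
The arithmetic above can be reproduced in user space with a small sketch.
The constants here are assumptions chosen for illustration (HZ=1000 with an
idealized 1 ms tick and a 40-bit scaling shift); the kernel derives its own
values, and the second function is a simplified stand-in for rounding in
nanoseconds, not the actual conversion routine.

```c
#include <stdint.h>

/* Assumed for illustration: HZ=1000 with an idealized 1 ms tick and a
 * 40-bit scaling shift. The kernel computes its real constants itself. */
#define TICK_NSEC	1000000ULL
#define NSEC_PER_USEC	1000ULL
#define SCALE_SHIFT	40

/* Old scheme: multiply by a rounded-up scaled factor, then round up again. */
uint64_t usec_to_jiffies_old(uint64_t usec)
{
	/* x = ceil(NSEC_PER_USEC * 2^SCALE_SHIFT / TICK_NSEC), a slight overestimate */
	uint64_t x = ((NSEC_PER_USEC << SCALE_SHIFT) + TICK_NSEC - 1) / TICK_NSEC;
	uint64_t round = (1ULL << SCALE_SHIFT) - 1;

	return (usec * x + round) >> SCALE_SHIFT;
}

/* Rounding up in nanoseconds: exact multiples of a tick stay exact. */
uint64_t usec_to_jiffies_ns(uint64_t usec)
{
	uint64_t nsec = usec * NSEC_PER_USEC;

	return (nsec + TICK_NSEC - 1) / TICK_NSEC;
}
```

Under these assumptions usec_to_jiffies_old(10000) yields 11 while
usec_to_jiffies_ns(10000) yields 10, and an input of 10001 usec still
rounds up to 11 in the second version.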

Tested: the following program:

#include <stdio.h>
#include <sys/time.h>

int main(void) {
  struct itimerval zero = {{0, 0}, {0, 0}};
  /* Initially set to 10 ms. */
  struct itimerval initial = zero;
  initial.it_interval.tv_usec = 10000;
  setitimer(ITIMER_PROF, &initial, NULL);
  /* Save and restore several times. */
  for (size_t i = 0; i < 10; ++i) {
    struct itimerval prev;
    setitimer(ITIMER_PROF, &zero, &prev);
    /* on old kernels, this goes up by TICK_USEC every iteration */
    printf("previous value: %ld %ld %ld %ld\n",
           prev.it_interval.tv_sec, prev.it_interval.tv_usec,
           prev.it_value.tv_sec, prev.it_value.tv_usec);
    setitimer(ITIMER_PROF, &prev, NULL);
  }
  return 0;
}

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Reported-by: Aaron Jacobs <jacobsa@google.com>
Signed-off-by: Andrew Hunter <ahh@google.com>
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:03 -07:00
Thomas Gleixner
13c42c2f43 futex: Unlock hb->lock in futex_wait_requeue_pi() error path
futex_wait_requeue_pi() calls futex_wait_setup(). If
futex_wait_setup() succeeds it returns with hb->lock held and
preemption disabled. Now the sanity check after this does:

	if (match_futex(&q.key, &key2)) {
		ret = -EINVAL;
		goto out_put_keys;
	}

which releases the keys but does not release hb->lock.

So we happily return to user space with hb->lock held and therefore
preemption disabled.

Unlock hb->lock before taking the exit route.
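
The shape of the fix can be sketched in user space with a toy lock. All
names below are hypothetical: toy_lock only records whether it is held, and
'keys_match' plays the role of the match_futex() check. This is not the
kernel code, only the pattern of releasing the lock before leaving through
the error label.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for hb->lock; it only records whether it is held. */
struct toy_lock { bool held; };

void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
void toy_lock_release(struct toy_lock *l) { l->held = false; }

/* Hypothetical analogue of the fixed error path. */
int wait_requeue_sketch(struct toy_lock *l, bool keys_match)
{
	int ret = 0;

	toy_lock_acquire(l);		/* setup succeeded: lock is now held */
	if (keys_match) {
		toy_lock_release(l);	/* the unlock the buggy path lacked */
		ret = -EINVAL;
		goto out;
	}
	/* ... the normal path would queue and wait here ... */
	toy_lock_release(l);
out:
	return ret;
}
```

Without the release inside the if block, the function would return -EINVAL
with l->held still true, which mirrors returning to user space with the
lock held.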

Reported-by: Dave "Trinity" Jones <davej@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-12 22:04:36 +02:00
Rasmus Villemoes
acbbe6fbb2 kcmp: fix standard comparison bug
The C operator <= defines a perfectly fine total ordering on the set of
values representable in a long.  However, unlike its namesake in the
integers, it is not translation invariant, meaning that we do not have
"b <= c" iff "a+b <= a+c" for all a,b,c.

This means that it is always wrong to try to boil down the relationship
between two longs to a question about the sign of their difference,
because the resulting relation [a LEQ b iff a-b <= 0] is neither
anti-symmetric nor transitive.  The former is due to -LONG_MIN==LONG_MIN
(take any two a,b with a-b = LONG_MIN; then a LEQ b and b LEQ a, but a !=
b).  The latter can either be seen observing that x LEQ x+1 for all x,
implying x LEQ x+1 LEQ x+2 ...  LEQ x-1 LEQ x; or more directly with the
simple example a=LONG_MIN, b=0, c=1, for which a-b < 0, b-c < 0, but a-c >
0.
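
The counterexample can be checked with a short sketch. The function names
are made up for illustration. The subtraction is done on unsigned long to
avoid undefined signed overflow, and the conversion back to long is assumed
to wrap (two's complement), as it does with common compilers.

```c
#include <limits.h>

/* "a LEQ b iff a-b <= 0", computed with wrapping arithmetic (assumed
 * two's complement conversion back to long). */
int leq_by_difference(long a, long b)
{
	return (long)((unsigned long)a - (unsigned long)b) <= 0;
}

/* Direct comparison: 0 if equal, 1 if a < b, 2 if a > b. */
int cmp_direct(long a, long b)
{
	return (a < b) | ((a > b) << 1);
}

/* The example from the text: a=LONG_MIN, b=0, c=1. */
int shows_non_transitivity(void)
{
	long a = LONG_MIN, b = 0, c = 1;

	return leq_by_difference(a, b) && leq_by_difference(b, c) &&
	       !leq_by_difference(a, c);
}
```

Comparing the values directly, as cmp_direct() does, never forms the
difference and so cannot wrap.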

Note that it makes absolutely no difference that a transmogrifying bijection
has been applied before the comparison is done.  In fact, had the
obfuscation not been done, one could probably not observe the bug
(assuming all values being compared always lie in one half of the address
space, the mathematical value of a-b is always representable in a long).
As it stands, one can easily obtain three file descriptors exhibiting the
non-transitivity of kcmp().

Side note 1: I can't see that ensuring the MSB of the multiplier is
set serves any purpose other than obfuscating the obfuscating code.

Side note 2:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <sys/syscall.h>

enum kcmp_type {
        KCMP_FILE,
        KCMP_VM,
        KCMP_FILES,
        KCMP_FS,
        KCMP_SIGHAND,
        KCMP_IO,
        KCMP_SYSVSEM,
        KCMP_TYPES,
};
pid_t pid;

int kcmp(pid_t pid1, pid_t pid2, int type,
	 unsigned long idx1, unsigned long idx2)
{
	return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}
int cmp_fd(int fd1, int fd2)
{
	int c = kcmp(pid, pid, KCMP_FILE, fd1, fd2);
	if (c < 0) {
		perror("kcmp");
		exit(1);
	}
	assert(0 <= c && c < 3);
	return c;
}
int cmp_fdp(const void *a, const void *b)
{
	static const int normalize[] = {0, -1, 1};
	return normalize[cmp_fd(*(int*)a, *(int*)b)];
}
#define MAX 100 /* This is plenty; I've seen it trigger for MAX==3 */
int main(int argc, char *argv[])
{
	int r, s, count = 0;
	int REL[3] = {0,0,0};
	int fd[MAX];
	pid = getpid();
	while (count < MAX) {
		r = open("/dev/null", O_RDONLY);
		if (r < 0)
			break;
		fd[count++] = r;
	}
	printf("opened %d file descriptors\n", count);
	for (r = 0; r < count; ++r) {
		for (s = r+1; s < count; ++s) {
			REL[cmp_fd(fd[r], fd[s])]++;
		}
	}
	printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
	qsort(fd, count, sizeof(fd[0]), cmp_fdp);
	memset(REL, 0, sizeof(REL));

	for (r = 0; r < count; ++r) {
		for (s = r+1; s < count; ++s) {
			REL[cmp_fd(fd[r], fd[s])]++;
		}
	}
	printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
	return (REL[0] + REL[2] != 0);
}

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
"Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10 15:42:12 -07:00
Patrick Palka
000a7d66ec kernel/printk/printk.c: fix faulty logic in the case of recursive printk
We shouldn't set text_len in the code path that detects printk recursion
because text_len corresponds to the length of the string inside textbuf.
A few lines down from the line

    text_len = strlen(recursion_msg);

is the line

    text_len += vscnprintf(text + text_len, ...);

So if printk detects recursion, it sets text_len to 29 (the length of
recursion_msg) and logs an error.  Then the message supplied by the
caller of printk is stored inside textbuf but offset by 29 bytes.  This
means that the output of the recursive call to printk will contain 29
bytes of garbage in front of it.

This defect is caused by commit 458df9fd48 ("printk: remove separate
printk_sched buffers and use printk buf instead") which turned the line

    text_len = vscnprintf(text, ...);

into

    text_len += vscnprintf(text + text_len, ...);

To fix this, this patch avoids setting text_len when logging the printk
recursion error.  This patch also marks unlikely() the branch leading up
to this code.
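
A user-space model of the formatting step shows the effect. Everything here
is a sketch: format_message() and its 'set_len_on_recursion' switch are
hypothetical, snprintf() stands in for vscnprintf(), and the buffer is
assumed to be large enough for the message.

```c
#include <stdio.h>
#include <string.h>

static const char recursion_msg[] = "BUG: recent printk recursion!";

/* Hypothetical model of the formatting step. 'set_len_on_recursion' selects
 * the old behavior. Returns the length the caller would log from buf. */
size_t format_message(char *buf, size_t size, int recursion,
		      int set_len_on_recursion, const char *msg)
{
	size_t text_len = 0;

	if (recursion && set_len_on_recursion)
		text_len = strlen(recursion_msg);	/* nothing was written to buf */

	/* Appends at the offset, so stale bytes before it become part of the text. */
	text_len += (size_t)snprintf(buf + text_len, size - text_len, "%s", msg);
	return text_len;
}
```

With the old behavior selected, the returned length covers 29 bytes of
whatever was already in the buffer followed by the caller's message; with
it disabled, the message starts at offset 0.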

Fixes: 458df9fd48 ("printk: remove separate printk_sched buffers and use printk buf instead")
Signed-off-by: Patrick Palka <patrick@parcs.ath.cx>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10 15:42:12 -07:00
Daniel Borkmann
b954d83421 net: bpf: only build bpf_jit_binary_{alloc, free}() when jit selected
Since BPF JIT depends on the availability of module_alloc() and
module_free() helpers (HAVE_BPF_JIT and MODULES), we better build
that code only in case we have BPF_JIT in our config enabled, just
like with other JIT code. Fixes builds for arm/marzen_defconfig
and sh/rsk7269_defconfig.

====================
kernel/built-in.o: In function `bpf_jit_binary_alloc':
/home/cwang/linux/kernel/bpf/core.c:144: undefined reference to `module_alloc'
kernel/built-in.o: In function `bpf_jit_binary_free':
/home/cwang/linux/kernel/bpf/core.c:164: undefined reference to `module_free'
make: *** [vmlinux] Error 1
====================

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 738cbe72ad ("net: bpf: consolidate JIT binary allocator")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-10 14:05:07 -07:00
Andreea-Cristina Bernat
fb5a613b4f kernel: trace_syscalls: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
The uses of "rcu_assign_pointer()" are NULLing out the pointers.
According to RCU_INIT_POINTER()'s block comment:
"1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
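
The difference in overhead can be pictured with a user-space analogy in C11
atomics. This is only an analogy, not the kernel macros: publishing a
pointer to initialized data needs ordering so readers see the data, while
storing NULL publishes nothing that readers could dereference. All names
below are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct config { int value; };

static _Atomic(struct config *) current_config;

/* Analogue of publishing a new object: release ordering makes the
 * object's contents visible before the pointer itself. */
void publish_config(struct config *c)
{
	atomic_store_explicit(&current_config, c, memory_order_release);
}

/* Analogue of NULLing the pointer: no data is being published, so a
 * relaxed store is enough in this sketch. */
void clear_config(void)
{
	atomic_store_explicit(&current_config, NULL, memory_order_relaxed);
}

struct config *read_config(void)
{
	return atomic_load_explicit(&current_config, memory_order_acquire);
}
```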

The following Coccinelle semantic patch was used:
@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)

Link: http://lkml.kernel.org/p/20140822142822.GA32391@ada

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10 10:48:47 -04:00
Steven Rostedt (Red Hat)
fef5aeeee9 ftrace: Replace tramp_hash with old_*_hash to save space
Allowing function callbacks to declare their own trampolines requires
that each ftrace_ops that has a trampoline must have some sort of
accounting that keeps track of which ops has a trampoline attached
to a record.

The easy way to solve this was to add a "tramp_hash" that created a
hash entry for every function that an ops uses with a trampoline.
But since we can have literally tens of thousands of functions being
traced, that means we need tens of thousands of descriptors to map
the ops to the function in the hash. This is quite expensive and
can cause enabling and disabling the function graph tracer to take
some time to start and stop. It can take up to several seconds to
disable or enable all functions in the function graph tracer for this
reason.

The better approach, albeit more complex, is to keep track of how ops
are being enabled and disabled, and use that along with the counting
of the number of ops attached to records, to determine which ops has
a trampoline attached to a record at enabling and disabling of
tracing.

To do this, the tramp_hash has been replaced with an old_filter_hash
and old_notrace_hash, which get the copy of the ops filter_hash and
notrace_hash respectively. The old hashes are kept until the ops has
been modified or removed and the old hashes are used with the logic
of the accounting to determine the ops that have the trampoline of
a record. The reason this has less of a footprint is due to the trick
that an "empty" hash in the filter_hash means "all functions" and
an empty hash in the notrace hash means "no functions" in the hash.

This is much more efficient, doesn't have the delay, and takes up
much less memory, as we do not need to map all the functions but
just figure out which functions are mapped at the time it is
enabled or disabled.
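
The "empty hash" convention mentioned above can be sketched as a small
predicate. The types and names are hypothetical: a flat array of addresses
stands in for an ftrace hash, and the logic only models the convention, not
the trampoline accounting itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ftrace hash: a flat array of addresses. */
struct addr_set {
	const unsigned long *addrs;
	size_t count;			/* 0 means the set is empty */
};

static bool set_contains(const struct addr_set *s, unsigned long ip)
{
	for (size_t i = 0; i < s->count; i++)
		if (s->addrs[i] == ip)
			return true;
	return false;
}

/* Empty filter set: trace everything. Empty notrace set: exclude nothing. */
bool ops_traces_ip(const struct addr_set *filter,
		   const struct addr_set *notrace, unsigned long ip)
{
	bool wanted = filter->count == 0 || set_contains(filter, ip);
	bool excluded = notrace->count != 0 && set_contains(notrace, ip);

	return wanted && !excluded;
}
```

An ops that traces every function is thus described by two empty sets
instead of one entry per function, which is where the memory saving comes
from.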

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10 10:48:45 -04:00
Steven Rostedt (Red Hat)
e1effa0144 ftrace: Annotate the ops operation on update
Add three new flags for ftrace_ops:

  FTRACE_OPS_FL_ADDING
  FTRACE_OPS_FL_REMOVING
  FTRACE_OPS_FL_MODIFYING

These will be set for the ftrace_ops when they are first added
to the function tracing, being removed from function tracing
or just having their functions changed from function tracing,
respectively.

This will be needed to remove the tramp_hash, which can grow quite
big. The tramp_hash is used to note what functions a ftrace_ops
is using a trampoline for. Denoting which ftrace_ops is being
modified, will allow us to use the ftrace_ops hashes themselves,
which are much smaller as they have a global flag to denote if
a ftrace_ops is tracing all functions, as well as a notrace hash
if the ftrace_ops is tracing all but a few. The tramp_hash just
creates a hash item for every function, which can go into the 10s
of thousands if all functions are using the ftrace_ops trampoline.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10 10:48:44 -04:00
Steven Rostedt (Red Hat)
5fecaa044a ftrace: Grab any ops for a rec for enabled_functions output
When dumping the enabled_functions, use the first op that is
found with a trampoline to the record, as there should only be
one, as only one ops can be registered to a function that has
a trampoline.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10 10:48:43 -04:00
Steven Rostedt (Red Hat)
3296fc4e25 ftrace: Remove freeing of old_hash from ftrace_hash_move()
ftrace_hash_move() currently frees the old hash that is passed to it
after replacing the pointer with the new hash. Instead of having the
function do that chore, have the caller perform the free.

This lets the ftrace_hash_move() be used a bit more freely, which
is needed for changing the way the trampoline logic is done.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10 10:48:42 -04:00
Steven Rostedt (Red Hat)
f7aad4e1a8 ftrace: Set callback to ftrace_stub when no ops are registered
The clean up that adds the helper function ftrace_ops_get_func()
caused the default function to not change when DYNAMIC_FTRACE was not
set and no ftrace_ops were registered. Although static tracing is
not very useful (not having DYNAMIC_FTRACE set), it is still supported
and we don't want to break it.

Clean up the if statement even more to specifically have the default
function call ftrace_stub when no ftrace_ops are registered. This
fixes the small bug for static tracing as well as makes the code a
bit more understandable.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10 10:48:18 -04:00
Daniel Borkmann
738cbe72ad net: bpf: consolidate JIT binary allocator
Introduced in commit 314beb9bca ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") and later on replicated in aa2d2c73c2
("s390/bpf,jit: address randomize and write protect jit code") for
s390 architecture, write protection for BPF JIT images got added and
a random start address of the JIT code, so that it's not on a page
boundary anymore.

Since both use a very similar allocator for the BPF binary header,
we can consolidate this code into the BPF core as it's mostly JIT
independent anyway.

This will also allow for future archs that support DEBUG_SET_MODULE_RONX
to just reuse instead of reimplementing it.

JIT tested on x86_64 and s390x with BPF test suite.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 16:58:56 -07:00
Steven Rostedt (Red Hat)
8735405988 ftrace: Add helper function ftrace_ops_get_func()
Add a helper function that returns what the mcount trampoline is to call
for a ftrace_ops function. This helper will be used by arch code
in the future to set up dynamic trampolines. But as this does the
same tests that are performed in choosing what function to call for
the default mcount trampoline, might as well use it to clean up
the existing code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-09 19:26:06 -04:00
Alexei Starovoitov
02ab695bb3 net: filter: add "load 64-bit immediate" eBPF instruction
Add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
All previous instructions were 8-byte. This is the first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as a single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].code = 0
insn[1].imm = upper 32-bit
All unused fields must be zero.
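
The two-slot encoding listed above can be sketched in user-space C. The `struct bpf_insn` here is a simplified stand-in with the same 8-byte layout, and the helper names `emit_ld_imm64` and `decode_ld_imm64` are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Opcode pieces for the wide load (class, size, mode). */
#define BPF_LD  0x00
#define BPF_DW  0x18
#define BPF_IMM 0x00

/* Simplified stand-in for the kernel's 8-byte struct bpf_insn. */
struct bpf_insn {
    uint8_t code;        /* opcode */
    uint8_t dst_reg : 4; /* destination register */
    uint8_t src_reg : 4; /* source register */
    int16_t off;         /* signed offset */
    int32_t imm;         /* signed immediate */
};

/* Fill two consecutive slots so that together they load 'value' into 'dst'. */
static inline void emit_ld_imm64(struct bpf_insn insn[2], uint8_t dst,
                                 uint64_t value)
{
    /* All unused fields must be zero, so start from a clean slate. */
    memset(insn, 0, 2 * sizeof(insn[0]));

    insn[0].code = BPF_LD | BPF_DW | BPF_IMM;
    insn[0].dst_reg = dst & 0xf;
    insn[0].imm = (int32_t)(uint32_t)value;         /* lower 32 bits */
    insn[1].imm = (int32_t)(uint32_t)(value >> 32); /* upper 32 bits */
}

/* Reassemble the 64-bit immediate from the two slots. */
static inline uint64_t decode_ld_imm64(const struct bpf_insn insn[2])
{
    return (uint64_t)(uint32_t)insn[0].imm |
           ((uint64_t)(uint32_t)insn[1].imm << 32);
}
```

The casts through `uint32_t` keep the lower half from being sign-extended into the upper half when the two parts are recombined.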

Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.

x64 JITs it as a single 'movabsq $imm64, %rax'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn

Note that old eBPF programs are binary compatible with new interpreter.

It helps eBPF programs load 64-bit constant into a register with one
instruction instead of using two registers and 4 instructions:
BPF_MOV32_IMM(R1, imm32)
BPF_ALU64_IMM(BPF_LSH, R1, 32)
BPF_MOV32_IMM(R2, imm32)
BPF_ALU64_REG(BPF_OR, R1, R2)
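
To see why the four-instruction sequence above yields the same result as a single 64-bit load, its register effects can be modelled in plain C. This is only a model of the arithmetic, not interpreter code; the function names are hypothetical, and it assumes (as eBPF 32-bit ALU operations do) that a 32-bit move zero-extends into the 64-bit register.

```c
#include <stdint.h>

/* Model of BPF_MOV32_IMM: write a 32-bit value, upper 32 bits become zero. */
static inline uint64_t model_mov32_imm(uint32_t imm)
{
    return (uint64_t)imm;
}

/* Model of the two-register, four-instruction way to build a 64-bit constant. */
static inline uint64_t model_load64_four_insns(uint64_t value)
{
    uint64_t r1, r2;

    r1 = model_mov32_imm((uint32_t)(value >> 32)); /* BPF_MOV32_IMM(R1, upper) */
    r1 <<= 32;                                     /* BPF_ALU64_IMM(BPF_LSH, R1, 32) */
    r2 = model_mov32_imm((uint32_t)value);         /* BPF_MOV32_IMM(R2, lower) */
    r1 |= r2;                                      /* BPF_ALU64_REG(BPF_OR, R1, R2) */

    return r1;
}
```

The model makes the cost visible: a scratch register (R2) and three extra instructions, both of which the single 16-byte instruction avoids.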

User space generated programs will use this instruction to load constants only.

To tell kernel that user space needs a pointer the _pseudo_ variant of
this instruction may be added later, which will use extra bits of encoding
to indicate what type of pointer user space is asking kernel to provide.
For example 'off' or 'src_reg' fields can be used for such purpose.
src_reg = 1 could mean that user space is asking kernel to validate and
load in-kernel map pointer.
src_reg = 2 could mean that user space needs readonly data section pointer
src_reg = 3 could mean that user space needs a pointer to per-cpu local data
All such future pseudo instructions will not be carrying the actual pointer
as part of the instruction, but rather will be treated as a request to kernel
to provide one. The kernel will verify the request_for_a_pointer, then
will drop _pseudo_ marking and will store actual internal pointer inside
the instruction, so the end result is the interpreter and JITs never
see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
loads 64-bit immediate into a register. User space never operates on direct
pointers and verifier can easily recognize request_for_pointer vs other
instructions.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 10:26:47 -07:00
Steven Rostedt (Red Hat)
f1ff6348b3 ftrace: Add separate function for non recursive callbacks
Instead of using the generic list function for callbacks that
are not recursive, call a new helper function from the mcount
trampoline called ftrace_ops_recur_func() that will do the recursion
checking for the callback.

This eliminates an indirection and will also help future code
that uses dynamically allocated trampolines.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-09 10:26:48 -04:00