Pull powerpc non-virtualized cpuidle from Ben Herrenschmidt:
"This is the branch I mentioned in my other pull request which contains
our improved cpuidle support for the "powernv" platform
(non-virtualized).
It adds support for the "fast sleep" feature of the processor which
provides higher power savings than our usual "nap" mode but at the
cost of losing the timers while asleep, and thus exploits the new
timer broadcast framework to work around that limitation.
It's based on a tip timer tree that you seem to have already merged"
* 'powernv-cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
cpuidle/powernv: Parse device tree to setup idle states
cpuidle/powernv: Add "Fast-Sleep" CPU idle state
powerpc/powernv: Add OPAL call to resync timebase on wakeup
powerpc/powernv: Add context management for Fast Sleep
powerpc: Split timer_interrupt() into timer handling and interrupt handling routines
powerpc: Implement tick broadcast IPI as a fixed IPI message
powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
Pull main powerpc updates from Ben Herrenschmidt:
"This time around, the powerpc merges are going to be a little bit more
complicated than usual.
This is the main pull request with most of the work for this merge
window. I will describe it a bit more further down.
There is some additional cpuidle driver work, however I haven't
included it in this tree as it depends on some work in tip/timer-core
which Thomas accidentally forgot to put in a topic branch. Since I
didn't want to carry all of that tip timer stuff in powerpc -next, I
setup a separate branch on top of Thomas tree with just that cpuidle
driver in it, and Stephen has been carrying that in next separately
for a while now. I'll send a separate pull request for it.
Additionally, two new pieces in this tree add users for a sysfs API
that Tejun and Greg have been deprecating in drivers-core-next.
Thankfully Greg reverted the patch that removes the old API so this
merge can happen cleanly, but once merged, I will send a patch
adjusting our new code to the new API so that Greg can send you the
removal patch.
Now as for the content of this branch, we have a lot of perf work for
power8 new counters including support for our new "nest" counters
(also called 24x7) under pHyp (not natively yet).
We have new functionality when running under the OPAL firmware
(non-virtualized or KVM host), such as access to the firmware error
logs and service processor dumps, system parameters and sensors, along
with a hwmon driver for the latter.
There's also a bunch of bug fixes accross the board, some LE fixes,
and a nice set of selftests for validating our various types of copy
loops.
On the Freescale side, we see mostly new chip/board revisions, some
clock updates, better support for machine checks and debug exceptions,
etc..."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (70 commits)
powerpc/book3s: Fix CFAR clobbering issue in machine check handler.
powerpc/compat: 32-bit little endian machine name is ppcle, not ppc
powerpc/le: Big endian arguments for ppc_rtas()
powerpc: Use default set of netfilter modules (CONFIG_NETFILTER_ADVANCED=n)
powerpc/defconfigs: Enable THP in pseries defconfig
powerpc/mm: Make sure a local_irq_disable prevent a parallel THP split
powerpc: Rate-limit users spamming kernel log buffer
powerpc/perf: Fix handling of L3 events with bank == 1
powerpc/perf/hv_{gpci, 24x7}: Add documentation of device attributes
powerpc/perf: Add kconfig option for hypervisor provided counters
powerpc/perf: Add support for the hv 24x7 interface
powerpc/perf: Add support for the hv gpci (get performance counter info) interface
powerpc/perf: Add macros for defining event fields & formats
powerpc/perf: Add a shared interface to get gpci version and capabilities
powerpc/perf: Add 24x7 interface headers
powerpc/perf: Add hv_gpci interface header
powerpc: Add hvcalls for 24x7 and gpci (Get Performance Counter Info)
sysfs: create bin_attributes under the requested group
powerpc/perf: Enable BHRB access for EBB events
powerpc/perf: Add BHRB constraint and IFM MMCRA handling for EBB
...
Here's the big char/misc driver updates for 3.15-rc1.
Lots of various things here, including the new mcb driver subsystem.
All of these have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlM7ArIACgkQMUfUDdst+ylS+gCfcJr0Zo2v5aWnqD7rFtFETmFI
LhcAoNTQ4cvlVdxnI0driWCWFYxLj6at
=aj+L
-----END PGP SIGNATURE-----
Merge tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver patches from Greg KH:
"Here's the big char/misc driver updates for 3.15-rc1.
Lots of various things here, including the new mcb driver subsystem.
All of these have been in linux-next for a while"
* tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (118 commits)
extcon: Move OF helper function to extcon core and change function name
extcon: of: Remove unnecessary function call by using the name of device_node
extcon: gpio: Use SIMPLE_DEV_PM_OPS macro
extcon: palmas: Use SIMPLE_DEV_PM_OPS macro
mei: don't use deprecated DEFINE_PCI_DEVICE_TABLE macro
mei: amthif: fix checkpatch error
mei: client.h fix checkpatch errors
mei: use cl_dbg where appropriate
mei: fix Unnecessary space after function pointer name
mei: report consistently copy_from/to_user failures
mei: drop pr_fmt macros
mei: make me hw headers private to me hw.
mei: fix memory leak of pending write cb objects
mei: me: do not reset when less than expected data is received
drivers: mcb: Fix build error discovered by 0-day bot
cs5535-mfgpt: Simplify dependencies
spmi: pm: drop bus-level PM suspend/resume routines
spmi: pmic_arb: make selectable on ARCH_QCOM
Drivers: hv: vmbus: Increase the limit on the number of pfns we can handle
pch_phub: Report error writing MAC back to user
...
Pull scheduler changes from Ingo Molnar:
"Bigger changes:
- sched/idle restructuring: they are WIP preparation for deeper
integration between the scheduler and idle state selection, by
Nicolas Pitre.
- add NUMA scheduling pseudo-interleaving, by Rik van Riel.
- optimize cgroup context switches, by Peter Zijlstra.
- RT scheduling enhancements, by Thomas Gleixner.
The rest is smaller changes, non-urgnt fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
sched: Clean up the task_hot() function
sched: Remove double calculation in fix_small_imbalance()
sched: Fix broken setscheduler()
sparc64, sched: Remove unused sparc64_multi_core
sched: Remove unused mc_capable() and smt_capable()
sched/numa: Move task_numa_free() to __put_task_struct()
sched/fair: Fix endless loop in idle_balance()
sched/core: Fix endless loop in pick_next_task()
sched/fair: Push down check for high priority class task into idle_balance()
sched/rt: Fix picking RT and DL tasks from empty queue
trace: Replace hardcoding of 19 with MAX_NICE
sched: Guarantee task priority in pick_next_task()
sched/idle: Remove stale old file
sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
cpuidle/arm64: Remove redundant cpuidle_idle_call()
cpuidle/powernv: Remove redundant cpuidle_idle_call()
sched, nohz: Exclude isolated cores from load balancing
sched: Fix select_task_rq_fair() description comments
workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
...
Pull core locking updates from Ingo Molnar:
"The biggest change is the MCS spinlock generalization changes from Tim
Chen, Peter Zijlstra, Jason Low et al. There's also lockdep
fixes/enhancements from Oleg Nesterov, in particular a false negative
fix related to lockdep_set_novalidate_class() usage"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
locking/mutex: Fix debug checks
locking/mutexes: Add extra reschedule point
locking/mutexes: Introduce cancelable MCS lock for adaptive spinning
locking/mutexes: Unlock the mutex without the wait_lock
locking/mutexes: Modify the way optimistic spinners are queued
locking/mutexes: Return false if task need_resched() in mutex_can_spin_on_owner()
locking: Move mcs_spinlock.h into kernel/locking/
m68k: Skip futex_atomic_cmpxchg_inatomic() test
futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test
Revert "sched/wait: Suppress Sparse 'variable shadowing' warning"
lockdep: Change lockdep_set_novalidate_class() to use _and_name
lockdep: Change mark_held_locks() to check hlock->check instead of lockdep_no_validate
lockdep: Don't create the wrong dependency on hlock->check == 0
lockdep: Make held_lock->check and "int check" argument bool
locking/mcs: Allow architecture specific asm files to be used for contended case
locking/mcs: Order the header files in Kbuild of each architecture in alphabetical order
sched/wait: Suppress Sparse 'variable shadowing' warning
hung_task/Documentation: Fix hung_task_warnings description
locking/mcs: Allow architectures to hook in to contended paths
locking/mcs: Micro-optimize the MCS code, add extra comments
...
Freescale updates from Scott. Mostly support for critical
and machine check exceptions on 64-bit BookE, some new
PCI suspend/resume work and misc bits.
While checking powersaving mode in machine check handler at 0x200, we
clobber CFAR register. Fix it by saving and restoring it during beq/bgt.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I noticed this when testing setarch. No, we don't magically
support a big endian userspace on a little endian kernel.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The previous commit added constraint and register handling to allow
processes using EBB (Event Based Branches) to request access to the BHRB
(Branch History Rolling Buffer).
With that in place we can allow processes using EBB to access the BHRB.
This is achieved by setting BHRBA in MMCR0 when we enable EBB access. We
must also clear BHRBA when we are disabling.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some power8 revisions have a hardware bug where we can lose a PMU
exception, this commit adds a workaround to detect the bad condition and
rectify the situation.
See the comment in the commit for a full description.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some power8 revisions have a hardware bug where we can lose a
Performance Monitor (PMU) exception under certain circumstances.
We will be adding a workaround for this case, see the next commit for
details. The observed behaviour is that writing PMAO doesn't cause an
exception as we would expect, hence the name of the feature.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently the sysrq ShowRegs command does not print any PMU registers as
we have an empty definition for perf_event_print_debug(). This patch
defines perf_event_print_debug() to print various PMU registers.
Example output:
CPU: 0 PMU registers, ppmu = POWER7 n_counters = 6
PMC1: 00000000 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000
PMC5: 00000000 PMC6: 00000000 PMC7: deadbeef PMC8: deadbeef
MMCR0: 0000000080000000 MMCR1: 0000000000000000 MMCRA: 0f00000001000000
SIAR: 0000000000000000 SDAR: 0000000000000000 SIER: 0000000000000000
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Fix 32 bit build and rework formatting for compactness]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch enables fetching of various platform sensor data through
OPAL and expects a sensor handle from the driver to pass to OPAL.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch enables reading and updating of system parameters through
OPAL call.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds support for notifying the clients of their request
completion. Clients request for the token before making OPAL call
and then wait for the response.
This patch uses messaging infrastructure to pull the data to linux
by registering itself for the message type OPAL_MSG_ASYNC_COMP.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add special state saving for critical and machine check exceptions.
Most of this code could be used to handle debug exceptions taken from
kernel space, but actually doing so is outside the scope of this patch.
The various critical and machine check exceptions now point to their
real handlers, rather than hanging the kernel.
Signed-off-by: Scott Wood <scottwood@freescale.com>
While bolted handlers (including e6500) do not need to deal with a TLB
miss recursively causing another TLB miss, nested TLB misses can still
happen with crit/mc/debug exceptions -- so we still need to honor
SPRG_TLB_EXFRAME.
We don't need to spend time modifying it in the TLB miss fastpath,
though -- the special level exception will handle that.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: kvm-ppc@vger.kernel.org
Previously SPRG3 was marked for use by both VDSO and critical
interrupts (though critical interrupts were not fully implemented).
In commit 8b64a9dfb0 ("powerpc/booke64:
Use SPRG0/3 scratch for bolted TLB miss & crit int"), Mihai Caraman
made an attempt to resolve this conflict by restoring the VDSO value
early in the critical interrupt, but this has some issues:
- It's incompatible with EXCEPTION_COMMON which restores r13 from the
by-then-overwritten scratch (this cost me some debugging time).
- It forces critical exceptions to be a special case handled
differently from even machine check and debug level exceptions.
- It didn't occur to me that it was possible to make this work at all
(by doing a final "ld r13, PACA_EXCRIT+EX_R13(r13)") until after
I made (most of) this patch. :-)
It might be worth investigating using a load rather than SPRG on return
from all exceptions (except TLB misses where the scratch never leaves
the SPRG) -- it could save a few cycles. Until then, let's stick with
SPRG for all exceptions.
Since we cannot use SPRG4-7 for scratch without corrupting the state of
a KVM guest, move VDSO to SPRG7 on book3e. Since neither SPRG4-7 nor
critical interrupts exist on book3s, SPRG3 is still used for VDSO
there.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: kvm-ppc@vger.kernel.org
Once special level interrupts are supported, we may take nested TLB
misses -- so allow the same thread to acquire the lock recursively.
The lock will not be effective against the nested TLB miss handler
trying to write the same entry as the interrupted TLB miss handler, but
that's also a problem on non-threaded CPUs that lack TLB write
conditional. This will be addressed in the patch that enables crit/mc
support by invalidating the TLB on return from level exceptions.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Remove mc_capable() and smt_capable(). Neither is used.
Both were added by 5c45bf279d ("sched: mc/smt power savings sched
policy"). Uses of both were removed by 8e7fbcbc22 ("sched: Remove stale
power aware scheduling remnants and dysfunctional knobs").
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/20140304210737.16893.54289.stgit@bhelgaas-glaptop.roam.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This enables support for userspace to fetch and initiate FSP and
Platform dumps from the service processor (via firmware) through sysfs.
Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Flow:
- We register for OPAL notification events.
- OPAL sends new dump available notification.
- We make information on dump available via sysfs
- Userspace requests dump contents
- We retrieve the dump via OPAL interface
- User copies the dump data
- userspace sends ack for dump
- We send ACK to OPAL.
sysfs files:
- We add the /sys/firmware/opal/dump directory
- echoing 1 (well, anything, but in future we may support
different dump types) to /sys/firmware/opal/dump/initiate_dump
will initiate a dump.
- Each dump that we've been notified of gets a directory
in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex,
as this is what's used elsewhere to identify the dump).
- Each dump has files: id, type, dump and acknowledge
dump is binary and is the dump itself.
echoing 'ack' to acknowledge (currently any string will do) will
acknowledge the dump and it will soon after disappear from sysfs.
OPAL APIs:
- opal_dump_init()
- opal_dump_info()
- opal_dump_read()
- opal_dump_ack()
- opal_dump_resend_notification()
Currently we are only ever notified for one dump at a time (until
the user explicitly acks the current dump, then we get a notification
of the next dump), but this kernel code should "just work" when OPAL
starts notifying us of all the dumps present.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Based on a patch by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
This patch adds support to read error logs from OPAL and export
them to userspace through a sysfs interface.
We export each log entry as a directory in /sys/firmware/opal/elog/
Currently, OPAL will buffer up to 128 error log records, we don't
need to have any knowledge of this limit on the Linux side as that
is actually largely transparent to us.
Each error log entry has the following files: id, type, acknowledge, raw.
Currently we just export the raw binary error log in the 'raw' attribute.
In a future patch, we may parse more of the error log to make it a bit
easier for userspace (e.g. to be able to display a brief summary in
petitboot without having to have a full parser).
If we have >128 logs from OPAL, we'll only be notified of 128 until
userspace starts acknowledging them. This limitation may be lifted in
the future and with this patch, that should "just work" from the linux side.
A userspace daemon should:
- wait for error log entries using normal mechanisms (we announce creation)
- read error log entry
- save error log entry safely to disk
- acknowledge the error log entry
- rinse, repeat.
On the Linux side, we read the error log when we're notified of it. This
possibly isn't ideal as it would be better to only read them on-demand.
However, this doesn't really work with current OPAL interface, so we
read the error log immediately when notified at the moment.
I've tested this pretty extensively and am rather confident that the
linux side of things works rather well. There is currently an issue with
the service processor side of things for >128 error logs though.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
pHyp can change cache nodes for suspend/resume operation. Currently the
device tree is updated by drmgr in userspace after all non boot CPUs are
enabled. Hence, we do not modify the cache list based on the latest cache
nodes. Also we do not remove cache entries for the primary CPU.
This patch removes the cache list for the boot CPU, updates the device tree
before enabling nonboot CPUs and adds cache list for the boot cpu.
This patch also has the side effect that older versions of drmgr will
perform a second device tree update from userspace. While this is a
redundant waste of a couple cycles it is harmless since firmware returns the
same data for the subsequent update-nodes/properties rtas calls.
Signed-off-by: Haren Myneni <hbabu@us.ibm.com>
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The memory remove code for powerpc/pseries should call remove_memory()
so that we are holding the hotplug_memory lock during memory remove
operations.
This patch updates the memory node remove handler to call remove_memory()
and adds a ppc_md.remove_memory() entry to handle pseries specific work
that is called from arch_remove_memory().
During memory remove in pseries_remove_memblock() we have to stay with
removing memory one section at a time. This is needed because of how memory
resources are handled. During memory add for pseries (via the probe file in
sysfs) we add memory one section at a time which gives us a memory resource
for each section. Future patches will aim to address this so will not have
to remove memory one section at a time.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Detect and recover from machine check when inside opal on a special
scom load instructions. On specific SCOM read via MMIO we may get a machine
check exception with SRR0 pointing inside opal. To recover from MC
in this scenario, get a recovery instruction address and return to it from
MC.
OPAL will export the machine check recoverable ranges through
device tree node mcheck-recoverable-ranges under ibm,opal:
# hexdump /proc/device-tree/ibm,opal/mcheck-recoverable-ranges
0000000 0000 0000 3000 2804 0000 000c 0000 0000
0000010 3000 2814 0000 0000 3000 27f0 0000 000c
0000020 0000 0000 3000 2814 xxxx xxxx xxxx xxxx
0000030 llll llll yyyy yyyy yyyy yyyy
...
...
#
where:
xxxx xxxx xxxx xxxx = Starting instruction address
llll llll = Length of the address range.
yyyy yyyy yyyy yyyy = recovery address
Each recoverable address range entry is (start address, len,
recovery address), 2 cells each for start and recovery address, 1 cell for
len, totalling 5 cells per entry. During kernel boot time, build up the
recovery table with the list of recovery ranges from device-tree node which
will be used during machine check exception to recover from MMIO SCOM UE.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During "Fast-sleep" and deeper power savings state, decrementer and
timebase could be stopped making it out of sync with rest
of the cores in the system.
Add a firmware call to request platform to resync timebase
using low level platform methods.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state
respectively just like we do in the NAP idle state.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[Changelog modified by Preeti U. Murthy <preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.
Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots, (which are
precious on powerpc, since only 4 of those slots are available).
So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The OPAL firmware functions opal_xscom_read and opal_xscom_write
take a 64-bit argument for the XSCOM (PCB) address in order to
support the indirect mode on P8.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.13]
The new ELFv2 little-endian ABI increases the stack redzone -- the
area below the stack pointer that can be used for storing data --
from 288 bytes to 512 bytes. This means that we need to allow more
space on the user stack when delivering a signal to a 64-bit process.
To make the code a bit clearer, we define new USER_REDZONE_SIZE and
KERNEL_REDZONE_SIZE symbols in ptrace.h. For now, we leave the
kernel redzone size at 288 bytes, since increasing it to 512 bytes
would increase the size of interrupt stack frames correspondingly.
Gcc currently only makes use of 288 bytes of redzone even when
compiling for the new little-endian ABI, and the kernel cannot
currently be compiled with the new ABI anyway.
In the future, hopefully gcc will provide an option to control the
amount of redzone used, and then we could reduce it even more.
This also changes the code in arch_compat_alloc_user_space() to
preserve the expanded redzone. It is not clear why this function would
ever be used on a 64-bit process, though.
Signed-off-by: Paul Mackerras <paulus@samba.org>
CC: <stable@vger.kernel.org> [v3.13]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Freescale IFC controller has been used for mpc8xxx. It will be used
for ARM-based SoC as well. This patch moves the driver to driver/memory
and fix the header file includes.
Also remove module_platform_driver() and instead call
platform_driver_register() from subsys_initcall() to make sure this module
has been loaded before MTD partition parsing starts.
Signed-off-by: Prabhakar Kushwaha <prabhakar@freescale.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The patch cleans up variable eeh_subsystem_enabled so that we needn't
refer the variable directly from external. Instead, we will use
function eeh_enabled() and eeh_set_enable() to operate the variable.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
perf is failing to resolve symbols in the VDSO. A while (1)
gettimeofday() loop shows:
93.99% [vdso] [.] 0x00000000000005e0
3.12% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18
2.81% test [.] main
The reason for this is that we are linking our VDSO shared libraries
at 1MB, which is a little weird. Even though this is uncommon, Alan
points out that it is valid and we should probably fix perf userspace.
Regardless, I can't see a reason why we are doing this. The code
is all position independent and we never rely on the VDSO ending
up at 1M (and we never place it there on 64bit tasks).
Changing our link address to 0x0 fixes perf VDSO symbol resolution:
73.18% [vdso] [.] 0x000000000000060c
12.39% [vdso] [.] __kernel_gettimeofday
3.58% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18
2.94% [vdso] [.] __kernel_datapage_offset
2.90% test [.] main
We still have some local symbol resolution issues that will be
fixed in a subsequent patch.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions when using
a hash table MMU for various reasons (the flush is handled as part of
the PTE modification when necessary).
ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.
Additionally ppc64 require the tlb flushing to be batched within ptl locks.
The reason to do that is to ensure that the hash page table is in sync with
linux page table.
We track the hpte index in linux pte and if we clear them without flushing
hash and drop the ptl lock, we can have another cpu update the pte and can
end up with duplicate entry in the hash table, which is fatal.
We also want to keep set_pte_at simpler by not requiring them to do hash
flush for performance reason. We do that by assuming that set_pte_at() is
never *ever* called on a PTE that is already valid.
This was the case until the NUMA code went in which broke that assumption.
Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
using set_pte_at() and a powerpc specific one using the appropriate
mechanism needed to keep the hash table in sync.
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
pte_update() is a powerpc-ism used to change the bits of a PTE
when the access permission is being restricted (a flush is
potentially needed).
It uses atomic operations on when needed and handles the hash
synchronization on hash based processors.
It is currently only used to clear PTE bits and so the current
implementation doesn't provide a way to also set PTE bits.
The new _PAGE_NUMA bit, when set, is actually restricting access
so it must use that function too, so this change adds the ability
for pte_update() to also set bits.
We will use this later to set the _PAGE_NUMA bit.
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds the support for to create a direct iommu "bypass"
window on IODA2 bridges (such as Power8) allowing to bypass iommu
page translation completely for 64-bit DMA capable devices, thus
significantly improving DMA performances.
Additionally, this adds a hook to the struct iommu_table so that
the IOMMU API / VFIO can disable the bypass when external ownership
is requested, since in that case, the device will be used by an
environment such as userspace or a KVM guest which must not be
allowed to bypass translations.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On p8 systems, with relocation on exception feature enabled we are seeing
kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
feature enabled, exception are raised with MMU (IR=DR=1) ON with the
default offset of 0xc*4000. Since exception is raised in virtual mode it
requires the vector region to be executable without which it fails to
fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
is loaded at real 0, the htab mappings sets the entire kernel text region
executable. But for relocatable kernel (e.g. kdump case) we only copy
interrupt vectors down to real 0 and never marked that region as
executable because in p7 and below we always get exception in real mode.
This patch fixes this issue by marking htab mapping range as executable
that overlaps with the interrupt vector region for relocatable kernel.
Thanks to Ben who helped me to debug this issue and find the root cause.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch allows each architecture to add its specific assembly optimized
arch_mcs_spin_lock_contended and arch_mcs_spinlock_uncontended for
MCS lock and unlock functions.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: AswinChandramouleeswaran <aswin@hp.com>
Cc: George Spelvin <linux@horizon.com>
Cc: Rik vanRiel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: MichelLespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Figo.zhang" <figo1802@gmail.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew R Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390347382.3138.67.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We perform a clean up of the Kbuid files in each architecture.
We order the files in each Kbuild in alphabetical order
by running the below script.
for i in arch/*/include/asm/Kbuild
do
cat $i | gawk '/^generic-y/ {
i = 3;
do {
for (; i <= NF; i++) {
if ($i == "\\") {
getline;
i = 1;
continue;
}
if ($i != "")
hdr[$i] = $i;
}
break;
} while (1);
next;
}
// {
print $0;
}
END {
n = asort(hdr);
for (i = 1; i <= n; i++)
print "generic-y += " hdr[i];
}' > ${i}.sorted;
mv ${i}.sorted $i;
done
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew R Wilcox <matthew.r.wilcox@intel.com>
Cc: AswinChandramouleeswaran <aswin@hp.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Figo.zhang" <figo1802@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: George Spelvin <linux@horizon.com>
Cc: MichelLespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Fixed build bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
two s390 guest features that need some handling in the host,
and all the PPC changes. The PPC changes include support for
little-endian guests and enablement for new POWER8 features.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJS6UF5AAoJEBvWZb6bTYby55kP/AgTJnyu7avN653/2aSHkjkx
KgYSMYhZPIFoY5LyZuNetXaoXFRvCykux1VYSZ6V6s35h2PZ+hdJNbHGjFYKPGTq
FQ92xQVNuWCAPxmFCjDNuDV/0BauG5y08/Orh/jpjz+GAfH43LruUQGbtXUuyJ8u
vf+yTHniU5gguqsAmodqjHUgbf+GoPJ1j7hmRoWwt8IWm7Ns3v/IK4l0p6G0h26a
RjE6aK+Tm208Yr5hD/dRAqeTbBNt3c4xub+QPsKoiEMaZBSuAOiux7D3Kx+If1gp
WsmqEQxoymihVtkZhUFO9ONLJepvmG2QwJVVyMSUW9iqxX9rraXsvVyVMwcQAhog
JuOAYxBftH07xu6Fs4eym5KvCFghM+EaJvxxt+kgnvdD4htK1+eK5trntc2zygSi
/qGiIrkqjXpkskW8kujLayF0eAU3CrZvFWveEPBfFgYiOGX/2wzJCtSm/bt9Jo0M
v60qgNFK3LNqAyeEfnm9VtlwGr6ZgsAB6DHNPX4fM5s2IBjL+qloXk/e/+aVKkW0
I3yeRdy/ExhLAab6w81JtMeR7G3YS0UNuAEVvcoxzNb5wIBY8qnpfUzTKyMxQR94
64EVpxWEYO1s55eCCyMujWrSvc+YAwhJcWHGKgC4K7mxxLD3FVyQXX6YZvgRozMX
HjQju+DToj9CskyrFlRL
=yd0Z
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"Second batch of KVM updates. Some minor x86 fixes, two s390 guest
features that need some handling in the host, and all the PPC changes.
The PPC changes include support for little-endian guests and
enablement for new POWER8 features"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (45 commits)
x86, kvm: correctly access the KVM_CPUID_FEATURES leaf at 0x40000101
x86, kvm: cache the base of the KVM cpuid leaves
kvm: x86: move KVM_CAP_HYPERV_TIME outside #ifdef
KVM: PPC: Book3S PR: Cope with doorbell interrupts
KVM: PPC: Book3S HV: Add software abort codes for transactional memory
KVM: PPC: Book3S HV: Add new state for transactional memory
powerpc/Kconfig: Make TM select VSX and VMX
KVM: PPC: Book3S HV: Basic little-endian guest support
KVM: PPC: Book3S HV: Add support for DABRX register on POWER7
KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells
KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8
KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs
KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap
KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8
KVM: PPC: Book3S HV: Add handler for HV facility unavailable
KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8
KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs
KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
KVM: PPC: Book3S HV: Don't set DABR on POWER8
kvm/ppc: IRQ disabling cleanup
...
smt-snooze-delay was designed to disable NAP state or delay the entry
to the NAP state prior to adoption of cpuidle framework. This
is per-cpu variable. With the coming of CPUIDLE framework,
states can be disabled on per-cpu basis using the cpuidle/enable
sysfs entry.
Also, with the coming of cpuidle driver each state's target residency
is per-driver unlike earlier which was per-device. Therefore,
the per-cpu sysfs smt-snooze-delay which decides the target residency
of the idle state on a particular cpu causes more confusion to the user
as we cannot have different smt-snooze-delay (target residency)
values for each cpu.
In the current code, smt-snooze-delay functionality is completely broken.
It makes sense to remove smt-snooze-delay from idle driver with the
coming of cpuidle framework.
However, sysfs files are retained as ppc64_util currently
utilises it. Once we fix ppc64_util, propose to clean
up the kernel code.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Move the file from arch specific pseries/processor_idle.c
to drivers/cpuidle/cpuidle-pseries.c
Make the relevant Makefile and Kconfig changes.
Also, introduce Kconfig.powerpc in drivers/cpuidle
for all powerpc cpuidle drivers.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It seems that forward declaration couldn't work well with typedef, use
struct spinlock directly to avoiding following build errors:
In file included from include/linux/spinlock.h:81,
from include/linux/seqlock.h:35,
from include/linux/time.h:5,
from include/uapi/linux/timex.h:56,
from include/linux/timex.h:56,
from include/linux/sched.h:17,
from arch/powerpc/kernel/asm-offsets.c:17:
include/linux/spinlock_types.h:76: error: redefinition of typedef 'spinlock_t'
/root/linux-next/arch/powerpc/include/asm/pgtable-ppc64.h:563: note: previous declaration of 'spinlock_t' was here
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>