When running a PAPR guest, we need to handle a few hypercalls in kernel space,
most prominently the page table invalidation (to sync the shadows).
So this patch adds handling for a few PAPR hypercalls to PR mode KVM. I tried
to share the code with HV mode, but it ended up being a lot easier this way
around, as the two differ too much in those details.
Signed-off-by: Alexander Graf <agraf@suse.de>
---
v1 -> v2:
- whitespace fix
Until now, we always set HIOR based on the PVR, but this is just wrong.
Instead, we should be setting HIOR explicitly, so user space can decide
what the initial HIOR value is - just like on real hardware.
We keep the old PVR based way around for backwards compatibility, but
once user space uses the SREGS based method, we drop the PVR logic.
Signed-off-by: Alexander Graf <agraf@suse.de>
When running a PAPR guest, some things change. The privilege level drops
from hypervisor to supervisor, SDR1 gets treated differently and we interpret
hypercalls. For bisectability sake, add the flag now, but only enable it when
all the support code is there.
Signed-off-by: Alexander Graf <agraf@suse.de>
We need the compute_tlbie_rb in _pr and _hv implementations for papr
soon, so let's move it over to a common header file that both
implementations can leverage.
Signed-off-by: Alexander Graf <agraf@suse.de>
OPAL can handle various interrupt for us such as Machine Checks (it
performs all sorts of recovery tasks and passes back control to us with
informations about the error), Hardware Management Interrupts and Softpatch
interrupts.
This wires up the mechanisms and prints out specific informations returned
by HAL when a machine check occurs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
OPAL handles HW access to the various ICS or equivalent chips
for us (with the exception of p5ioc2 based HEA which uses a
different backend) similarily to what RTAS does on pSeries.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Implements OPAL RTC and NVRAM support and wire all that up to
the powernv platform.
We use RTAS for RTC as a fallback if available. Using RTAS for nvram
is not supported yet, pending some rework/cleanup and generalization
of the pSeries & CHRP code. We also use RTAS fallbacks for power off
and reboot
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a udbg and an hvc console backend for supporting a console
using the OPAL console interfaces.
On OPAL v1 we have hvc0 mapped to whatever console the system was
configured for (network or hvsi serial port) via the service
processor.
On OPAL v2 we have hvcN mapped to the Nth console provided by OPAL
which generally corresponds to:
hvc0 : network console (raw protocol)
hvc1 : serial port S1 (hvsi)
hvc2 : serial port S2 (hvsi)
Note: At this point, early debug console only works with OPAL v1
and shouldn't be enabled in a normal kernel.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add definition of OPAL interfaces along with the wrappers to call
into OPAL runtime and the early device-tree parsing hook to locate
the OPAL runtime firmware.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On machines supporting the OPAL firmware version 1, the system
is initially booted under pHyp. We then use a special hypercall
to verify if OPAL is available and if it is, we then trigger
a "takeover" which disables pHyp and loads the OPAL runtime
firmware, giving control to the kernel in hypervisor mode.
This patch add the necessary code to detect that the OPAL takeover
capability is present when running under PowerVM (aka pHyp) and
perform said takeover to get hypervisor control of the processor.
To perform the takeover, we must first use RTAS (within Open
Firmware runtime environment) to start all processors & threads,
in order to give control to OPAL on all of them. We then call
the takeover hypercall on everybody, OPAL will re-enter the kernel
main entry point passing it a flat device-tree.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds more generic support for doing CPU hotplug with a simple
idle loop and no actual reset of the processors. The generic
smp_generic_kick_cpu() does the hotplug bringup trick if the PACA
shows that the CPU has already been started at boot and we provide
an accessor for the CPU state.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we echo an address the hypervisor doesn't like to
/sys/devices/system/memory/probe we oops the box:
# echo 0x10000000000 > /sys/devices/system/memory/probe
kernel BUG at arch/powerpc/mm/hash_utils_64.c:541!
The backtrace is:
create_section_mapping
arch_add_memory
add_memory
memory_probe_store
sysdev_class_store
sysfs_write_file
vfs_write
SyS_write
In create_section_mapping we BUG if htab_bolt_mapping returned
an error. A better approach is to return an error which will
propagate back to userspace.
Rerunning the test with this patch applied:
# echo 0x10000000000 > /sys/devices/system/memory/probe
-bash: echo: write error: Invalid argument
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have two identical definitions of RECLAIM_DISTANCE, looks like
the patch got applied twice. Remove one.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On big POWER7 boxes we see large amounts of CPU time in system
processes like workqueue and watchdog kernel threads.
We currently rebalance the entire machine each time a task goes
idle and this is very expensive on large machines. Disable newidle
balancing at the node level and rely on the scheduler tick to
rebalance across nodes.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The largest POWER7 boxes have 32 nodes. SD_NODES_PER_DOMAIN groups
nodes into chunks of 16 and adds a global balancing domain
(SD_ALLNODES) above it.
If we bump SD_NODES_PER_DOMAIN to 32, then we avoid this extra
level of balancing on our largest boxes.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When chasing a performance issue on ppc64, I noticed tasks
communicating via a pipe would often end up on different nodes.
It turns out SD_WAKE_AFFINE is not set in our node defition. Commit
9fcd18c9e6 (sched: re-tune balancing) enabled SD_WAKE_AFFINE
in the node definition for x86 and we need a similar change for
ppc64.
I used lmbench lat_ctx and perf bench pipe to verify this fix. Each
benchmark was run 10 times and the average taken.
lmbench lat_ctx:
before: 66565 ops/sec
after: 204700 ops/sec
3.1x faster
perf bench pipe:
before: 5.6570 usecs
after: 1.3470 usecs
4.2x faster
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add a new udbg driver for the PS3 gelic Ehthernet device.
This driver shares only a few stucture and constant definitions with the
gelic Ethernet device driver, so is implemented as a stand-alone driver
with no dependencies on the gelic Ethernet device driver.
Signed-off-by: Hector Martin <hector@marcansoft.com>
Signed-off-by: Andre Heider <a.heider@gmail.com>
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Capture more than twice as much text from the printk buffer, and
compress it to fit it in the lnx,oops-log NVRAM partition. You
can view the compressed text using the new (as of July 20) --unzip
option of the nvram command in the powerpc-utils package.
[BenH: Added select of ZLIB_DEFLATE]
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is one place in the MPIC driver that assumes that the cores are numbered
from 0 to n-1. However, this is not true if the CPUs are not numbered
sequentially. This can happen on a eight-core SOC where cores two and three
are removed in the device tree. So instead of blindly looping, we iterate
over the discovered CPUs and use the SMP ID as the index.
This means that we no longer ask the MPIC how many CPUs there are, so
we also delete mpic->num_cpus.
We also catch if the number of CPUs in the SOC exceeds the number that the
MPIC supports. This should never happen, of course, but it's good to be
sure.
Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Enable hugepages on Freescale BookE processors. This allows the kernel to
use huge TLB entries to map pages, which can greatly reduce the number of
TLB misses and the amount of TLB thrashing experienced by applications with
large memory footprints. Care should be taken when using this on FSL
processors, as the number of large TLB entries supported by the core is low
(16-64) on current processors.
The supported set of hugepage sizes include 4m, 16m, 64m, 256m, and 1g.
Page sizes larger than the max zone size are called "gigantic" pages and
must be allocated on the command line (and cannot be deallocated).
This is currently only fully implemented for Freescale 32-bit BookE
processors, but there is some infrastructure in the code for
64-bit BooKE.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Now that the generic code has dma_map_ops set, instead of having a
messy ifdef & if block in the base dma_get_required_mask hook push
the computation into the dma ops.
If the ops fails to set the get_required_mask hook default to the
width of dma_addr_t.
This also corrects ibmbus ibmebus_dma_supported to require a 64
bit mask. I doubt anything is checking or setting the dma mask on
that bus.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: benh@kernel.crashing.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The hook dma_get_required_mask is supposed to return the mask required
by the platform to operate efficently. The generic version of
dma_get_required_mask in driver/base/platform.c returns a mask based
only on max_pfn. However, this is likely too big for iommu systems
and could be too small for platforms that require a dma offset or have
a secondary window at a high offset.
Override the default, provide a hook in ppc_md used by pseries lpar and
cell, and provide the default answer based on memblock_end_of_DRAM(),
with hooks for get_dma_offset, and provide an implementation for iommu
that looks at the defined table size. Coverting from the end address
to the required bit mask is based on the generic implementation.
The need for this was discovered when the qla2xxx driver switched to
64 bit dma then reverted to 32 bit when dma_get_required_mask said
32 bits was sufficient.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: benh@kernel.crashing.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These were missed in commit f5b9409973 "All Arch: remove linkage
for sys_nfsservctl system call" due to them having no sys_ prefix
(presumably).
Cc: NeilBrown <neilb@suse.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: James Bottomley <James.Bottomley@hansenpartnership.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ePAPR embedded hypervisor specification provides an API for "byte
channels", which are serial-like virtual devices for sending and receiving
streams of bytes. This driver provides Linux kernel support for byte
channels via three distinct interfaces:
1) An early-console (udbg) driver. This provides early console output
through a byte channel. The byte channel handle must be specified in a
Kconfig option.
2) A normal console driver. Output is sent to the byte channel designated
for stdout in the device tree. The console driver is for handling kernel
printk calls.
3) A tty driver, which is used to handle user-space input and output. The
byte channel used for the console is designated as the default tty.
Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch adds kexec support for PPC440 based chipsets. This work is based
on the KEXEC patches for FSL BookE.
The FSL BookE patch and the code flow could be found at the link below:
http://patchwork.ozlabs.org/patch/49359/
Steps:
1) Invalidate all the TLB entries except the one this code is run from
2) Create a tmp mapping for our code in the other address space and jump to it
3) Invalidate the entry we used
4) Create a 1:1 mapping for 0-2GiB in blocks of 256M
5) Jump to the new 1:1 mapping and invalidate the tmp mapping
I have tested this patches on Ebony, Sequoia boards and Virtex on QEMU.
You need kexec-tools commit e8b7939b1e or newer for ppc440x support,
available at:
git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
We are seeing boot failures on some very large boxes even with
commit b5416ca9f8 (powerpc: Move kdump default base address to
64MB on 64bit).
This patch halves the RMO so both kernels get about the same
amount of RMO memory. On large machines this region will be
at least 256MB, so each kernel will get 128MB.
We cap it at 256MB (small SLB size) since some early allocations need
to be in the bolted SLB region. We could relax this on machines with
1TB SLBs in a future patch.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
One definition of PV_POWER7 seems enough to me.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I hit an oops at boot on the first instruction of timer_cpu_notify:
NIP [c000000000722f88] .timer_cpu_notify+0x0/0x388
The code should look like:
c000000000722f78: eb e9 00 30 ld r31,48(r9)
c000000000722f7c: 2f bf 00 00 cmpdi cr7,r31,0
c000000000722f80: 40 9e ff 44 bne+ cr7,c000000000722ec4
c000000000722f84: 4b ff ff 74 b c000000000722ef8
c000000000722f88 <.timer_cpu_notify>:
c000000000722f88: 7c 08 02 a6 mflr r0
c000000000722f8c: 2f a4 00 07 cmpdi cr7,r4,7
c000000000722f90: fb c1 ff f0 std r30,-16(r1)
c000000000722f94: fb 61 ff d8 std r27,-40(r1)
But the oops output shows:
eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 7c0803a6 ebe1fff8 4e800020
00000000 ebe90030 c0000000 00ad0a28 00000000 2fa40007 fbc1fff0 fb61ffd8
So we scribbled over our instructions with c000000000ad0a28, which
is an address inside the jump_table ELF section.
It turns out the jump_table section is only aligned to 8 bytes but
we are aligning our entries within the section to 16 bytes. This
means our entries are offset from the table:
c000000000acd4a8 <__start___jump_table>:
...
c000000000ad0a10: c0 00 00 00 lfs f0,0(0)
c000000000ad0a14: 00 70 cd 5c .long 0x70cd5c
c000000000ad0a18: c0 00 00 00 lfs f0,0(0)
c000000000ad0a1c: 00 70 cd 90 .long 0x70cd90
c000000000ad0a20: c0 00 00 00 lfs f0,0(0)
c000000000ad0a24: 00 ac a4 20 .long 0xaca420
And the jump table sort code gets very confused and writes into the
wrong spot. Remove the alignment, and also remove the padding since
we it saves some space and we shouldn't need it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add a cast in case the caller passes in a different type, as it would
if mtspr/mtmsr were functions.
Previously, if a 64-bit type was passed in on 32-bit, GCC would bind the
constraint to a pair of registers, and would substitute the first register
in the pair in the asm code. This corresponds to the upper half of the
64-bit register, which is generally not the desired behavior.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* 'next/cross-platform' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc:
ARM: Consolidate the clkdev header files
ARM: set vga memory base at run-time
ARM: convert PCI defines to variables
ARM: pci: make pcibios_assign_all_busses use pci_has_flag
ARM: remove unnecessary mach/hardware.h includes
pci: move microblaze and powerpc pci flag functions into asm-generic
powerpc: rename ppc_pci_*_flags to pci_*_flags
Fix up conflicts in arch/microblaze/include/asm/pci-bridge.h
After changing all consumers of atomics to include <linux/atomic.h>, we
ran into some compile time errors due to this dependency chain:
linux/atomic.h
-> asm/atomic.h
-> asm-generic/atomic-long.h
where atomic-long.h could use funcs defined later in linux/atomic.h
without a prototype. This patches moves the code that includes
asm-generic/atomic*.h to linux/atomic.h.
Archs that need <asm-generic/atomic64.h> need to select
CONFIG_GENERIC_ATOMIC64 from now on (some of them used to include it
unconditionally).
Compile tested on i386 and x86_64 with allnoconfig.
Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is in preparation for more generic atomic primitives based on
__atomic_add_unless.
Signed-off-by: Arun Sharma <asharma@fb.com>
Signed-off-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The majority of architectures implement ext2 atomic bitops as
test_and_{set,clear}_bit() without spinlock.
This adds this type of generic implementation in ext2-atomic-setbit.h and
use it wherever possible.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ poleg@redhat.com: no need to declare show_regs() in ptrace.h, sched.h does this ]
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (99 commits)
drivers/virt: add missing linux/interrupt.h to fsl_hypervisor.c
powerpc/85xx: fix mpic configuration in CAMP mode
powerpc: Copy back TIF flags on return from softirq stack
powerpc/64: Make server perfmon only built on ppc64 server devices
powerpc/pseries: Fix hvc_vio.c build due to recent changes
powerpc: Exporting boot_cpuid_phys
powerpc: Add CFAR to oops output
hvc_console: Add kdb support
powerpc/pseries: Fix hvterm_raw_get_chars to accept < 16 chars, fixing xmon
powerpc/irq: Quieten irq mapping printks
powerpc: Enable lockup and hung task detectors in pseries and ppc64 defeconfigs
powerpc: Add mpt2sas driver to pseries and ppc64 defconfig
powerpc: Disable IRQs off tracer in ppc64 defconfig
powerpc: Sync pseries and ppc64 defconfigs
powerpc/pseries/hvconsole: Fix dropped console output
hvc_console: Improve tty/console put_chars handling
powerpc/kdump: Fix timeout in crash_kexec_wait_realmode
powerpc/mm: Fix output of total_ram.
powerpc/cpufreq: Add cpufreq driver for Momentum Maple boards
powerpc: Correct annotations of pmu registration functions
...
Fix up trivial Kconfig/Makefile conflicts in arch/powerpc, drivers, and
drivers/cpufreq
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...
Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)
* 'timers-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
mips: Fix i8253 clockevent fallout
i8253: Cleanup outb/inb magic
arm: Footbridge: Use common i8253 clockevent
mips: Use common i8253 clockevent
x86: Use common i8253 clockevent
i8253: Create common clockevent implementation
i8253: Export i8253_lock unconditionally
pcpskr: MIPS: Make config dependencies finer grained
pcspkr: Cleanup Kconfig dependencies
i8253: Move remaining content and delete asm/i8253.h
i8253: Consolidate definitions of PIT_LATCH
x86: i8253: Consolidate definitions of global_clock_event
i8253: Alpha, PowerPC: Remove unused asm/8253pit.h
alpha: i8253: Cleanup remaining users of i8253pit.h
i8253: Remove I8253_LOCK config
i8253: Make pcsp sound driver use the shared i8253_lock
i8253: Make pcspkr input driver use the shared i8253_lock
i8253: Consolidate all kernel definitions of i8253_lock
i8253: Unify all kernel declarations of i8253_lock
i8253: Create linux/i8253.h and use it in all 8253 related files
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits)
perf: Remove the nmi parameter from the oprofile_perf backend
x86, perf: Make copy_from_user_nmi() a library function
perf: Remove perf_event_attr::type check
x86, perf: P4 PMU - Fix typos in comments and style cleanup
perf tools: Make test use the preset debugfs path
perf tools: Add automated tests for events parsing
perf tools: De-opt the parse_events function
perf script: Fix display of IP address for non-callchain path
perf tools: Fix endian conversion reading event attr from file header
perf tools: Add missing 'node' alias to the hw_cache[] array
perf probe: Support adding probes on offline kernel modules
perf probe: Add probed module in front of function
perf probe: Introduce debuginfo to encapsulate dwarf information
perf-probe: Move dwarf library routines to dwarf-aux.{c, h}
perf probe: Remove redundant dwarf functions
perf probe: Move strtailcmp to string.c
perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END
tracing/kprobe: Update symbol reference when loading module
tracing/kprobes: Support module init function probing
kprobes: Return -ENOENT if probe point doesn't exist
...
* 'of-pci' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
pci/of: Consolidate pci_bus_to_OF_node()
pci/of: Consolidate pci_device_to_OF_node()
x86/devicetree: Use generic PCI <-> OF matching
microblaze/pci: Move the remains of pci_32.c to pci-common.c
microblaze/pci: Remove powermac originated cruft
pci/of: Match PCI devices to OF nodes dynamically
An implementation of a code generator for BPF programs to speed up packet
filtering on PPC64, inspired by Eric Dumazet's x86-64 version.
Filter code is generated as an ABI-compliant function in module_alloc()'d mem
with stackframe & prologue/epilogue generated if required (simple filters don't
need anything more than an li/blr). The filter's local variables, M[], live in
registers. Supports all BPF opcodes, although "complicated" loads from negative
packet offsets (e.g. SKF_LL_OFF) are not yet supported.
There are a couple of further optimisations left for future work; many-pass
assembly with branch-reach reduction and a register allocator to push M[]
variables into volatile registers would improve the code quality further.
This currently supports big-endian 64-bit PowerPC only (but is fairly simple
to port to PPC32 or LE!).
Enabled in the same way as x86-64:
echo 1 > /proc/sys/net/core/bpf_jit_enable
Or, enabled with extra debug output:
echo 2 > /proc/sys/net/core/bpf_jit_enable
Signed-off-by: Matt Evans <matt@ozlabs.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All these are instances of
#define NAME value;
or
#define NAME(params_opt) value;
These of course fail to build when used in contexts like
if(foo $OP NAME)
while(bar $OP NAME)
and may silently generate the wrong code in contexts such as
foo = NAME + 1; /* foo = value; + 1; */
bar = NAME - 1; /* bar = value; - 1; */
baz = NAME & quux; /* baz = value; & quux; */
Reported on comp.lang.c,
Message-ID: <ab0d55fe-25e5-482b-811e-c475aa6065c3@c29g2000yqd.googlegroups.com>
Initial analysis of the dangers provided by Keith Thompson in that thread.
There are many more instances of more complicated macros having unnecessary
trailing semicolons, but this pile seems to be all of the cases of simple
values suffering from the problem. (Thus things that are likely to be found
in one of the contexts above, more complicated ones aren't.)
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Move separate microblaze and powerpc pci flag functions pci_set_flags,
pci_add_flags, and pci_has_flag into asm-generic/pci-bridge.h so other
archs can use them.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Michal Simek <monstr@monstr.eu>
This renames pci flags functions and enums in preparation for creating
generic version in asm-generic/pci-bridge.h. The following search and
replace is done:
s/ppc_pci_/pci_/
s/PPC_PCI_/PCI_/
Direct accesses to ppc_pci_flag variable are replaced with helper
functions.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Since other OS's may be running on the other cores don't use tlbivax
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Commit c8f729d408 (KVM: PPC: Deliver program interrupts right away instead
of queueing them) made away with all users of prog_flags, so we can just
remove it from the headers.
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds support for running KVM guests in supervisor mode on those
PPC970 processors that have a usable hypervisor mode. Unfortunately,
Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
1), but the YDL PowerStation does have a usable hypervisor mode.
There are several differences between the PPC970 and POWER7 in how
guests are managed. These differences are accommodated using the
CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
bits. Notably, on PPC970:
* The LPCR, LPID or RMOR registers don't exist, and the functions of
those registers are provided by bits in HID4 and one bit in HID0.
* External interrupts can be directed to the hypervisor, but unlike
POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
SRR0/1 not HSRR0/1.
* There is no virtual RMA (VRMA) mode; the guest must use an RMO
(real mode offset) area.
* The TLB entries are not tagged with the LPID, so it is necessary to
flush the whole TLB on partition switch. Furthermore, when switching
partitions we have to ensure that no other CPU is executing the tlbie
or tlbsync instructions in either the old or the new partition,
otherwise undefined behaviour can occur.
* The PMU has 8 counters (PMC registers) rather than 6.
* The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.
* The SLB has 64 entries rather than 32.
* There is no mediated external interrupt facility, so if we switch to
a guest that has a virtual external interrupt pending but the guest
has MSR[EE] = 0, we have to arrange to have an interrupt pending for
it so that we can get control back once it re-enables interrupts. We
do that by sending ourselves an IPI with smp_send_reschedule after
hard-disabling interrupts.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
indicate that we have a usable hypervisor mode, and another to indicate
that the processor conforms to PowerISA version 2.06. We also add
another bit to indicate that the processor conforms to ISA version 2.01
and set that for PPC970 and derivatives.
Some PPC970 chips (specifically those in Apple machines) have a
hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
is not useful in the sense that there is no way to run any code in
supervisor mode (HV=0 PR=0). On these processors, the LPES0 and LPES1
bits in HID4 are always 0, and we use that as a way of detecting that
hypervisor mode is not useful.
Where we have a feature section in assembly code around code that
only applies on POWER7 in hypervisor mode, we use a construct like
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
The definition of END_FTR_SECTION_IFSET is such that the code will
be enabled (not overwritten with nops) only if all bits in the
provided mask are set.
Note that the CPU feature check in __tlbie() only needs to check the
ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
if we are running bare-metal, i.e. in hypervisor mode.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.
Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.
KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.
This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.
Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.
This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.
Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This improves I/O performance for guests using the PAPR
paravirtualization interface by making the H_PUT_TCE hcall faster, by
implementing it in real mode. H_PUT_TCE is used for updating virtual
IOMMU tables, and is used both for virtual I/O and for real I/O in the
PAPR interface.
Since this moves the IOMMU tables into the kernel, we define a new
KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The
ioctl returns a file descriptor which can be used to mmap the newly
created table. The qemu driver models use them in the same way as
userspace managed tables, but they can be updated directly by the
guest with a real-mode H_PUT_TCE implementation, reducing the number
of host/guest context switches during guest IO.
There are certain circumstances where it is useful for userland qemu
to write to the TCE table even if the kernel H_PUT_TCE path is used
most of the time. Specifically, allowing this will avoid awkwardness
when we need to reset the table. More importantly, we will in the
future need to write the table in order to restore its state after a
checkpoint resume or migration.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds the infrastructure for handling PAPR hcalls in the kernel,
either early in the guest exit path while we are still in real mode,
or later once the MMU has been turned back on and we are in the full
kernel context. The advantage of handling hcalls in real mode if
possible is that we avoid two partition switches -- and this will
become more important when we support SMT4 guests, since a partition
switch means we have to pull all of the threads in the core out of
the guest. The disadvantage is that we can only access the kernel
linear mapping, not anything vmalloced or ioremapped, since the MMU
is off.
This also adds code to handle the following hcalls in real mode:
H_ENTER Add an HPTE to the hashed page table
H_REMOVE Remove an HPTE from the hashed page table
H_READ Read HPTEs from the hashed page table
H_PROTECT Change the protection bits in an HPTE
H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
H_SET_DABR Set the data address breakpoint register
Plus code to handle the following hcalls in the kernel:
H_CEDE Idle the vcpu until an interrupt or H_PROD hcall arrives
H_PROD Wake up a ceded vcpu
H_REGISTER_VPA Register a virtual processor area (VPA)
The code that runs in real mode has to be in the base kernel, not in
the module, if KVM is compiled as a module. The real-mode code can
only access the kernel linear mapping, not vmalloc or ioremap space.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
There are several fields in struct kvmppc_book3s_shadow_vcpu that
temporarily store bits of host state while a guest is running,
rather than anything relating to the particular guest or vcpu.
This splits them out into a new kvmppc_host_state structure and
modifies the definitions in asm-offsets.c to suit.
On 32-bit, we have a kvmppc_host_state structure inside the
kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
to get to them both with one pointer. On 64-bit they are separate
fields in the PACA. This means that on 64-bit we don't need to
copy the kvmppc_host_state in and out on vcpu load/unload, and
in future will mean that the book3s_hv code doesn't need a
shadow_vcpu struct in the PACA at all. That does mean that we
have to be careful not to rely on any values persisting in the
hstate field of the paca across any point where we could block
or get preempted.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
In hypervisor mode, the LPCR controls several aspects of guest
partitions, including virtual partition memory mode, and also controls
whether the hypervisor decrementer interrupts are enabled. This sets
up LPCR at boot time so that guest partitions will use a virtual real
memory area (VRMA) composed of 16MB large pages, and hypervisor
decrementer interrupts are disabled.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
calls in powerpc.c, this moves them down into the subarch-specific
book3s_pr.c and booke.c. This eliminates an extra local_irq_enable()
call in book3s_pr.c, and will be needed for when we do SMT4 guest
support in the book3s hypervisor mode code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This arranges for the top-level arch/powerpc/kvm/powerpc.c file to
pass down some of the calls it gets to the lower-level subarchitecture
specific code. The lower-level implementations (in booke.c and book3s.c)
are no-ops. The coming book3s_hv.c will need this.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Instead of branching out-of-line with the DO_KVM macro to check if we
are in a KVM guest at the time of an interrupt, this moves the KVM
check inline in the first-level interrupt handlers. This speeds up
the non-KVM case and makes sure that none of the interrupt handlers
are missing the check.
Because the first-level interrupt handlers are now larger, some things
had to be move out of line in exceptions-64s.S.
This all necessitated some minor changes to the interrupt entry code
in KVM. This also streamlines the book3s_32 KVM test.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
In preparation for adding code to enable KVM to use hypervisor mode
on 64-bit Book 3S processors, this splits book3s.c into two files,
book3s.c and book3s_pr.c, where book3s_pr.c contains the code that is
specific to running the guest in problem state (user mode) and book3s.c
contains code which should apply to all Book 3S processors.
In doing this, we abstract some details, namely the interrupt offset,
updating the interrupt pending flag, and detecting if the guest is
in a critical section. These are all things that will be different
when we use hypervisor mode.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This moves the slb field, which represents the state of the emulated
SLB, from the kvmppc_vcpu_book3s struct to the kvm_vcpu_arch, and the
hpte_hash_[v]pte[_long] fields from kvm_vcpu_arch to kvmppc_vcpu_book3s.
This is in accord with the principle that the kvm_vcpu_arch struct
represents the state of the emulated CPU, and the kvmppc_vcpu_book3s
struct holds the auxiliary data structures used in the emulation.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Dynamically assign host PIDs to guest PIDs, splitting each guest PID into
multiple host (shadow) PIDs based on kernel/user and MSR[IS/DS]. Use
both PID0 and PID1 so that the shadow PIDs for the right mode can be
selected, that correspond both to guest TID = zero and guest TID = guest
PID.
This allows us to significantly reduce the frequency of needing to
invalidate the entire TLB. When the guest mode or PID changes, we just
update the host PID0/PID1. And since the allocation of shadow PIDs is
global, multiple guests can share the TLB without conflict.
Note that KVM does not yet support the guest setting PID1 or PID2 to
a value other than zero. This will need to be fixed for nested KVM
to work. Until then, we enforce the requirement for guest PID1/PID2
to stay zero by failing the emulation if the guest tries to set them
to something else.
Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Instead of a fully separate set of TLB entries, keep just the
pfn and dirty status.
Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This is a shared page used for paravirtualization. It is always present
in the guest kernel's effective address space at the address indicated
by the hypercall that enables it.
The physical address specified by the hypercall is not used, as
e500 does not have real mode.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This is in line with what other architectures do, and will allow us to
map things other than ordinary, unreserved kernel pages -- such as
dedicated devices, or large contiguous reserved regions.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This is done lazily. The SPE save will be done only if the guest has
used SPE since the last preemption or heavyweight exit. Restore will be
done only on demand, when enabling MSR_SPE in the shadow MSR, in response
to an SPE fault or mtmsr emulation.
For SPEFSCR, Linux already switches it on context switch (non-lazily), so
the only remaining bit is to save it between qemu and the guest.
Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Keep the guest MSR and the guest-mode true MSR separate, rather than
modifying the guest MSR on each guest entry to produce a true MSR.
Any bits which should be modified based on guest MSR must be explicitly
propagated from vcpu->arch.shared->msr to vcpu->arch.shadow_msr in
kvmppc_set_msr().
While we're modifying the guest entry code, reorder a few instructions
to bury some load latencies.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Previously, these macros hardcoded THREAD_EVR0 as the base of the save
area, relative to the base register passed. This base offset is now
passed as a separate macro parameter, allowing reuse with other SPE
save areas, such as used by KVM.
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Up until now, Book3S KVM had variables stored in the kernel that a kernel module
or the kvm code in the kernel could read from to figure out where some real mode
helper functions are located.
This is all unnecessary. The high bits of the EA get ignore in real mode, so we
can just use the pointer as is. Also, it's a lot easier on relocations when we
use the normal way of resolving the address to a function, instead of jumping
through hoops.
This patch fixes compilation with CONFIG_RELOCATABLE=y.
Signed-off-by: Alexander Graf <agraf@suse.de>
This is used to round-robin TLBCAM entries.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
The patch a8b0ca17b8 ("perf: Remove the nmi parameter from the swevent
and overflow interface") missed a spot in the ppc hw_breakpoint code,
fix this up so things compile again.
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-09pfip95g88s70iwkxu6nnbt@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The nmi parameter indicated if we could do wakeups from the current
context, if not, we would set some state and self-IPI and let the
resulting interrupt do the wakeup.
For the various event classes:
- hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
the PMI-tail (ARM etc.)
- tracepoint: nmi=0; since tracepoint could be from NMI context.
- software: nmi=[0,1]; some, like the schedule thing cannot
perform wakeups, and hence need 0.
As one can see, there is very little nmi=1 usage, and the down-side of
not using it is that on some platforms some software events can have a
jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).
The up-side however is that we can remove the nmi parameter and save a
bunch of conditionals in fast paths.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds support for the new "jump label" feature.
Unlike x86 and sparc we just merrily patch the code with no locks etc,
as far as I know this is safe, but I'm not really sure what the x86/sparc
code is protecting against so maybe it's not.
I also don't see any reason for us to implement the poke_early() routine,
even though sparc does.
[BenH: Updated the patch to upstream generic changes]
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
A mix of think & mismerge on my side caused a problem where both the
new hvsi_lib and the old hvsi driver gets compiled and try to define
symbols with the same name.
This fixes it by renaming the hvsi_lib exported symbols.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
a9c0f41b3a breaks the build
on some platforms. The extern declaration must be shielded
against assembly.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds a printk companion to replace the udbg progress function
when initmem is freed.
Suggested-by: Milton Miller <miltonm@bga.com>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Dave Carroll <dcarroll@astekcorp.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On pseries machines, consoles are provided by the hypervisor using
a low level get_chars/put_chars type interface. However, this is
really just a transport to the service processor which implements
them either as "raw" console (networked consoles, HMC, ...) or as
"hvsi" serial ports.
The later is a simple packet protocol on top of the raw character
interface that is supposed to convey additional "serial port" style
semantics. In practice however, all it does is provide a way to
read the CD line and set/clear our DTR line, that's it.
We currently implement the "raw" protocol as an hvc console backend
(/dev/hvcN) and the "hvsi" protocol using a separate tty driver
(/dev/hvsi0).
However this is quite impractical. The arbitrary difference between
the two type of devices has been a major source of user (and distro)
confusion. Additionally, there's an additional mini -hvsi implementation
in the pseries platform code for our low level debug console and early
boot kernel messages, which means code duplication, though that low
level variant is impractical as it's incapable of doing the initial
protocol negociation to establish the link to the FSP.
This essentially replaces the dedicated hvsi driver and the platform
udbg code completely by extending the existing hvc_vio backend used
in "raw" mode so that:
- It now supports HVSI as well
- We add support for hvc backend providing tiocm{get,set}
- It also provides a udbg interface for early debug and boot console
This is overall less code, though this will only be obvious once we
remove the old "hvsi" driver, which is still available for now. When
the old driver is enabled, the new code still kicks in for the low
level udbg console, replacing the old mini implementation in the platform
code, it just doesn't provide the higher level "hvc" interface.
In addition to producing generally simler code, this has several benefits
over our current situation:
- The user/distro only has to deal with /dev/hvcN for the hypervisor
console, avoiding all sort of confusion that has plagued us in the past
- The tty, kernel and low level debug console all use the same code
base which supports the full protocol establishment process, thus the
console is now available much earlier than it used to be with the
old HVSI driver. The kernel console works much earlier and udbg is
available much earlier too. Hackers can enable a hard coded very-early
debug console as well that works with HVSI (previously that was only
supported for the "raw" mode).
I've tried to keep the same semantics as hvsi relative to how I react
to things like CD changes, with some subtle differences though:
- I clear DTR on close if HUPCL is set
- Current hvsi triggers a hangup if it detects a up->down transition
on CD (you can still open a console with CD down). My new implementation
triggers a hangup if the link to the FSP is severed, and severs it upon
detecting a up->down transition on CD.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Embed the struct hvsi_header in the various packet definitions
rather than open coding it multiple times. Will help provide
stronger type checking.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This moves various HVSI protocol definitions from the hvsi.c
driver to a header file that can be used later on by a udbg
implementation
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This introduces pSeries_reconfig_notify() as a just wrapper of
blocking_notifier_call_chain() for pSeries_reconfig_chain.
This is a preparation to improvement of error code on reconfiguration
notifier failure.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On MMUs such as FSL where we can guarantee the entire linear mapping is
bolted, we don't need to worry about linear TLB misses. If on top of
that we do a full table walk, we get rid of all recursive TLB faults, and
can dispense with some state saving. This gains a few percent on
TLB-miss-heavy workloads, and around 50% on a benchmark that had a high
rate of virtual page table faults under the normal handler.
While touching the EX_TLB layout, remove EX_TLB_MMUCR0, EX_TLB_SRR0, and
EX_TLB_SRR1 as they're not used.
[BenH: Fixed build with 64K pages (wsp config)]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
commit 21a3c96 uses node_start/end_pfn(nid) for detection start/end
of nodes. But, it's not defined in linux/mmzone.h but defined in
/arch/???/include/mmzone.h which is included only under
CONFIG_NEED_MULTIPLE_NODES=y.
Then, we see
mm/page_cgroup.c: In function 'page_cgroup_init':
mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'
So, fixiing page_cgroup.c is an idea...
But node_start_pfn()/node_end_pfn() is a very generic macro and
should be implemented in the same manner for all archs.
(m32r has different implementation...)
This patch removes definitions of node_start/end_pfn() in each archs
and defines a unified one in linux/mmzone.h. It's not under
CONFIG_NEED_MULTIPLE_NODES, now.
A result of macro expansion is here (mm/page_cgroup.c)
for !NUMA
start_pfn = ((&contig_page_data)->node_start_pfn);
end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
for NUMA (x86-64)
start_pfn = ((node_data[nid])->node_start_pfn);
end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
Changelog:
- fixed to avoid using "nid" twice in node_end_pfn() macro.
Reported-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Reported-and-tested-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Freescale ePAPR reference hypervisor provides interrupt controller
services via a hypercall interface, instead of emulating the MPIC
controller. This is called the VMPIC.
The ePAPR "virtual interrupt controller" provides interrupt controller
services for external interrupts. External interrupts received by a
partition can come from two sources:
- Hardware interrupts - hardware interrupts come from external
interrupt lines or on-chip I/O devices.
- Virtual interrupts - virtual interrupts are generated by the hypervisor
as part of some hypervisor service or hypervisor-created virtual device.
Both types of interrupts are processed using the same programming model and
same set of hypercalls.
Signed-off-by: Ashish Kalra <ashish.kalra@freescale.com>
Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
ePAPR hypervisors provide operating system services via a "hypercall"
interface. The following steps need to be performed to make an hcall:
1. Load r11 with the hcall number
2. Load specific other registers with parameters
3. Issue instrucion "sc 1"
4. The return code is in r3
5. Other returned parameters are in other registers.
To provide this service to the kernel, these steps are wrapped in inline
assembly functions. Standard ePAPR hcalls are in epapr_hcalls.h, and
Freescale extensions are in fsl_hcalls.h.
Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Move irq_choose_cpu() into arch/powerpc/kernel/irq.c so that it can be used
by other PIC drivers. The function is not MPIC-specific.
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
We expect this is actually faster, and we end up needing more space than we
can get from the SPRGs in some instances. This is also useful when running
as a guest OS - SPRGs4-7 do not have guest versions.
8 slots are allocated in thread_info for this even though we only actually
use 4 of them - this allows space for future code to have more scratch
space (and we know we'll need it for things like hugetlb).
Signed-off-by: Ashish Kalra <Ashish.Kalra@freescale.com>
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
doorbell type is defined as bits 32:36 so should be shifted by 63-36 =
27 rather than 28.
We never noticed this bug as we've only every used type PPC_DBELL = 0.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
smp_release_cpus() waits for all cpus (including the bootcpu) due to an
off-by-one count on boot_cpu_count (which is all CPUs). This patch replaces
that with spinning_secondaries (which is all secondary CPUs).
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Several fixes as well where the +1 was missing.
Done via coccinelle scripts like:
@@
struct resource *ptr;
@@
- ptr->end - ptr->start + 1
+ resource_size(ptr)
and some grep and typing.
Mostly uncompiled, no cross-compilers.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The generic code always get the device-node in the right place now
so a single implementation will work for all archs
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Michal Simek <monstr@monstr.eu>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>