There are three different code levels in regard to the identification
of guest samples. They differ in the way the LPP instruction is used.
1) Old kernels without the LPP instruction. The guest program parameter
is always zero.
2) Newer kernels load the process pid into the program parameter with LPP.
The guest program parameter is non-zero if the guest executes in a
process != idle.
3) The latest kernels load ((1UL << 31) | pid) with LPP to make the value
non-zero even for the idle task. The guest program parameter is non-zero
if the guest is running.
All kernels load the process pid to CR4 on context switch. The CPU sampling
code uses the value in CR4 to decide between guest and host samples in case
the guest program parameter is zero. The three cases:
1) CR4==pid, gpp==0
2) CR4==pid, gpp==pid
3) CR4==pid, gpp==((1UL << 31) | pid)
The load-control instruction to load the pid into CR4 is expensive and the
goal is to remove it. To distinguish the host CR4 from the guest pid for
the idle process the maximum value 0xffff for the PASN is used.
This adds a fourth case for a guest OS with an updated kernel:
4) CR4==0xffff, gpp=((1UL << 31) | pid)
The host kernel will have CR4==0xffff and will use (gpp!=0 || CR4!==0xffff)
to identify guest samples. This works nicely with all 4 cases, the only
possible issue would be a guest with an old kernel (gpp==0) and a process
pid of 0xffff. Well, don't do that..
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Move a struct definition to get rid of a forward declaration.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Users complained that they are hitting the limit of 64 functions (some
physical functions can spawn lots of virtual functions). Double the
default limit. With the latest savings in static data usage this
increases the image size by only 520 bytes.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Commit c506fff3d3 ("s390/pci: resize iomap") reduced the iomap
to NR_FUNCTIONS * PCI_BAR_COUNT elements. Since we only support
functions with 64bit BARs we can cut that number in half.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Address space identifiers are already defined in <asm/pci_insn.h>.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
barsize was never used. Get rid of it.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The 32-bit lctl instruction is quite a bit slower than the 64-bit
counter part lctlg. Use the faster instruction.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull vfio-ccw branch to add the basic channel I/O passthrough
intrastructure based on vfio.
The focus is on supporting dasd-eckd(cu_type/dev_type = 0x3990/0x3390)
as the target device.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Although Linux does not use format-0 channel command words (CCW0)
these are a non-optional part of the platform spec, and for the sake
of platform compliance, and possibly some non-Linux guests, we have
to support CCW0.
Making the kernel execute a format 0 channel program is too much hassle
because we would need to allocate and use memory which can be addressed
by 24 bit physical addresses (because of CCW0.cda). So we implement CCW0
support by translating the channel program into an equivalent CCW1
program instead.
Based upon an orginal patch by Kai Yue Wang.
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-16-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Add file Documentation/s390/vfio-ccw.txt that includes details
of vfio-ccw.
Acked-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-15-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
The current implementation doesn't check if the subchannel is in a
proper device state when handling an event. Let's introduce
a finite state machine to manage the state/event change.
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-14-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Introduce a singlethreaded workqueue to handle the I/O interrupts.
With the work added to this queue, we store the I/O results to the
io_region of the subchannel, then signal the userspace program to
handle the results.
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-13-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Realize VFIO_DEVICE_GET_IRQ_INFO ioctl to retrieve
VFIO_CCW_IO_IRQ information.
Realize VFIO_DEVICE_SET_IRQS ioctl to set an eventfd fd for
VFIO_CCW_IO_IRQ. Once a write operation to the ccw_io_region
was performed, trigger a signal on this fd.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20170317031743.40128-12-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Introduce VFIO_DEVICE_RESET ioctl for vfio-ccw to make it possible
to hot-reset the device.
We try to achieve a reset by first disabling the subchannel and
then enabling it again: this should clear all state at the subchannel.
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-11-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Introduce device information about vfio-ccw: VFIO_DEVICE_FLAGS_CCW.
Realize VFIO_DEVICE_GET_REGION_INFO ioctl for vfio-ccw.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20170317031743.40128-10-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
We implement the basic ccw command handling infrastructure
here:
1. Translate the ccw commands.
2. Issue the translated ccw commands to the device.
3. Once we get the execution result, update the guest SCSW
with it.
Acked-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-9-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
To provide user-space a set of interfaces to:
1. pass in a ccw program to perform an I/O operation.
2. read back I/O results of the completed I/O operations.
We introduce an MMIO region for the vfio-ccw device here.
This region is defined to content:
1. areas to store arguments that an ssch required.
2. areas to store the I/O results.
Using pwrite/pread to the device on this region, a user-space program
could write/read data to/from the vfio-ccw device.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-8-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
To make vfio support subchannel devices, we need to leverage the
mediated device framework to create a mediated device for the
subchannel device.
This registers the subchannel device to the mediated device
framework during probe to enable mediated device creation.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-7-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Introduce ccwchain structure and helper functions that can be used to
handle a channel program issued from a virtual machine.
The following limitations apply:
1. Supports only prefetch enabled mode.
2. Supports idal(c64) ccw chaining.
3. Supports 4k idaw.
4. Supports ccw1.
5. Supports direct ccw chaining by translating them to idal ccws.
CCW translation requires to leverage the vfio_(un)pin_pages interfaces
to pin/unpin sets of mem pages frequently. Currently we have a lack of
support to do this in an efficient way. So we introduce pfn_array data
structure and helper functions to handle pin/unpin operations here.
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-6-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
To make vfio support subchannel devices, we need a css driver for
the vfio subchannels. This patch adds a basic vfio-ccw subchannel
driver for this purpose.
To enable VFIO for vfio-ccw, enable S390_CCW_IOMMU config option
and configure VFIO as required.
Acked-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-5-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Define vfio-ccw device API strings. CCW vendor driver using mediated
device framework should use this string for device_api attribute.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20170317031743.40128-4-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Export the common I/O interfaces those are needed by an I/O
subchannel driver to actually talk to the subchannel.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Acked-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-3-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
For future code reuse purpose, this decouples the cio code with
the ccw device specific parts from ccw_device_cancel_halt_clear,
and makes a new common I/O interface named cio_cancel_halt_clear.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Acked-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-2-bjsdjshi@linux.vnet.ibm.com>
[CH: Fix typo]
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
The return code of hw_perf_event_update() is not evaluated by
its callers. Hence, simplify the function by removing the
return code.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Using a register variable for r4 is not necessary. Let the
compiler decide the register to be used.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make clear that the event definitions relate to the counter
facility (cf) and not to the sampling facility (sf).
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add the event names for the IBM z13/z13s specific CPU-MF counters.
Also improve the merging of the generic and model specific events
so that their sysfs attribute definitions completely reside in
memory. Hence, flagging the generic event attribute definitions
as initdata too.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Complete the IBM z13 support and support counters from the
MT-diagnostic counter set. Note that this counter set is
available only if SMT is enabled.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The validate_event() function just checked for reserved counters
in particular CPU-MF counter sets. Because the number of counters
in counter sets vary among different hardware models, remove the
explicit check to tolerate new models.
Reserved counters are not accounted and, thus, will return zero.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use the highest counter number that can be specified for the
ecctr (extract CPU counter) instruction for perf.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Deferred struct page initialization works on s390. However it makes
only sense for the fake numa case, since the kthreads that initialize
struct pages are started per node. Without fake numa there is just a
single node and therefore no gain.
However there is no reason to not enable this feature. Therefore
select the config option and enable the feature in all config files.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
gmap.c deals mostly with KVM-related memory management, so a lot
of changes to this file will come via the KVM tree. Reflect this
in MAINTAINERS. Please note that there are intricate ties to
arch/s390/mm/pgtable.c. If changes are needed in both files,
this will continue to be submitted via the s390 tree (or a
topic branch if necessary).
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Since linux v3.14 with commit 38dfac843c ("vmcore: prevent PT_NOTE
p_memsz overflow during header update") on s390 we get the following
message in the kdump kernel:
Warning: Exceeded p_memsz, dropping PT_NOTE entry n_namesz=0x6b6b6b6b,
n_descsz=0x6b6b6b6b
The reason for this is that we don't create a final zero note in
the ELF header which the proc/vmcore code uses to find out the end
of the notes section (see also kernel/kexec_core.c:final_note()).
It still worked on s390 by chance because we (most of the time?) have the
byte pattern 0x6b6b6b6b after the notes section which also makes the notes
parsing code stop in update_note_header_size_elf64() because 0x6b6b6b6b is
interpreded as note size:
if ((real_sz + sz) > max_sz) {
pr_warn("Warning: Exceeded p_memsz, dropping P ...);
break;
}
So fix this and add the missing final note to the ELF header.
We don't have to adjust the memory size for ELF header ("alloc_size")
because the new ELF note still fits into the 0x1000 base memory.
Cc: stable@vger.kernel.org # v4.4+
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
HAVE_ARCH_EARLY_PFN_TO_NID selects a not present Kconfig
option. Therefore remove it.
Given that the first call of early_pfn_to_nid() happens after
numa_setup() finished to establish the memory to node mapping, there
is no need to implement an architecture private version of
__early_pfn_to_nid() like (only) ia64 does.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On some z/VM systems the query host access command is not supported for
temp disks, though the corresponding feature code is set.
This does not have any impact beside that the information is not available.
Suppress the full blown command reject error messages to not confuse the
user. The error is still logged in the s390dbf.
Signed-off-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Some storage servers might not support the query host access feature.
Check if the corresponding feature code is set.
Signed-off-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
ptep_notify and gmap_shadow_notify both need a guest address and
therefore retrieve them from the available virtual host address.
As they operate on the same guest address, we can calculate it once
and then pass it on. As a gmap normally has more than one shadow gmap,
we also do not recalculate for each of them any more.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
There is no need for the __ASSEMBLY__ ifdefery anymore since the
architecture level set code that deals with facility bits was
converted to C in the meantime.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use the actual size of the facility list array within the lowcore
structure for the MAX_FACILITY_BIT define instead of a comment which
states what this is good for. This makes it a bit harder to break
things.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Provide the remaining stsi information via debugfs files. This also
might be useful for debugging purposes.
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Provide the raw stsi 15,1,x data contents via debugfs. This makes it
much easier to debug unexpected scheduling domains on machines that
provide cpu topology information.
Therefore this file adds a new 's390/stsi' debugfs directory with a
file for each possible topology nesting level that is allowed by the
architecture. The files will be created regardless if the machine
supports all, or any, level. If a level is not supported, or no data
is available, user space can recognize this with a -EINVAL error code
when trying to read such data.
In addition a 'topology' symlink is created that points to the file
that contains the data that is used to create the scheduling domains.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Introduce a top-level 's390' directory which should be used when
adding new s390 specific debug feature files and/or directories.
This makes hopefully sure that the contents of the s390 directory will
be a bit more structured. Right now we have a couple of top-level
files where it is not easy to tell to which subsystem they belong to.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
User space needs some information about the secure key(s)
before actually invoking the pkey and/or paes funcionality.
This patch introduces a new ioctl API and in kernel API to
verify the the secure key blob and give back some
information about the key (type, bitsize, old MKVP).
Both APIs are described in detail in the header files
arch/s390/include/asm/pkey.h and arch/s390/include/uapi/asm/pkey.h.
Signed-off-by: Harald Freudenberger <freude@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If running within a level 3 hypervisor, the hypervisor provides a
SYSIB block which contains a control program indentifier string. Use
this string instead of the simple KVM and z/VM strings only. In case
of z/VM this provides addtional information: the z/VM version.
The new string looks similar to this:
Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The arch description provided for the "Hardware name:" contains lots
of extra whitespace due to the way the SYSIB contents are defined
(strings aren't zero terminated).
This looks a bit odd and therefore remove the extra whitespace
characters. This also gives the opportunity to add more information,
if required, without hitting the magic 80 characters per line limit.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>