Mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git (synced 2024-11-24 13:50:52 +07:00)
vfio.txt: standardize document format

Each text file under Documentation follows a different format. Some
don't even have titles!

Change its representation to follow the adopted standard, using ReST
markups for it to be parseable by Sphinx:

- adjust title marks;
- use footnote marks;
- mark literal blocks;
- adjust indentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parent 2a26ed8e4a
commit c6f4d41338
@ -1,5 +1,7 @@
|
|||||||
VFIO - "Virtual Function I/O"[1]
|
==================================
|
||||||
-------------------------------------------------------------------------------
|
VFIO - "Virtual Function I/O" [1]_
|
||||||
|
==================================
|
||||||
|
|
||||||
Many modern system now provide DMA and interrupt remapping facilities
|
Many modern system now provide DMA and interrupt remapping facilities
|
||||||
to help ensure I/O devices behave within the boundaries they've been
|
to help ensure I/O devices behave within the boundaries they've been
|
||||||
allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
|
allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
|
||||||
@ -7,14 +9,14 @@ POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
|
|||||||
systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
|
systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
|
||||||
agnostic framework for exposing direct device access to userspace, in
|
agnostic framework for exposing direct device access to userspace, in
|
||||||
a secure, IOMMU protected environment. In other words, this allows
|
a secure, IOMMU protected environment. In other words, this allows
|
||||||
safe[2], non-privileged, userspace drivers.
|
safe [2]_, non-privileged, userspace drivers.
|
||||||
|
|
||||||
Why do we want that? Virtual machines often make use of direct device
|
Why do we want that? Virtual machines often make use of direct device
|
||||||
access ("device assignment") when configured for the highest possible
|
access ("device assignment") when configured for the highest possible
|
||||||
I/O performance. From a device and host perspective, this simply
|
I/O performance. From a device and host perspective, this simply
|
||||||
turns the VM into a userspace driver, with the benefits of
|
turns the VM into a userspace driver, with the benefits of
|
||||||
significantly reduced latency, higher bandwidth, and direct use of
|
significantly reduced latency, higher bandwidth, and direct use of
|
||||||
bare-metal device drivers[3].
|
bare-metal device drivers [3]_.
|
||||||
|
|
||||||
Some applications, particularly in the high performance computing
|
Some applications, particularly in the high performance computing
|
||||||
field, also benefit from low-overhead, direct device access from
|
field, also benefit from low-overhead, direct device access from
|
||||||
@ -31,7 +33,7 @@ KVM PCI specific device assignment code as well as provide a more
|
|||||||
secure, more featureful userspace driver environment than UIO.
|
secure, more featureful userspace driver environment than UIO.
|
||||||
|
|
||||||
Groups, Devices, and IOMMUs
|
Groups, Devices, and IOMMUs
|
||||||
-------------------------------------------------------------------------------
|
---------------------------
|
||||||
|
|
||||||
Devices are the main target of any I/O driver. Devices typically
|
Devices are the main target of any I/O driver. Devices typically
|
||||||
create a programming interface made up of I/O access, interrupts,
|
create a programming interface made up of I/O access, interrupts,
|
||||||
@ -114,21 +116,21 @@ well as mechanisms for describing and registering interrupt
|
|||||||
notifications.
|
notifications.
|
||||||
|
|
||||||
VFIO Usage Example
|
VFIO Usage Example
|
||||||
-------------------------------------------------------------------------------
|
------------------
|
||||||
|
|
||||||
Assume user wants to access PCI device 0000:06:0d.0
|
Assume user wants to access PCI device 0000:06:0d.0::
|
||||||
|
|
||||||
$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
|
$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
|
||||||
../../../../kernel/iommu_groups/26
|
../../../../kernel/iommu_groups/26
|
||||||
|
|
||||||
This device is therefore in IOMMU group 26. This device is on the
|
This device is therefore in IOMMU group 26. This device is on the
|
||||||
pci bus, therefore the user will make use of vfio-pci to manage the
|
pci bus, therefore the user will make use of vfio-pci to manage the
|
||||||
group:
|
group::
|
||||||
|
|
||||||
# modprobe vfio-pci
|
# modprobe vfio-pci
|
||||||
|
|
||||||
Binding this device to the vfio-pci driver creates the VFIO group
|
Binding this device to the vfio-pci driver creates the VFIO group
|
||||||
character devices for this group:
|
character devices for this group::
|
||||||
|
|
||||||
$ lspci -n -s 0000:06:0d.0
|
$ lspci -n -s 0000:06:0d.0
|
||||||
06:0d.0 0401: 1102:0002 (rev 08)
|
06:0d.0 0401: 1102:0002 (rev 08)
|
||||||
@ -136,7 +138,7 @@ $ lspci -n -s 0000:06:0d.0
|
|||||||
# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
|
# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
|
||||||
|
|
||||||
Now we need to look at what other devices are in the group to free
|
Now we need to look at what other devices are in the group to free
|
||||||
it for use by VFIO:
|
it for use by VFIO::
|
||||||
|
|
||||||
$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
|
$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
|
||||||
total 0
|
total 0
|
||||||
@ -147,7 +149,7 @@ lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
|
|||||||
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
|
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
|
||||||
../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
|
../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
|
||||||
|
|
||||||
This device is behind a PCIe-to-PCI bridge[4], therefore we also
|
This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
|
||||||
need to add device 0000:06:0d.1 to the group following the same
|
need to add device 0000:06:0d.1 to the group following the same
|
||||||
procedure as above. Device 0000:00:1e.0 is a bridge that does
|
procedure as above. Device 0000:00:1e.0 is a bridge that does
|
||||||
not currently have a host driver, therefore it's not required to
|
not currently have a host driver, therefore it's not required to
|
||||||
@ -157,12 +159,12 @@ support PCI bridges).
|
|||||||
The final step is to provide the user with access to the group if
|
The final step is to provide the user with access to the group if
|
||||||
unprivileged operation is desired (note that /dev/vfio/vfio provides
|
unprivileged operation is desired (note that /dev/vfio/vfio provides
|
||||||
no capabilities on its own and is therefore expected to be set to
|
no capabilities on its own and is therefore expected to be set to
|
||||||
mode 0666 by the system).
|
mode 0666 by the system)::
|
||||||
|
|
||||||
# chown user:user /dev/vfio/26
|
# chown user:user /dev/vfio/26
|
||||||
|
|
||||||
The user now has full access to all the devices and the iommu for this
|
The user now has full access to all the devices and the iommu for this
|
||||||
group and can access them as follows:
|
group and can access them as follows::
|
||||||
|
|
||||||
int container, group, device, i;
|
int container, group, device, i;
|
||||||
struct vfio_group_status group_status =
|
struct vfio_group_status group_status =
|
||||||
@ -248,7 +250,7 @@ VFIO bus driver API
|
|||||||
VFIO bus drivers, such as vfio-pci make use of only a few interfaces
|
VFIO bus drivers, such as vfio-pci make use of only a few interfaces
|
||||||
into VFIO core. When devices are bound and unbound to the driver,
|
into VFIO core. When devices are bound and unbound to the driver,
|
||||||
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
|
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
|
||||||
respectively:
|
respectively::
|
||||||
|
|
||||||
extern int vfio_add_group_dev(struct iommu_group *iommu_group,
|
extern int vfio_add_group_dev(struct iommu_group *iommu_group,
|
||||||
struct device *dev,
|
struct device *dev,
|
||||||
@ -260,7 +262,7 @@ extern void *vfio_del_group_dev(struct device *dev);
|
|||||||
vfio_add_group_dev() indicates to the core to begin tracking the
|
vfio_add_group_dev() indicates to the core to begin tracking the
|
||||||
specified iommu_group and register the specified dev as owned by
|
specified iommu_group and register the specified dev as owned by
|
||||||
a VFIO bus driver. The driver provides an ops structure for callbacks
|
a VFIO bus driver. The driver provides an ops structure for callbacks
|
||||||
similar to a file operations structure:
|
similar to a file operations structure::
|
||||||
|
|
||||||
struct vfio_device_ops {
|
struct vfio_device_ops {
|
||||||
int (*open)(void *device_data);
|
int (*open)(void *device_data);
|
||||||
@ -285,7 +287,7 @@ own VFIO_DEVICE_GET_REGION_INFO ioctl.
|
|||||||
|
|
||||||
|
|
||||||
PPC64 sPAPR implementation note
|
PPC64 sPAPR implementation note
|
||||||
-------------------------------------------------------------------------------
|
-------------------------------
|
||||||
|
|
||||||
This implementation has some specifics:
|
This implementation has some specifics:
|
||||||
|
|
||||||
@ -293,8 +295,10 @@ This implementation has some specifics:
|
|||||||
container is supported as an IOMMU table is allocated at the boot time,
|
container is supported as an IOMMU table is allocated at the boot time,
|
||||||
one table per a IOMMU group which is a Partitionable Endpoint (PE)
|
one table per a IOMMU group which is a Partitionable Endpoint (PE)
|
||||||
(PE is often a PCI domain but not always).
|
(PE is often a PCI domain but not always).
|
||||||
|
|
||||||
Newer systems (POWER8 with IODA2) have improved hardware design which allows
|
Newer systems (POWER8 with IODA2) have improved hardware design which allows
|
||||||
to remove this limitation and have multiple IOMMU groups per a VFIO container.
|
to remove this limitation and have multiple IOMMU groups per a VFIO
|
||||||
|
container.
|
||||||
|
|
||||||
2) The hardware supports so called DMA windows - the PCI address range
|
2) The hardware supports so called DMA windows - the PCI address range
|
||||||
within which DMA transfer is allowed, any attempt to access address space
|
within which DMA transfer is allowed, any attempt to access address space
|
||||||
@ -302,33 +306,36 @@ out of the window leads to the whole PE isolation.
|
|||||||
|
|
||||||
3) PPC64 guests are paravirtualized but not fully emulated. There is an API
|
3) PPC64 guests are paravirtualized but not fully emulated. There is an API
|
||||||
to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
|
to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
|
||||||
currently there is no way to reduce the number of calls. In order to make things
|
currently there is no way to reduce the number of calls. In order to make
|
||||||
faster, the map/unmap handling has been implemented in real mode which provides
|
things faster, the map/unmap handling has been implemented in real mode
|
||||||
an excellent performance which has limitations such as inability to do
|
which provides an excellent performance which has limitations such as
|
||||||
locked pages accounting in real time.
|
inability to do locked pages accounting in real time.
|
||||||
|
|
||||||
4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
|
4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
|
||||||
subtree that can be treated as a unit for the purposes of partitioning and
|
subtree that can be treated as a unit for the purposes of partitioning and
|
||||||
error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
|
error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
|
||||||
function of a multi-function IOA, or multiple IOAs (possibly including switch
|
function of a multi-function IOA, or multiple IOAs (possibly including
|
||||||
and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
|
switch and bridge structures above the multiple IOAs). PPC64 guests detect
|
||||||
and recover from them via EEH RTAS services, which works on the basis of
|
PCI errors and recover from them via EEH RTAS services, which works on the
|
||||||
additional ioctl commands.
|
basis of additional ioctl commands.
|
||||||
|
|
||||||
So 4 additional ioctls have been added:
|
So 4 additional ioctls have been added:
|
||||||
|
|
||||||
VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
|
VFIO_IOMMU_SPAPR_TCE_GET_INFO
|
||||||
of the DMA window on the PCI bus.
|
returns the size and the start of the DMA window on the PCI bus.
|
||||||
|
|
||||||
VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
|
VFIO_IOMMU_ENABLE
|
||||||
|
enables the container. The locked pages accounting
|
||||||
is done at this point. This lets user first to know what
|
is done at this point. This lets user first to know what
|
||||||
the DMA window is and adjust rlimit before doing any real job.
|
the DMA window is and adjust rlimit before doing any real job.
|
||||||
|
|
||||||
VFIO_IOMMU_DISABLE - disables the container.
|
VFIO_IOMMU_DISABLE
|
||||||
|
disables the container.
|
||||||
|
|
||||||
VFIO_EEH_PE_OP - provides an API for EEH setup, error detection and recovery.
|
VFIO_EEH_PE_OP
|
||||||
|
provides an API for EEH setup, error detection and recovery.
|
||||||
|
|
||||||
The code flow from the example above should be slightly changed:
|
The code flow from the example above should be slightly changed::
|
||||||
|
|
||||||
struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
|
struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
|
||||||
|
|
||||||
@ -475,21 +482,21 @@ create those in run-time if the guest driver supports 64bit DMA.
|
|||||||
|
|
||||||
VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
|
VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
|
||||||
a number of TCE table levels (if a TCE table is going to be big enough and
|
a number of TCE table levels (if a TCE table is going to be big enough and
|
||||||
the kernel may not be able to allocate enough of physically contiguous memory).
|
the kernel may not be able to allocate enough of physically contiguous
|
||||||
It creates a new window in the available slot and returns the bus address where
|
memory). It creates a new window in the available slot and returns the bus
|
||||||
the new window starts. Due to hardware limitation, the user space cannot choose
|
address where the new window starts. Due to hardware limitation, the user
|
||||||
the location of DMA windows.
|
space cannot choose the location of DMA windows.
|
||||||
|
|
||||||
VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
|
VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
|
||||||
and removes it.
|
and removes it.
|
||||||
|
|
||||||
-------------------------------------------------------------------------------
|
-------------------------------------------------------------------------------
|
||||||
|
|
||||||
[1] VFIO was originally an acronym for "Virtual Function I/O" in its
|
.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
|
||||||
initial implementation by Tom Lyon while as Cisco. We've since
|
initial implementation by Tom Lyon while as Cisco. We've since
|
||||||
outgrown the acronym, but it's catchy.
|
outgrown the acronym, but it's catchy.
|
||||||
|
|
||||||
[2] "safe" also depends upon a device being "well behaved". It's
|
.. [2] "safe" also depends upon a device being "well behaved". It's
|
||||||
possible for multi-function devices to have backdoors between
|
possible for multi-function devices to have backdoors between
|
||||||
functions and even for single function devices to have alternative
|
functions and even for single function devices to have alternative
|
||||||
access to things like PCI config space through MMIO registers. To
|
access to things like PCI config space through MMIO registers. To
|
||||||
@ -500,13 +507,13 @@ still provide isolation. For PCI, SR-IOV Virtual Functions are the
|
|||||||
best indicator of "well behaved", as these are designed for
|
best indicator of "well behaved", as these are designed for
|
||||||
virtualization usage models.
|
virtualization usage models.
|
||||||
|
|
||||||
[3] As always there are trade-offs to virtual machine device
|
.. [3] As always there are trade-offs to virtual machine device
|
||||||
assignment that are beyond the scope of VFIO. It's expected that
|
assignment that are beyond the scope of VFIO. It's expected that
|
||||||
future IOMMU technologies will reduce some, but maybe not all, of
|
future IOMMU technologies will reduce some, but maybe not all, of
|
||||||
these trade-offs.
|
these trade-offs.
|
||||||
|
|
||||||
[4] In this case the device is below a PCI bridge, so transactions
|
.. [4] In this case the device is below a PCI bridge, so transactions
|
||||||
from either function of the device are indistinguishable to the iommu:
|
from either function of the device are indistinguishable to the iommu::
|
||||||
|
|
||||||
-[0000:00]-+-1e.0-[06]--+-0d.0
|
-[0000:00]-+-1e.0-[06]--+-0d.0
|
||||||
\-0d.1
|
\-0d.1
|
||||||
|