Microarchitectural Data Sampling (MDS) mitigation
=================================================

.. _mds:

Overview
--------

Microarchitectural Data Sampling (MDS) is a family of side channel attacks
on internal buffers in Intel CPUs. The variants are:

 - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
 - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
 - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)

MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
dependent load (store-to-load forwarding) as an optimization. The forward
can also happen to a faulting or assisting load operation for a different
memory address, which can be exploited under certain conditions. Store
buffers are partitioned between Hyper-Threads so cross thread forwarding is
not possible. But if a thread enters or exits a sleep state the store
buffer is repartitioned which can expose data from one thread to the other.

MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
L1 miss situations and to hold data which is returned or sent in response
to a memory or I/O operation. Fill buffers can forward data to a load
operation and also write data to the cache. When the fill buffer is
deallocated it can retain the stale data of the preceding operations which
can then be forwarded to a faulting or assisting load operation, which can
be exploited under certain conditions. Fill buffers are shared between
Hyper-Threads so cross thread leakage is possible.

MLPDS leaks Load Port Data. Load ports are used to perform load operations
from memory or I/O. The received data is then forwarded to the register
file or a subsequent operation. In some implementations the Load Port can
contain stale data from a previous operation which can be forwarded to
faulting or assisting loads under certain conditions, which again can
eventually be exploited. Load ports are shared between Hyper-Threads so cross
thread leakage is possible.
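
The three variants and their buffer arrangement can be summarized as a small
lookup table. This is only an illustrative sketch: the struct, enum and
function names below are hypothetical and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* How a buffer is arranged between the Hyper-Threads of one core. */
enum mds_buffer_sharing {
	MDS_BUF_PARTITIONED,	/* split per thread, repartitioned on sleep state changes */
	MDS_BUF_SHARED,		/* used by both threads at the same time */
};

struct mds_variant {
	const char *name;
	const char *cve;
	const char *buffer;
	enum mds_buffer_sharing sharing;
};

/* Illustrative table only; these names are not kernel identifiers. */
static const struct mds_variant mds_variants[] = {
	{ "MSBDS", "CVE-2018-12126", "store buffer", MDS_BUF_PARTITIONED },
	{ "MFBDS", "CVE-2018-12130", "fill buffer",  MDS_BUF_SHARED },
	{ "MLPDS", "CVE-2018-12127", "load port",    MDS_BUF_SHARED },
};

#define MDS_VARIANT_COUNT (sizeof(mds_variants) / sizeof(mds_variants[0]))

/* Look up a variant by its short name; returns NULL if unknown. */
const struct mds_variant *mds_find_variant(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < MDS_VARIANT_COUNT; i++) {
		if (strcmp(mds_variants[i].name, name) == 0)
			return &mds_variants[i];
	}
	return NULL;
}

/* True when data can leak to the sibling while both threads are running. */
bool mds_leaks_across_running_siblings(const struct mds_variant *v)
{
	return v->sharing == MDS_BUF_SHARED;
}
```

Only MSBDS is marked as partitioned, which is why it is the one variant whose
cross thread exposure is tied to sleep state transitions.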


Exposure assumptions
--------------------

It is assumed that attack code resides in user space or in a guest with one
exception. The rationale behind this assumption is that the code construct
needed for exploiting MDS requires:

 - to control the load to trigger a fault or assist

 - to have a disclosure gadget which exposes the speculatively accessed
   data for consumption through a side channel.

 - to control the pointer through which the disclosure gadget exposes the
   data

The existence of such a construct in the kernel cannot be excluded with
100% certainty, but the complexity involved makes it extremely unlikely.

There is one exception, which is untrusted BPF. The functionality of
untrusted BPF is limited, but it needs to be thoroughly investigated
whether it can be used to create such a construct.


Mitigation strategy
-------------------

All variants have the same mitigation strategy at least for the single CPU
thread case (SMT off): Force the CPU to clear the affected buffers.

This is achieved by using the otherwise unused and obsolete VERW
instruction in combination with a microcode update. The microcode clears
the affected CPU buffers when the VERW instruction is executed.

For virtualization there are two ways to achieve CPU buffer clearing:
either via the modified VERW instruction or via the L1D Flush
command. The latter is issued when L1TF mitigation is enabled so the extra
VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
be issued.
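
The choice between the two clearing methods on guest entry can be sketched as
a small decision function. This is a simplified model under stated
assumptions: the enum, the function name and the two boolean inputs are
hypothetical and do not correspond to the actual KVM code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: which mechanism clears the buffers on guest entry. */
enum guest_entry_clear {
	GUEST_ENTRY_CLEAR_NONE,		/* MDS mitigation disabled */
	GUEST_ENTRY_CLEAR_L1D_FLUSH,	/* the L1D flush also clears the buffers */
	GUEST_ENTRY_CLEAR_VERW,		/* explicit VERW is needed */
};

/*
 * Pick the clearing method. l1d_flush_on_entry is assumed to be true only
 * when the L1TF mitigation issues an L1D flush on every guest entry.
 */
enum guest_entry_clear pick_guest_entry_clear(bool mds_mitigation_on,
					      bool l1d_flush_on_entry)
{
	if (!mds_mitigation_on)
		return GUEST_ENTRY_CLEAR_NONE;
	if (l1d_flush_on_entry)
		return GUEST_ENTRY_CLEAR_L1D_FLUSH;	/* extra VERW avoided */
	return GUEST_ENTRY_CLEAR_VERW;
}
```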

If the VERW instruction with the supplied segment selector argument is
executed on a CPU without the microcode update there is no side effect
other than a small number of pointlessly wasted CPU cycles.

This does not protect against cross Hyper-Thread attacks except for MSBDS
which is only exploitable cross Hyper-Thread when one of the Hyper-Threads
enters a C-state.

The kernel provides a function to invoke the buffer clearing::

    mds_clear_cpu_buffers()
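
The mechanism behind such a function can be sketched in user space. The
following is not the kernel implementation: verw_clear_sketch() is a
hypothetical name, and the sketch assumes GCC or Clang inline assembly. It
uses the memory-operand form of VERW with the current stack segment selector
as a valid writable data segment. On CPUs without the microcode update, or on
non-x86 builds, it has no buffer clearing effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user space analogue of a VERW based buffer clear.
 * Returns true if VERW was executed, false on non-x86 builds.
 */
bool verw_clear_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	uint16_t sel;

	/* Read the stack segment selector, which names a writable data segment. */
	__asm__ volatile("mov %%ss, %0" : "=r"(sel));

	/* Memory-operand form of VERW; VERW modifies ZF, hence the "cc" clobber. */
	__asm__ volatile("verw %[sel]" : : [sel] "m"(sel) : "cc");
	return true;
#else
	return false;
#endif
}
```

The selector is passed through memory rather than a register because the
buffer overwrite is tied to the memory-operand variant of the instruction.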

The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
(idle) transitions.

According to current knowledge additional mitigations inside the kernel
itself are not required because the necessary gadgets to expose the leaked
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.

Mitigation points
-----------------

1. Return to user space
^^^^^^^^^^^^^^^^^^^^^^^

   When transitioning from kernel to user space the CPU buffers are flushed
   on affected CPUs when the mitigation is not disabled on the kernel
   command line. The mitigation is enabled through the static key
   mds_user_clear.

   The mitigation is invoked in prepare_exit_to_usermode() which covers
   most of the kernel to user space transitions. There are a few exceptions
   which are not invoking prepare_exit_to_usermode() on return to user
   space. These exceptions use the paranoid exit code.

   - Non Maskable Interrupt (NMI):

     Access to sensitive data like keys or credentials in the NMI context
     is mostly theoretical: the CPU can do prefetching or execute a
     misspeculated code path and thereby fetch data which might end up
     leaking through a buffer.

     But for mounting other attacks the kernel stack address of the task is
     already valuable information. So in full mitigation mode, the NMI is
     mitigated on the return from do_nmi() to provide almost complete
     coverage.

   - Double fault (#DF):

     A double fault is usually fatal, but the ESPFIX workaround, which can
     be triggered from user space through modify_ldt(2) is a recoverable
     double fault. #DF uses the paranoid exit path, so explicit mitigation
     in the double fault handler is required.

   - Machine Check Exception (#MC):

     Another corner case is a #MC which hits between the CPU buffer clear
     invocation and the actual return to user. As this still is in kernel
     space it takes the paranoid exit path which does not clear the CPU
     buffers. So the #MC handler repopulates the buffers to some
     extent. Machine checks are not reliably controllable and the window is
     extremely small so mitigation would just tick a checkbox that this
     theoretical corner case is covered. To keep the amount of special
     cases small, ignore #MC.

   - Debug Exception (#DB):

     This takes the paranoid exit path only when the INT1 breakpoint is in
     kernel space. #DB on a user space address takes the regular exit path,
     so no extra mitigation is required.
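
The coverage of the return paths described above can be condensed into a small
model. This is a sketch only: the enum and function names are hypothetical,
and a plain boolean stands in for the static key mds_user_clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the ways the kernel can return to user space. */
enum exit_path {
	EXIT_REGULAR,		/* prepare_exit_to_usermode(), includes #DB on a user address */
	EXIT_NMI,		/* return from do_nmi() */
	EXIT_DOUBLE_FAULT,	/* recoverable #DF from the ESPFIX workaround */
	EXIT_MACHINE_CHECK,	/* #MC between buffer clear and return to user */
};

/*
 * Returns true if the CPU buffers are cleared on this path.
 * user_clear_enabled models the state of the static key mds_user_clear.
 */
bool exit_path_clears_buffers(enum exit_path path, bool user_clear_enabled)
{
	if (!user_clear_enabled)
		return false;

	switch (path) {
	case EXIT_REGULAR:
	case EXIT_NMI:
	case EXIT_DOUBLE_FAULT:
		return true;
	case EXIT_MACHINE_CHECK:
		return false;	/* deliberately ignored corner case */
	}
	return false;
}
```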


2. C-State transition
^^^^^^^^^^^^^^^^^^^^^

   When a CPU goes idle and enters a C-State the CPU buffers need to be
   cleared on affected CPUs when SMT is active. This addresses the
   repartitioning of the store buffer when one of the Hyper-Threads enters
   a C-State.

   When SMT is inactive, i.e. either the CPU does not support it or all
   sibling threads are offline, CPU buffer clearing is not required.

   The idle clearing is enabled on CPUs which are only affected by MSBDS
   and not by any other MDS variant. The other MDS variants cannot be
   protected against cross Hyper-Thread attacks because the Fill Buffer and
   the Load Ports are shared. So on CPUs affected by other variants, the
   idle clearing would be a window dressing exercise and is therefore not
   activated.

   The invocation is controlled by the static key mds_idle_clear which is
   switched depending on the chosen mitigation mode and the SMT state of
   the system.
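
The conditions for switching the idle clearing on can be sketched as a
predicate. This is a simplified model: the flag names and the function are
hypothetical, and it assumes that a disabled mitigation mode always leaves
the idle clearing off.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags describing which MDS variants affect a CPU. */
#define MDS_AFFECTED_MSBDS	(1u << 0)
#define MDS_AFFECTED_MFBDS	(1u << 1)
#define MDS_AFFECTED_MLPDS	(1u << 2)

/*
 * Decide whether idle clearing should be enabled. It only makes sense when
 * SMT is active and the CPU is affected by MSBDS and by no other variant.
 */
bool want_idle_clear(bool mitigation_on, bool smt_active, unsigned int affected)
{
	if (!mitigation_on || !smt_active)
		return false;
	return affected == MDS_AFFECTED_MSBDS;
}
```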

   The buffer clear is only invoked before entering the C-State to prevent
   stale data from the idling CPU from spilling to the Hyper-Thread
   sibling after the store buffer got repartitioned and all entries are
   available to the non-idle sibling.

   When coming out of idle the store buffer is partitioned again so each
   sibling has half of it available. The CPU which returned from idle could
   then be speculatively exposed to contents of the sibling. The buffers are
   flushed either on exit to user space or on VMENTER so malicious code
   in user space or the guest cannot speculatively access them.

   The mitigation is hooked into all variants of halt()/mwait(), but does
   not cover the legacy ACPI IO-Port mechanism. The ACPI idle driver was
   superseded around 2010 by the intel_idle driver, which is preferred on
   all affected CPUs that are expected to gain the MD_CLEAR functionality
   in microcode. Apart from that, the IO-Port mechanism is a legacy
   interface which is only used on older systems which are either not
   affected or do not receive microcode updates anymore.
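
The ordering on idle entry described above (clear before entering the
C-State, no clear on wakeup) can be illustrated with a small trace model.
This is a sketch under stated assumptions: all names are hypothetical, the
halt/mwait step is only recorded as an event, and a boolean stands in for
the static key mds_idle_clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events recorded while a CPU goes through one idle cycle. */
enum idle_event { EV_CLEAR_BUFFERS, EV_ENTER_CSTATE, EV_WAKEUP };

struct idle_trace {
	enum idle_event events[4];
	size_t len;
};

static void trace_add(struct idle_trace *t, enum idle_event ev)
{
	assert(t->len < sizeof(t->events) / sizeof(t->events[0]));
	t->events[t->len++] = ev;
}

/* Model of one idle cycle; idle_clear_enabled stands in for mds_idle_clear. */
void idle_cycle_sketch(struct idle_trace *t, bool idle_clear_enabled)
{
	/* Clear first, so no stale entries remain when the store buffer is repartitioned. */
	if (idle_clear_enabled)
		trace_add(t, EV_CLEAR_BUFFERS);

	trace_add(t, EV_ENTER_CSTATE);	/* stands in for halt/mwait */

	/* No clear on wakeup: exit to user space or VMENTER covers that case. */
	trace_add(t, EV_WAKEUP);
}
```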