linux_dsm_epyc7002/arch/x86
Linus Torvalds c228d294f2 x86: explicitly align IO accesses in memcpy_{to,from}io
In commit 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
I made our copy from IO space use a separate copy routine rather than
rely on the generic memcpy.  I did that because our generic memory copy
isn't actually well-defined when it comes to internal access ordering or
alignment, and will in fact depend on various CPUID flags.

In particular, the default memcpy() for a modern Intel CPU will
generally be just a "rep movsb", which works reasonably well for
medium-sized memory copies of regular RAM, since the CPU will turn it
into fairly optimized microcode.

However, for non-cached memory and IO, "rep movs" ends up being
horrendously slow and will just do the architectural "one byte at a
time" accesses implied by the movsb.

At the other end of the spectrum, if you _don't_ end up using the "rep
movsb" code, you'd likely fall back to the software copy, which does
overlapping accesses for the tail, and may copy things backwards.
Again, for regular memory that's fine, for IO memory not so much.

The thinking was that clearly nobody really cared (because things
worked), but some people had seen horrible performance due to the byte
accesses, so let's just revert back to our long ago version that dod
"rep movsl" for the bulk of the copy, and then fixed up the potentially
last few bytes of the tail with "movsw/b".

Interestingly (and perhaps not entirely surprisingly), while that was
our original memory copy implementation, and had been used before for
IO, in the meantime many new users of memcpy_*io() had come about.  And
while the access patterns for the memory copy weren't well-defined (so
arguably _any_ access pattern should work), in practice the "rep movsb"
case had been very common for the last several years.

In particular Jarkko Sakkinen reported that the memcpy_*io() change
resuled in weird errors from his Geminilake NUC TPM module.

And it turns out that the TPM TCG accesses according to spec require
that the accesses be

 (a) done strictly sequentially

 (b) be naturally aligned

otherwise the TPM chip will abort the PCI transaction.

And, in fact, the tpm_crb.c driver did this:

	memcpy_fromio(buf, priv->rsp, 6);
	...
	memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);

which really should never have worked in the first place, but back
before commit 170d13ca3a it *happened* to work, because the
memcpy_fromio() would be expanded to a regular memcpy, and

 (a) gcc would expand the first memcpy in-line, and turn it into a
     4-byte and a 2-byte read, and they happened to be in the right
     order, and the alignment was right.

 (b) gcc would call "memcpy()" for the second one, and the machines that
     had this TPM chip also apparently ended up always having ERMS
     ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep
     movbs" for that copy.

In other words, basically by pure luck, the code happened to use the
right access sizes in the (two different!) memcpy() implementations to
make it all work.

But after commit 170d13ca3a, both of the memcpy_fromio() calls
resulted in a call to the routine with the consistent memory accesses,
and in both cases it started out transferring with 4-byte accesses.
Which worked for the first copy, but resulted in the second copy doing a
32-bit read at an address that was only 2-byte aligned.

Jarkko is actually fixing the fragile code in the TPM driver, but since
this is an excellent example of why we absolutely must not use a generic
memcpy for IO accesses, _and_ an IO-specific one really should strive to
align the IO accesses, let's do exactly that.

Side note: Jarkko also noted that the driver had been used on ARM
platforms, and had worked.  That was because on 32-bit ARM, memcpy_*io()
ends up always doing byte accesses, and on 64-bit ARM it first does byte
accesses to align to 8-byte boundaries, and then does 8-byte accesses
for the bulk.

So ARM actually worked by design, and the x86 case worked by pure luck.

We *might* want to make x86-64 do the 8-byte case too.  That should be a
pretty straightforward extension, but let's do one thing at a time.  And
generally MMIO accesses aren't really all that performance-critical, as
shown by the fact that for a long time we just did them a byte at a
time, and very few people ever noticed.

Reported-and-tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Tested-by: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: David Laight <David.Laight@aculab.com>
Fixes: 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 09:07:48 -08:00
..
boot kbuild: remove redundant target cleaning on failure 2019-01-06 09:46:51 +09:00
configs PCI: consolidate PCI config entry in drivers/pci 2018-11-23 11:45:34 +09:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2018-12-27 13:53:32 -08:00
entry x86/entry/64/compat: Fix stack switching for XEN PV 2019-01-18 00:39:33 +01:00
events Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 17:03:51 -08:00
hyperv x86/hyper-v: Add HvFlushGuestAddressList hypercall support 2018-12-21 11:28:39 +01:00
ia32 Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
include Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-01-27 12:02:00 -08:00
kernel Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-01-27 12:02:00 -08:00
kvm KVM: x86: Mark expected switch fall-throughs 2019-01-25 19:29:36 +01:00
lib x86: explicitly align IO accesses in memcpy_{to,from}io 2019-02-01 09:07:48 -08:00
math-emu Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
mm x86/mm/mem_encrypt: Fix erroneous sizeof() 2019-01-15 11:41:58 +01:00
net bpf: Add bpf_line_info support 2018-12-09 13:54:38 -08:00
oprofile x86/oprofile: Fix bogus GCC-8 warning in nmi_setup() 2018-02-21 09:54:17 +01:00
pci pci-v4.21-changes 2019-01-05 17:57:34 -08:00
platform Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 18:42:51 -08:00
power mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
purgatory kbuild: move bin2c back to scripts/ from scripts/basic/ 2018-07-18 01:18:05 +09:00
ras
realmode x86-64/realmode: Add instruction suffix 2018-02-20 09:33:41 +01:00
tools x86: Clean up 'sizeof x' => 'sizeof(x)' 2018-10-29 07:13:28 +01:00
um Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
video
xen xen: fixes for 5.0-rc3 2019-01-19 05:53:41 +12:00
.gitignore x86/build: Add arch/x86/tools/insn_decoder_test to .gitignore 2018-02-13 14:10:29 +01:00
Kbuild KVM: x86: Allow Qemu/KVM to use PVH entry point 2018-12-13 13:41:49 -05:00
Kconfig x86/Kconfig: Select PCI_LOCKLESS_CONFIG if PCI is enabled 2019-01-22 17:06:28 +01:00
Kconfig.cpu x86/cpu: Create Hygon Dhyana architecture support file 2018-09-27 16:14:05 +02:00
Kconfig.debug x86/kconfig: Remove redundant 'default n' lines from all x86 Kconfig's 2018-10-17 08:39:42 +02:00
Makefile jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00
Makefile_32.cpu
Makefile.um x86, powerpc: Remove -funit-at-a-time compiler option entirely 2018-12-09 11:55:32 +01:00