linux_dsm_epyc7002/arch
Sourabh Jain ba608c4fa1 powerpc/fadump: fix race between pstore write and fadump crash trigger
When we enter into fadump crash path via system reset we fail to update
the pstore.

On the system reset path we first update the pstore then we go for fadump
crash. But the problem here is when all the CPUs try to get the pstore
lock to initiate the pstore write, only one CPUs will acquire the lock
and proceed with the pstore write. Since it in NMI context CPUs that fail
to get lock do not wait for their turn to write to the pstore and simply
proceed with the next operation which is fadump crash. One of the CPU who
proceeded with fadump crash path triggers the crash and does not wait for
the CPU who gets the pstore lock to complete the pstore update.

Timeline diagram to depicts the sequence of events that leads to an
unsuccessful pstore update when we hit fadump crash path via system reset.

                 1    2     3    ...      n   CPU Threads
                 |    |     |             |
                 |    |     |             |
 Reached to   -->|--->|---->| ----------->|
 system reset    |    |     |             |
 path            |    |     |             |
                 |    |     |             |
 Try to       -->|--->|---->|------------>|
 acquire the     |    |     |             |
 pstore lock     |    |     |             |
                 |    |     |             |
                 |    |     |             |
 Got the      -->| +->|     |             |<-+
 pstore lock     | |  |     |             |  |-->  Didn't get the
                 | --------------------------+     lock and moving
                 |    |     |             |        ahead on fadump
                 |    |     |             |        crash path
                 |    |     |             |
  Begins the  -->|    |     |             |
  process to     |    |     |             |<-- Got the chance to
  update the     |    |     |             |    trigger the crash
  pstore         | -> |     |    ... <-   |
                 | |  |     |         |   |
                 | |  |     |         |   |<-- Triggers the
                 | |  |     |         |   |    crash
                 | |  |     |         |   |      ^
                 | |  |     |         |   |      |
  Writing to  -->| |  |     |         |   |      |
  pstore         | |  |     |         |   |      |
                   |                  |          |
       ^           |__________________|          |
       |               CPU Relax                 |
       |                                         |
       +-----------------------------------------+
                          |
                          v
            Race: crash triggered before pstore
                  update completes

To avoid this race condition a barrier is added on crash_fadump path, it
prevents the CPU to trigger the crash until all the online CPUs completes
their task.

A barrier is added to make sure all the secondary CPUs hit the
crash_fadump function before we initiates the crash. A timeout is kept to
ensure the primary CPU (one who initiates the crash) do not wait for
secondary CPUs indefinitely.

Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200713052435.183750-1-sourabhjain@linux.ibm.com
2020-07-16 13:12:44 +10:00
..
alpha Kbuild updates for v5.8 (2nd) 2020-06-13 13:29:16 -07:00
arc treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
arm maccess: make get_kernel_nofault() check for minimal type compatibility 2020-06-18 12:10:37 -07:00
arm64 Kbuild fixes for v5.8 2020-06-21 12:44:52 -07:00
c6x This time around we have 4 lines of diff in the core framework, removing a 2020-06-10 11:42:19 -07:00
csky maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00
h8300 This time around we have 4 lines of diff in the core framework, removing a 2020-06-10 11:42:19 -07:00
hexagon treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
ia64 Merge branch 'hch' (maccess patches from Christoph Hellwig) 2020-06-18 12:35:51 -07:00
m68k Kbuild updates for v5.8 (2nd) 2020-06-13 13:29:16 -07:00
microblaze mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mips maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00
nds32 maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00
nios2 nios2 update for v5.8-rc1 2020-06-12 11:55:11 -07:00
openrisc OpenRISC updates for 5.8 2020-06-13 10:54:09 -07:00
parisc maccess: make get_kernel_nofault() check for minimal type compatibility 2020-06-18 12:10:37 -07:00
powerpc powerpc/fadump: fix race between pstore write and fadump crash trigger 2020-07-16 13:12:44 +10:00
riscv RISC-V Fixes for 5.8-rc2 2020-06-20 12:14:29 -07:00
s390 s390 fixes for 5.8-rc2 2020-06-20 12:31:08 -07:00
sh maccess: rename probe_kernel_address to get_kernel_nofault 2020-06-18 11:14:40 -07:00
sparc treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
um maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00
unicore32 This time around we have 4 lines of diff in the core framework, removing a 2020-06-10 11:42:19 -07:00
x86 Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2020-06-20 19:18:27 -07:00
xtensa mmap locking API: convert mmap_sem API comments 2020-06-09 09:39:14 -07:00
.gitignore
Kconfig treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00