License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 21:07:57 +07:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2008-10-23 12:26:29 +07:00
|
|
|
#ifndef _ASM_X86_PGTABLE_3LEVEL_H
|
|
|
|
#define _ASM_X86_PGTABLE_3LEVEL_H
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2018-08-21 22:37:55 +07:00
|
|
|
#include <asm/atomic64_32.h>
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Intel Physical Address Extension (PAE) Mode - three-level page
|
|
|
|
* tables on PPro+ CPUs.
|
|
|
|
*
|
|
|
|
* Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
|
|
|
|
*/
|
|
|
|
|
2008-03-23 15:03:10 +07:00
|
|
|
#define pte_ERROR(e) \
|
2012-05-22 09:50:07 +07:00
|
|
|
pr_err("%s:%d: bad pte %p(%08lx%08lx)\n", \
|
2008-03-23 15:03:10 +07:00
|
|
|
__FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
|
|
|
|
#define pmd_ERROR(e) \
|
2012-05-22 09:50:07 +07:00
|
|
|
pr_err("%s:%d: bad pmd %p(%016Lx)\n", \
|
2008-03-23 15:03:10 +07:00
|
|
|
__FILE__, __LINE__, &(e), pmd_val(e))
|
|
|
|
#define pgd_ERROR(e) \
|
2012-05-22 09:50:07 +07:00
|
|
|
pr_err("%s:%d: bad pgd %p(%016Lx)\n", \
|
2008-03-23 15:03:10 +07:00
|
|
|
__FILE__, __LINE__, &(e), pgd_val(e))
|
2008-01-30 19:34:11 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* Rules for using set_pte: the pte being assigned *must* be
|
|
|
|
* either not present or in a state where the hardware will
|
|
|
|
* not attempt to update the pte. In places where this is
|
|
|
|
* not possible, use pte_get_and_clear to obtain the old pte
|
|
|
|
* value and then use set_pte to update it. -ben
|
|
|
|
*/
|
2007-05-03 00:27:13 +07:00
|
|
|
static inline void native_set_pte(pte_t *ptep, pte_t pte)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
ptep->pte_high = pte.pte_high;
|
|
|
|
smp_wmb();
|
|
|
|
ptep->pte_low = pte.pte_low;
|
|
|
|
}
|
|
|
|
|
mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 05:06:49 +07:00
|
|
|
#define pmd_read_atomic pmd_read_atomic
|
|
|
|
/*
|
2019-09-25 13:38:57 +07:00
|
|
|
* pte_offset_map_lock() on 32-bit PAE kernels was reading the pmd_t with
|
|
|
|
* a "*pmdp" dereference done by GCC. Problem is, in certain places
|
|
|
|
* where pte_offset_map_lock() is called, concurrent page faults are
|
mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 05:06:49 +07:00
|
|
|
* allowed, if the mmap_sem is hold for reading. An example is mincore
|
|
|
|
* vs page faults vs MADV_DONTNEED. On the page fault side
|
2019-09-25 13:38:57 +07:00
|
|
|
* pmd_populate() rightfully does a set_64bit(), but if we're reading the
|
mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 05:06:49 +07:00
|
|
|
* pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
|
2019-09-25 13:38:57 +07:00
|
|
|
* because GCC will not read the 64-bit value of the pmd atomically.
|
|
|
|
*
|
|
|
|
* To fix this all places running pte_offset_map_lock() while holding the
|
mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 05:06:49 +07:00
|
|
|
* mmap_sem in read mode, shall read the pmdp pointer using this
|
2019-09-25 13:38:57 +07:00
|
|
|
* function to know if the pmd is null or not, and in turn to know if
|
2019-09-25 08:44:53 +07:00
|
|
|
* they can run pte_offset_map_lock() or pmd_trans_huge() or other pmd
|
mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 05:06:49 +07:00
|
|
|
* operations.
|
|
|
|
*
|
2019-09-25 13:38:57 +07:00
|
|
|
* Without THP if the mmap_sem is held for reading, the pmd can only
|
|
|
|
* transition from null to not null while pmd_read_atomic() runs. So
|
2012-06-21 02:52:57 +07:00
|
|
|
* we can always return atomic pmd values with this function.
|
mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 05:06:49 +07:00
|
|
|
*
|
2019-09-25 13:38:57 +07:00
|
|
|
* With THP if the mmap_sem is held for reading, the pmd can become
|
2012-06-21 02:52:57 +07:00
|
|
|
* trans_huge or none or point to a pte (and in turn become "stable")
|
2019-09-25 13:38:57 +07:00
|
|
|
* at any time under pmd_read_atomic(). We could read it truly
|
|
|
|
* atomically here with an atomic64_read() for the THP enabled case (and
|
2012-06-21 02:52:57 +07:00
|
|
|
* it would be a whole lot simpler), but to avoid using cmpxchg8b we
|
|
|
|
* only return an atomic pmdval if the low part of the pmdval is later
|
2019-09-25 13:38:57 +07:00
|
|
|
* found to be stable (i.e. pointing to a pte). We are also returning a
|
|
|
|
* 'none' (zero) pmdval if the low part of the pmd is zero.
|
|
|
|
*
|
|
|
|
* In some cases the high and low part of the pmdval returned may not be
|
|
|
|
* consistent if THP is enabled (the low part may point to previously
|
|
|
|
* mapped hugepage, while the high part may point to a more recently
|
|
|
|
* mapped hugepage), but pmd_none_or_trans_huge_or_clear_bad() only
|
|
|
|
* needs the low part of the pmd to be read atomically to decide if the
|
|
|
|
* pmd is unstable or not, with the only exception when the low part
|
|
|
|
* of the pmd is zero, in which case we return a 'none' pmd.
|
mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 05:06:49 +07:00
|
|
|
*/
|
|
|
|
static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
pmdval_t ret;
|
|
|
|
u32 *tmp = (u32 *)pmdp;
|
|
|
|
|
|
|
|
ret = (pmdval_t) (*tmp);
|
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* If the low part is null, we must not read the high part
|
|
|
|
* or we can end up with a partial pmd.
|
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
ret |= ((pmdval_t)*(tmp + 1)) << 32;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (pmd_t) { ret };
|
|
|
|
}
|
|
|
|
|
2007-05-03 00:27:13 +07:00
|
|
|
static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
|
|
|
|
{
|
2008-03-23 15:03:10 +07:00
|
|
|
set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
|
2007-05-03 00:27:13 +07:00
|
|
|
}
|
2008-03-23 15:03:10 +07:00
|
|
|
|
2007-05-03 00:27:13 +07:00
|
|
|
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
|
|
|
|
{
|
2008-03-23 15:03:10 +07:00
|
|
|
set_64bit((unsigned long long *)(pmdp), native_pmd_val(pmd));
|
2007-05-03 00:27:13 +07:00
|
|
|
}
|
2008-03-23 15:03:10 +07:00
|
|
|
|
2007-05-03 00:27:13 +07:00
|
|
|
static inline void native_set_pud(pud_t *pudp, pud_t pud)
|
|
|
|
{
|
2018-07-18 16:40:59 +07:00
|
|
|
#ifdef CONFIG_PAGE_TABLE_ISOLATION
|
|
|
|
pud.p4d.pgd = pti_set_user_pgtbl(&pudp->p4d.pgd, pud.p4d.pgd);
|
|
|
|
#endif
|
2008-03-23 15:03:10 +07:00
|
|
|
set_64bit((unsigned long long *)(pudp), native_pud_val(pud));
|
2007-05-03 00:27:13 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case
Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug,
so use the same fix for both. Turns out pmd_clear had it as well, but pgds
are not affected.
The problem is rather intricate. Page table entries in PAE mode are 64-bits
wide, but the only atomic 8-byte write operation available in 32-bit mode is
cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can
happen that the processor may prefetch entries into the TLB in the middle of an
operation which clears a page table entry. So one must always clear the P-bit
in the low word of the page table entry first when clearing it.
Since the sequence *ptep = __pte(0) leaves the order of the write dependent on
the compiler, it must be coded explicitly as a clear of the low word followed
by a clear of the high word. Further, there must be a write memory barrier
here to enforce proper ordering by the compiler (and, in the future, by the
processor as well).
On > 4GB memory machines, the implementation of pte_clear for PAE was clearly
deficient, as it could leave virtual mappings of physical memory above 4GB
aliased to memory below 4GB in the TLB. The implementation of
ptep_get_and_clear_full has a similar bug, although not nearly as likely to
occur, since the mappings being cleared are in the process of being destroyed,
and should never be dereferenced again.
But, as luck would have it, it is possible to trigger bugs even without ever
dereferencing these bogus TLB mappings, even if the clear is followed fairly
soon after with a TLB flush or invalidation. The problem is that memory above
4GB may now be aliased into the first 4GB of memory, and in fact, may hit a
region of memory with non-memory semantics. These regions include AGP and PCI
space. As such, these memory regions are not cached by the processor. This
introduces the bug.
The processor can speculate memory operations, including memory writes, as long
as they are committed with the proper ordering. Speculating a memory write to
a linear address that has a bogus TLB mapping is possible. Normally, the
speculation is harmless. But for cached memory, it does leave the falsely
speculated cacheline unmodified, but in a dirty state. This cache line will be
eventually written back. If this cacheline happens to intersect a region of
memory that is not protected by the cache coherency protocol, it can corrupt
data in I/O memory, which is generally a very bad thing to do, and can cause
total system failure or just plain undefined behavior.
These bugs are extremely unlikely, but the severity is of such magnitude, and
the fix so simple that I think fixing them immediately is justified. Also,
they are nearly impossible to debug.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 01:32:29 +07:00
|
|
|
/*
|
|
|
|
* For PTEs and PDEs, we must clear the P-bit first when clearing a page table
|
|
|
|
* entry, so clear the bottom half first and enforce ordering with a compiler
|
|
|
|
* barrier.
|
|
|
|
*/
|
2008-03-23 15:03:10 +07:00
|
|
|
static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep)
|
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case
Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug,
so use the same fix for both. Turns out pmd_clear had it as well, but pgds
are not affected.
The problem is rather intricate. Page table entries in PAE mode are 64-bits
wide, but the only atomic 8-byte write operation available in 32-bit mode is
cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can
happen that the processor may prefetch entries into the TLB in the middle of an
operation which clears a page table entry. So one must always clear the P-bit
in the low word of the page table entry first when clearing it.
Since the sequence *ptep = __pte(0) leaves the order of the write dependent on
the compiler, it must be coded explicitly as a clear of the low word followed
by a clear of the high word. Further, there must be a write memory barrier
here to enforce proper ordering by the compiler (and, in the future, by the
processor as well).
On > 4GB memory machines, the implementation of pte_clear for PAE was clearly
deficient, as it could leave virtual mappings of physical memory above 4GB
aliased to memory below 4GB in the TLB. The implementation of
ptep_get_and_clear_full has a similar bug, although not nearly as likely to
occur, since the mappings being cleared are in the process of being destroyed,
and should never be dereferenced again.
But, as luck would have it, it is possible to trigger bugs even without ever
dereferencing these bogus TLB mappings, even if the clear is followed fairly
soon after with a TLB flush or invalidation. The problem is that memory above
4GB may now be aliased into the first 4GB of memory, and in fact, may hit a
region of memory with non-memory semantics. These regions include AGP and PCI
space. As such, these memory regions are not cached by the processor. This
introduces the bug.
The processor can speculate memory operations, including memory writes, as long
as they are committed with the proper ordering. Speculating a memory write to
a linear address that has a bogus TLB mapping is possible. Normally, the
speculation is harmless. But for cached memory, it does leave the falsely
speculated cacheline unmodified, but in a dirty state. This cache line will be
eventually written back. If this cacheline happens to intersect a region of
memory that is not protected by the cache coherency protocol, it can corrupt
data in I/O memory, which is generally a very bad thing to do, and can cause
total system failure or just plain undefined behavior.
These bugs are extremely unlikely, but the severity is of such magnitude, and
the fix so simple that I think fixing them immediately is justified. Also,
they are nearly impossible to debug.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 01:32:29 +07:00
|
|
|
{
|
|
|
|
ptep->pte_low = 0;
|
|
|
|
smp_wmb();
|
|
|
|
ptep->pte_high = 0;
|
|
|
|
}
|
|
|
|
|
2007-05-03 00:27:13 +07:00
|
|
|
static inline void native_pmd_clear(pmd_t *pmd)
|
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case
Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug,
so use the same fix for both. Turns out pmd_clear had it as well, but pgds
are not affected.
The problem is rather intricate. Page table entries in PAE mode are 64-bits
wide, but the only atomic 8-byte write operation available in 32-bit mode is
cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can
happen that the processor may prefetch entries into the TLB in the middle of an
operation which clears a page table entry. So one must always clear the P-bit
in the low word of the page table entry first when clearing it.
Since the sequence *ptep = __pte(0) leaves the order of the write dependent on
the compiler, it must be coded explicitly as a clear of the low word followed
by a clear of the high word. Further, there must be a write memory barrier
here to enforce proper ordering by the compiler (and, in the future, by the
processor as well).
On > 4GB memory machines, the implementation of pte_clear for PAE was clearly
deficient, as it could leave virtual mappings of physical memory above 4GB
aliased to memory below 4GB in the TLB. The implementation of
ptep_get_and_clear_full has a similar bug, although not nearly as likely to
occur, since the mappings being cleared are in the process of being destroyed,
and should never be dereferenced again.
But, as luck would have it, it is possible to trigger bugs even without ever
dereferencing these bogus TLB mappings, even if the clear is followed fairly
soon after with a TLB flush or invalidation. The problem is that memory above
4GB may now be aliased into the first 4GB of memory, and in fact, may hit a
region of memory with non-memory semantics. These regions include AGP and PCI
space. As such, these memory regions are not cached by the processor. This
introduces the bug.
The processor can speculate memory operations, including memory writes, as long
as they are committed with the proper ordering. Speculating a memory write to
a linear address that has a bogus TLB mapping is possible. Normally, the
speculation is harmless. But for cached memory, it does leave the falsely
speculated cacheline unmodified, but in a dirty state. This cache line will be
eventually written back. If this cacheline happens to intersect a region of
memory that is not protected by the cache coherency protocol, it can corrupt
data in I/O memory, which is generally a very bad thing to do, and can cause
total system failure or just plain undefined behavior.
These bugs are extremely unlikely, but the severity is of such magnitude, and
the fix so simple that I think fixing them immediately is justified. Also,
they are nearly impossible to debug.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 01:32:29 +07:00
|
|
|
{
|
|
|
|
u32 *tmp = (u32 *)pmd;
|
|
|
|
*tmp = 0;
|
|
|
|
smp_wmb();
|
|
|
|
*(tmp + 1) = 0;
|
|
|
|
}
|
2007-05-03 00:27:13 +07:00
|
|
|
|
2017-02-25 05:57:02 +07:00
|
|
|
static inline void native_pud_clear(pud_t *pudp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2008-01-30 19:34:11 +07:00
|
|
|
static inline void pud_clear(pud_t *pudp)
|
|
|
|
{
|
|
|
|
set_pud(pudp, __pud(0));
|
|
|
|
|
|
|
|
/*
|
2008-02-04 22:48:02 +07:00
|
|
|
* According to Intel App note "TLBs, Paging-Structure Caches,
|
|
|
|
* and Their Invalidation", April 2007, document 317080-001,
|
|
|
|
* section 8.1: in PAE mode we explicitly have to flush the
|
|
|
|
* TLB via cr3 if the top-level pgd is changed...
|
2008-01-30 19:34:11 +07:00
|
|
|
*
|
2011-03-16 10:37:29 +07:00
|
|
|
* Currently all places where pud_clear() is called either have
|
|
|
|
* flush_tlb_mm() followed or don't need TLB flush (x86_64 code or
|
|
|
|
* pud_clear_bad()), so we don't need TLB flush here.
|
2008-01-30 19:34:11 +07:00
|
|
|
*/
|
|
|
|
}
|
2006-12-07 08:14:08 +07:00
|
|
|
|
2007-05-03 00:27:19 +07:00
|
|
|
#ifdef CONFIG_SMP
|
2007-05-03 00:27:13 +07:00
|
|
|
static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
pte_t res;
|
|
|
|
|
2018-08-21 22:37:55 +07:00
|
|
|
res.pte = (pteval_t)arch_atomic64_xchg((atomic64_t *)ptep, 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
return res;
|
|
|
|
}
|
2007-05-03 00:27:19 +07:00
|
|
|
#else
|
|
|
|
#define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
|
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-01-14 06:47:01 +07:00
|
|
|
union split_pmd {
|
|
|
|
struct {
|
|
|
|
u32 pmd_low;
|
|
|
|
u32 pmd_high;
|
|
|
|
};
|
|
|
|
pmd_t pmd;
|
|
|
|
};
|
2018-02-01 07:18:13 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
2011-01-14 06:47:01 +07:00
|
|
|
static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
union split_pmd res, *orig = (union split_pmd *)pmdp;
|
|
|
|
|
|
|
|
/* xchg acts as a barrier before setting of the high bits */
|
|
|
|
res.pmd_low = xchg(&orig->pmd_low, 0);
|
|
|
|
res.pmd_high = orig->pmd_high;
|
|
|
|
orig->pmd_high = 0;
|
|
|
|
|
|
|
|
return res.pmd;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
|
|
|
|
#endif
|
|
|
|
|
2018-02-01 07:18:13 +07:00
|
|
|
#ifndef pmdp_establish
|
|
|
|
#define pmdp_establish pmdp_establish
|
|
|
|
static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp, pmd_t pmd)
|
|
|
|
{
|
|
|
|
pmd_t old;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If pmd has present bit cleared we can get away without expensive
|
|
|
|
* cmpxchg64: we can update pmdp half-by-half without racing with
|
|
|
|
* anybody.
|
|
|
|
*/
|
|
|
|
if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
|
|
|
|
union split_pmd old, new, *ptr;
|
|
|
|
|
|
|
|
ptr = (union split_pmd *)pmdp;
|
|
|
|
|
|
|
|
new.pmd = pmd;
|
|
|
|
|
|
|
|
/* xchg acts as a barrier before setting of the high bits */
|
|
|
|
old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
|
|
|
|
old.pmd_high = ptr->pmd_high;
|
|
|
|
ptr->pmd_high = new.pmd_high;
|
|
|
|
return old.pmd;
|
|
|
|
}
|
|
|
|
|
|
|
|
do {
|
|
|
|
old = *pmdp;
|
|
|
|
} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
|
|
|
|
|
|
|
|
return old;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-02-25 05:57:02 +07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
union split_pud {
|
|
|
|
struct {
|
|
|
|
u32 pud_low;
|
|
|
|
u32 pud_high;
|
|
|
|
};
|
|
|
|
pud_t pud;
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
|
|
|
|
{
|
|
|
|
union split_pud res, *orig = (union split_pud *)pudp;
|
|
|
|
|
2018-07-18 16:40:59 +07:00
|
|
|
#ifdef CONFIG_PAGE_TABLE_ISOLATION
|
|
|
|
pti_set_user_pgtbl(&pudp->p4d.pgd, __pgd(0));
|
|
|
|
#endif
|
|
|
|
|
2017-02-25 05:57:02 +07:00
|
|
|
/* xchg acts as a barrier before setting of the high bits */
|
|
|
|
res.pud_low = xchg(&orig->pud_low, 0);
|
|
|
|
res.pud_high = orig->pud_high;
|
|
|
|
orig->pud_high = 0;
|
|
|
|
|
|
|
|
return res.pud;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
#define native_pudp_get_and_clear(xp) native_local_pudp_get_and_clear(xp)
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* Encode and de-code a swap entry */
|
2018-06-22 22:39:33 +07:00
|
|
|
#define SWP_TYPE_BITS 5
|
|
|
|
|
|
|
|
#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
|
|
|
|
|
|
|
|
/* We always extract/encode the offset by shifting it all the way up, and then down again */
|
|
|
|
#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
|
|
|
|
|
2008-12-16 18:35:24 +07:00
|
|
|
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
|
2005-04-17 05:20:36 +07:00
|
|
|
#define __swp_type(x) (((x).val) & 0x1f)
|
|
|
|
#define __swp_offset(x) ((x).val >> 5)
|
|
|
|
#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
|
2018-06-22 22:39:33 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Normally, __swp_entry() converts from arch-independent swp_entry_t to
|
|
|
|
* arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result
|
|
|
|
* to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the
|
|
|
|
* whole 64 bits. Thus, we shift the "real" arch-dependent conversion to
|
|
|
|
* __swp_entry_to_pte() through the following helper macro based on 64bit
|
|
|
|
* __swp_entry().
|
|
|
|
*/
|
|
|
|
#define __swp_pteval_entry(type, offset) ((pteval_t) { \
|
|
|
|
(~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
|
|
|
|
| ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) })
|
|
|
|
|
|
|
|
#define __swp_entry_to_pte(x) ((pte_t){ .pte = \
|
|
|
|
__swp_pteval_entry(__swp_type(x), __swp_offset(x)) })
|
|
|
|
/*
|
|
|
|
* Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent
|
|
|
|
* swp_entry_t, but also has to convert it from 64bit to the 32bit
|
|
|
|
* intermediate representation, using the following macros based on 64bit
|
|
|
|
* __swp_type() and __swp_offset().
|
|
|
|
*/
|
|
|
|
#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))
|
|
|
|
#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))
|
|
|
|
|
|
|
|
#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \
|
|
|
|
__pteval_swp_offset(pte)))
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2018-06-14 05:48:24 +07:00
|
|
|
#include <asm/pgtable-invert.h>
|
|
|
|
|
2008-10-23 12:26:29 +07:00
|
|
|
#endif /* _ASM_X86_PGTABLE_3LEVEL_H */
|