linux_dsm_epyc7002/arch/sparc/mm/tsb.c

611 lines
17 KiB
C
Raw Normal View History

/* arch/sparc64/mm/tsb.c
*
* Copyright (C) 2006, 2008 David S. Miller <davem@davemloft.net>
*/
#include <linux/kernel.h>
#include <linux/preempt.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 15:04:11 +07:00
#include <linux/slab.h>
#include <linux/mm_types.h>
#include <asm/page.h>
#include <asm/pgtable.h>
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
#include <asm/mmu_context.h>
#include <asm/setup.h>
#include <asm/tsb.h>
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
#include <asm/tlb.h>
#include <asm/oplib.h>
extern struct tsb swapper_tsb[KERNEL_TSB_NENTRIES];
static inline unsigned long tsb_hash(unsigned long vaddr, unsigned long hash_shift, unsigned long nentries)
{
vaddr >>= hash_shift;
return vaddr & (nentries - 1);
}
static inline int tag_compare(unsigned long tag, unsigned long vaddr)
{
return (tag == (vaddr >> 22));
}
static void flush_tsb_kernel_range_scan(unsigned long start, unsigned long end)
{
unsigned long idx;
for (idx = 0; idx < KERNEL_TSB_NENTRIES; idx++) {
struct tsb *ent = &swapper_tsb[idx];
unsigned long match = idx << 13;
match |= (ent->tag << 22);
if (match >= start && match < end)
ent->tag = (1UL << TSB_TAG_INVALID_BIT);
}
}
/* TSB flushes need only occur on the processor initiating the address
* space modification, not on each cpu the address space has run on.
* Only the TLB flush needs that treatment.
*/
void flush_tsb_kernel_range(unsigned long start, unsigned long end)
{
unsigned long v;
if ((end - start) >> PAGE_SHIFT >= 2 * KERNEL_TSB_NENTRIES)
return flush_tsb_kernel_range_scan(start, end);
for (v = start; v < end; v += PAGE_SIZE) {
unsigned long hash = tsb_hash(v, PAGE_SHIFT,
KERNEL_TSB_NENTRIES);
struct tsb *ent = &swapper_tsb[hash];
if (tag_compare(ent->tag, v))
ent->tag = (1UL << TSB_TAG_INVALID_BIT);
}
}
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
static void __flush_tsb_one_entry(unsigned long tsb, unsigned long v,
unsigned long hash_shift,
unsigned long nentries)
{
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
unsigned long tag, ent, hash;
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
v &= ~0x1UL;
hash = tsb_hash(v, hash_shift, nentries);
ent = tsb + (hash * sizeof(struct tsb));
tag = (v >> 22UL);
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
tsb_flush(ent, tag);
}
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
static void __flush_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
unsigned long tsb, unsigned long nentries)
{
unsigned long i;
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
for (i = 0; i < tb->tlb_nr; i++)
__flush_tsb_one_entry(tsb, tb->vaddrs[i], hash_shift, nentries);
}
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
static void __flush_huge_tsb_one_entry(unsigned long tsb, unsigned long v,
unsigned long hash_shift,
unsigned long nentries,
unsigned int hugepage_shift)
{
unsigned int hpage_entries;
unsigned int i;
hpage_entries = 1 << (hugepage_shift - hash_shift);
for (i = 0; i < hpage_entries; i++)
__flush_tsb_one_entry(tsb, v + (i << hash_shift), hash_shift,
nentries);
}
static void __flush_huge_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
unsigned long tsb, unsigned long nentries,
unsigned int hugepage_shift)
{
unsigned long i;
for (i = 0; i < tb->tlb_nr; i++)
__flush_huge_tsb_one_entry(tsb, tb->vaddrs[i], hash_shift,
nentries, hugepage_shift);
}
#endif
void flush_tsb_user(struct tlb_batch *tb)
{
struct mm_struct *mm = tb->mm;
unsigned long nentries, base, flags;
spin_lock_irqsave(&mm->context.lock, flags);
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
if (tb->hugepage_shift < REAL_HPAGE_SHIFT) {
base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb;
nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
base = __pa(base);
if (tb->hugepage_shift == PAGE_SHIFT)
__flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
#if defined(CONFIG_HUGETLB_PAGE)
else
__flush_huge_tsb_one(tb, PAGE_SHIFT, base, nentries,
tb->hugepage_shift);
#endif
}
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
else if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
base = (unsigned long) mm->context.tsb_block[MM_TSB_HUGE].tsb;
nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
base = __pa(base);
__flush_huge_tsb_one(tb, REAL_HPAGE_SHIFT, base, nentries,
tb->hugepage_shift);
}
#endif
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
spin_unlock_irqrestore(&mm->context.lock, flags);
}
void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr,
unsigned int hugepage_shift)
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
{
unsigned long nentries, base, flags;
spin_lock_irqsave(&mm->context.lock, flags);
if (hugepage_shift < REAL_HPAGE_SHIFT) {
base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb;
nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
base = __pa(base);
if (hugepage_shift == PAGE_SHIFT)
__flush_tsb_one_entry(base, vaddr, PAGE_SHIFT,
nentries);
#if defined(CONFIG_HUGETLB_PAGE)
else
__flush_huge_tsb_one_entry(base, vaddr, PAGE_SHIFT,
nentries, hugepage_shift);
#endif
}
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
else if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
base = (unsigned long) mm->context.tsb_block[MM_TSB_HUGE].tsb;
nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
base = __pa(base);
__flush_huge_tsb_one_entry(base, vaddr, REAL_HPAGE_SHIFT,
nentries, hugepage_shift);
sparc64: Fix race in TLB batch processing. As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-04-20 04:26:26 +07:00
}
#endif
spin_unlock_irqrestore(&mm->context.lock, flags);
}
#define HV_PGSZ_IDX_BASE HV_PGSZ_IDX_8K
#define HV_PGSZ_MASK_BASE HV_PGSZ_MASK_8K
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
#define HV_PGSZ_IDX_HUGE HV_PGSZ_IDX_4MB
#define HV_PGSZ_MASK_HUGE HV_PGSZ_MASK_4MB
#endif
static void setup_tsb_params(struct mm_struct *mm, unsigned long tsb_idx, unsigned long tsb_bytes)
{
unsigned long tsb_reg, base, tsb_paddr;
unsigned long page_sz, tte;
mm->context.tsb_block[tsb_idx].tsb_nentries =
tsb_bytes / sizeof(struct tsb);
switch (tsb_idx) {
case MM_TSB_BASE:
base = TSBMAP_8K_BASE;
break;
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
case MM_TSB_HUGE:
base = TSBMAP_4M_BASE;
break;
#endif
default:
BUG();
}
tte = pgprot_val(PAGE_KERNEL_LOCKED);
tsb_paddr = __pa(mm->context.tsb_block[tsb_idx].tsb);
BUG_ON(tsb_paddr & (tsb_bytes - 1UL));
/* Use the smallest page size that can map the whole TSB
* in one TLB entry.
*/
switch (tsb_bytes) {
case 8192 << 0:
tsb_reg = 0x0UL;
#ifdef DCACHE_ALIASING_POSSIBLE
base += (tsb_paddr & 8192);
#endif
page_sz = 8192;
break;
case 8192 << 1:
tsb_reg = 0x1UL;
page_sz = 64 * 1024;
break;
case 8192 << 2:
tsb_reg = 0x2UL;
page_sz = 64 * 1024;
break;
case 8192 << 3:
tsb_reg = 0x3UL;
page_sz = 64 * 1024;
break;
case 8192 << 4:
tsb_reg = 0x4UL;
page_sz = 512 * 1024;
break;
case 8192 << 5:
tsb_reg = 0x5UL;
page_sz = 512 * 1024;
break;
case 8192 << 6:
tsb_reg = 0x6UL;
page_sz = 512 * 1024;
break;
case 8192 << 7:
tsb_reg = 0x7UL;
page_sz = 4 * 1024 * 1024;
break;
default:
printk(KERN_ERR "TSB[%s:%d]: Impossible TSB size %lu, killing process.\n",
current->comm, current->pid, tsb_bytes);
do_exit(SIGSEGV);
}
tte |= pte_sz_bits(page_sz);
if (tlb_type == cheetah_plus || tlb_type == hypervisor) {
/* Physical mapping, no locked TLB entry for TSB. */
tsb_reg |= tsb_paddr;
mm->context.tsb_block[tsb_idx].tsb_reg_val = tsb_reg;
mm->context.tsb_block[tsb_idx].tsb_map_vaddr = 0;
mm->context.tsb_block[tsb_idx].tsb_map_pte = 0;
} else {
tsb_reg |= base;
tsb_reg |= (tsb_paddr & (page_sz - 1UL));
tte |= (tsb_paddr & ~(page_sz - 1UL));
mm->context.tsb_block[tsb_idx].tsb_reg_val = tsb_reg;
mm->context.tsb_block[tsb_idx].tsb_map_vaddr = base;
mm->context.tsb_block[tsb_idx].tsb_map_pte = tte;
}
/* Setup the Hypervisor TSB descriptor. */
if (tlb_type == hypervisor) {
struct hv_tsb_descr *hp = &mm->context.tsb_descr[tsb_idx];
switch (tsb_idx) {
case MM_TSB_BASE:
hp->pgsz_idx = HV_PGSZ_IDX_BASE;
break;
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
case MM_TSB_HUGE:
hp->pgsz_idx = HV_PGSZ_IDX_HUGE;
break;
#endif
default:
BUG();
}
hp->assoc = 1;
hp->num_ttes = tsb_bytes / 16;
hp->ctx_idx = 0;
switch (tsb_idx) {
case MM_TSB_BASE:
hp->pgsz_mask = HV_PGSZ_MASK_BASE;
break;
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
case MM_TSB_HUGE:
hp->pgsz_mask = HV_PGSZ_MASK_HUGE;
break;
#endif
default:
BUG();
}
hp->tsb_base = tsb_paddr;
hp->resv = 0;
}
}
struct kmem_cache *pgtable_cache __read_mostly;
static struct kmem_cache *tsb_caches[8] __read_mostly;
static const char *tsb_cache_names[8] = {
"tsb_8KB",
"tsb_16KB",
"tsb_32KB",
"tsb_64KB",
"tsb_128KB",
"tsb_256KB",
"tsb_512KB",
"tsb_1MB",
};
void __init pgtable_cache_init(void)
{
unsigned long i;
pgtable_cache = kmem_cache_create("pgtable_cache",
PAGE_SIZE, PAGE_SIZE,
0,
_clear_page);
if (!pgtable_cache) {
prom_printf("pgtable_cache_init(): Could not create!\n");
prom_halt();
}
for (i = 0; i < ARRAY_SIZE(tsb_cache_names); i++) {
unsigned long size = 8192 << i;
const char *name = tsb_cache_names[i];
tsb_caches[i] = kmem_cache_create(name,
size, size,
0, NULL);
if (!tsb_caches[i]) {
prom_printf("Could not create %s cache\n", name);
prom_halt();
}
}
}
int sysctl_tsb_ratio = -2;
static unsigned long tsb_size_to_rss_limit(unsigned long new_size)
{
unsigned long num_ents = (new_size / sizeof(struct tsb));
if (sysctl_tsb_ratio < 0)
return num_ents - (num_ents >> -sysctl_tsb_ratio);
else
return num_ents + (num_ents >> sysctl_tsb_ratio);
}
/* When the RSS of an address space exceeds tsb_rss_limit for a TSB,
* do_sparc64_fault() invokes this routine to try and grow it.
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
*
* When we reach the maximum TSB size supported, we stick ~0UL into
* tsb_rss_limit for that TSB so the grow checks in do_sparc64_fault()
* will not trigger any longer.
*
* The TSB can be anywhere from 8K to 1MB in size, in increasing powers
* of two. The TSB must be aligned to it's size, so f.e. a 512K TSB
* must be 512K aligned. It also must be physically contiguous, so we
* cannot use vmalloc().
*
* The idea here is to grow the TSB when the RSS of the process approaches
* the number of entries that the current TSB can hold at once. Currently,
* we trigger when the RSS hits 3/4 of the TSB capacity.
*/
void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss)
{
unsigned long max_tsb_size = 1 * 1024 * 1024;
unsigned long new_size, old_size, flags;
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
struct tsb *old_tsb, *new_tsb;
unsigned long new_cache_index, old_cache_index;
unsigned long new_rss_limit;
gfp_t gfp_flags;
if (max_tsb_size > (PAGE_SIZE << MAX_ORDER))
max_tsb_size = (PAGE_SIZE << MAX_ORDER);
new_cache_index = 0;
for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) {
new_rss_limit = tsb_size_to_rss_limit(new_size);
if (new_rss_limit > rss)
break;
new_cache_index++;
}
if (new_size == max_tsb_size)
new_rss_limit = ~0UL;
retry_tsb_alloc:
gfp_flags = GFP_KERNEL;
if (new_size > (PAGE_SIZE * 2))
gfp_flags |= __GFP_NOWARN | __GFP_NORETRY;
new_tsb = kmem_cache_alloc_node(tsb_caches[new_cache_index],
gfp_flags, numa_node_id());
if (unlikely(!new_tsb)) {
/* Not being able to fork due to a high-order TSB
* allocation failure is very bad behavior. Just back
* down to a 0-order allocation and force no TSB
* growing for this address space.
*/
if (mm->context.tsb_block[tsb_index].tsb == NULL &&
new_cache_index > 0) {
new_cache_index = 0;
new_size = 8192;
new_rss_limit = ~0UL;
goto retry_tsb_alloc;
}
/* If we failed on a TSB grow, we are under serious
* memory pressure so don't try to grow any more.
*/
if (mm->context.tsb_block[tsb_index].tsb != NULL)
mm->context.tsb_block[tsb_index].tsb_rss_limit = ~0UL;
return;
}
/* Mark all tags as invalid. */
tsb_init(new_tsb, new_size);
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
/* Ok, we are about to commit the changes. If we are
* growing an existing TSB the locking is very tricky,
* so WATCH OUT!
*
* We have to hold mm->context.lock while committing to the
* new TSB, this synchronizes us with processors in
* flush_tsb_user() and switch_mm() for this address space.
*
* But even with that lock held, processors run asynchronously
* accessing the old TSB via TLB miss handling. This is OK
* because those actions are just propagating state from the
* Linux page tables into the TSB, page table mappings are not
* being changed. If a real fault occurs, the processor will
* synchronize with us when it hits flush_tsb_user(), this is
* also true for the case where vmscan is modifying the page
* tables. The only thing we need to be careful with is to
* skip any locked TSB entries during copy_tsb().
*
* When we finish committing to the new TSB, we have to drop
* the lock and ask all other cpus running this address space
* to run tsb_context_switch() to see the new TSB table.
*/
spin_lock_irqsave(&mm->context.lock, flags);
old_tsb = mm->context.tsb_block[tsb_index].tsb;
old_cache_index =
(mm->context.tsb_block[tsb_index].tsb_reg_val & 0x7UL);
old_size = (mm->context.tsb_block[tsb_index].tsb_nentries *
sizeof(struct tsb));
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
/* Handle multiple threads trying to grow the TSB at the same time.
* One will get in here first, and bump the size and the RSS limit.
* The others will get in here next and hit this check.
*/
if (unlikely(old_tsb &&
(rss < mm->context.tsb_block[tsb_index].tsb_rss_limit))) {
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
spin_unlock_irqrestore(&mm->context.lock, flags);
kmem_cache_free(tsb_caches[new_cache_index], new_tsb);
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
return;
}
mm->context.tsb_block[tsb_index].tsb_rss_limit = new_rss_limit;
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
if (old_tsb) {
extern void copy_tsb(unsigned long old_tsb_base,
unsigned long old_tsb_size,
unsigned long new_tsb_base,
unsigned long new_tsb_size);
unsigned long old_tsb_base = (unsigned long) old_tsb;
unsigned long new_tsb_base = (unsigned long) new_tsb;
if (tlb_type == cheetah_plus || tlb_type == hypervisor) {
old_tsb_base = __pa(old_tsb_base);
new_tsb_base = __pa(new_tsb_base);
}
copy_tsb(old_tsb_base, old_size, new_tsb_base, new_size);
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
}
mm->context.tsb_block[tsb_index].tsb = new_tsb;
setup_tsb_params(mm, tsb_index, new_size);
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
spin_unlock_irqrestore(&mm->context.lock, flags);
/* If old_tsb is NULL, we're being invoked for the first time
* from init_new_context().
*/
if (old_tsb) {
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
/* Reload it on the local cpu. */
tsb_context_switch(mm);
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
/* Now force other processors to do the same. */
preempt_disable();
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
smp_tsb_sync(mm);
preempt_enable();
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
/* Now it is safe to free the old tsb. */
kmem_cache_free(tsb_caches[old_cache_index], old_tsb);
}
}
int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
{
sparc64 mm: Fix more TSB sizing issues Commit af1b1a9b36b8 ("sparc64 mm: Fix base TSB sizing when hugetlb pages are used") addressed the difference between hugetlb and THP pages when computing TSB sizes. The following additional issues were also discovered while working with the code. In order to save memory, THP makes use of a huge zero page. This huge zero page does not count against a task's RSS, but it does consume TSB entries. This is similar to hugetlb pages. Therefore, count huge zero page entries in hugetlb_pte_count. Accounting of THP pages is done in the routine set_pmd_at(). Unfortunately, this does not catch the case where a THP page is split. To handle this case, decrement the count in pmdp_invalidate(). pmdp_invalidate is only called when splitting a THP. However, 'sanity checks' are added in case it is ever called for other purposes. A more general issue exists with HPAGE_SIZE accounting. hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages. This value is used to size the TSB for HPAGE_SIZE pages. However, each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages. The TSB contains an entry for each REAL_HPAGE_SIZE page. Therefore, the number of REAL_HPAGE_SIZE pages should be used to size the huge page TSB. A new compile time constant REAL_HPAGE_PER_HPAGE is used to multiply hugetlb_pte_count before sizing the TSB. Changes from V1 - Fixed build issue if hugetlb or THP not configured Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 03:48:19 +07:00
unsigned long mm_rss = get_mm_rss(mm);
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
sparc64 mm: Fix more TSB sizing issues Commit af1b1a9b36b8 ("sparc64 mm: Fix base TSB sizing when hugetlb pages are used") addressed the difference between hugetlb and THP pages when computing TSB sizes. The following additional issues were also discovered while working with the code. In order to save memory, THP makes use of a huge zero page. This huge zero page does not count against a task's RSS, but it does consume TSB entries. This is similar to hugetlb pages. Therefore, count huge zero page entries in hugetlb_pte_count. Accounting of THP pages is done in the routine set_pmd_at(). Unfortunately, this does not catch the case where a THP page is split. To handle this case, decrement the count in pmdp_invalidate(). pmdp_invalidate is only called when splitting a THP. However, 'sanity checks' are added in case it is ever called for other purposes. A more general issue exists with HPAGE_SIZE accounting. hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages. This value is used to size the TSB for HPAGE_SIZE pages. However, each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages. The TSB contains an entry for each REAL_HPAGE_SIZE page. Therefore, the number of REAL_HPAGE_SIZE pages should be used to size the huge page TSB. A new compile time constant REAL_HPAGE_PER_HPAGE is used to multiply hugetlb_pte_count before sizing the TSB. Changes from V1 - Fixed build issue if hugetlb or THP not configured Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 03:48:19 +07:00
unsigned long saved_hugetlb_pte_count;
unsigned long saved_thp_pte_count;
#endif
unsigned int i;
spin_lock_init(&mm->context.lock);
mm->context.sparc64_ctx_val = 0UL;
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
/* We reset them to zero because the fork() page copying
* will re-increment the counters as the parent PTEs are
* copied into the child address space.
*/
sparc64 mm: Fix more TSB sizing issues Commit af1b1a9b36b8 ("sparc64 mm: Fix base TSB sizing when hugetlb pages are used") addressed the difference between hugetlb and THP pages when computing TSB sizes. The following additional issues were also discovered while working with the code. In order to save memory, THP makes use of a huge zero page. This huge zero page does not count against a task's RSS, but it does consume TSB entries. This is similar to hugetlb pages. Therefore, count huge zero page entries in hugetlb_pte_count. Accounting of THP pages is done in the routine set_pmd_at(). Unfortunately, this does not catch the case where a THP page is split. To handle this case, decrement the count in pmdp_invalidate(). pmdp_invalidate is only called when splitting a THP. However, 'sanity checks' are added in case it is ever called for other purposes. A more general issue exists with HPAGE_SIZE accounting. hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages. This value is used to size the TSB for HPAGE_SIZE pages. However, each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages. The TSB contains an entry for each REAL_HPAGE_SIZE page. Therefore, the number of REAL_HPAGE_SIZE pages should be used to size the huge page TSB. A new compile time constant REAL_HPAGE_PER_HPAGE is used to multiply hugetlb_pte_count before sizing the TSB. Changes from V1 - Fixed build issue if hugetlb or THP not configured Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 03:48:19 +07:00
saved_hugetlb_pte_count = mm->context.hugetlb_pte_count;
saved_thp_pte_count = mm->context.thp_pte_count;
mm->context.hugetlb_pte_count = 0;
mm->context.thp_pte_count = 0;
sparc64 mm: Fix more TSB sizing issues Commit af1b1a9b36b8 ("sparc64 mm: Fix base TSB sizing when hugetlb pages are used") addressed the difference between hugetlb and THP pages when computing TSB sizes. The following additional issues were also discovered while working with the code. In order to save memory, THP makes use of a huge zero page. This huge zero page does not count against a task's RSS, but it does consume TSB entries. This is similar to hugetlb pages. Therefore, count huge zero page entries in hugetlb_pte_count. Accounting of THP pages is done in the routine set_pmd_at(). Unfortunately, this does not catch the case where a THP page is split. To handle this case, decrement the count in pmdp_invalidate(). pmdp_invalidate is only called when splitting a THP. However, 'sanity checks' are added in case it is ever called for other purposes. A more general issue exists with HPAGE_SIZE accounting. hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages. This value is used to size the TSB for HPAGE_SIZE pages. However, each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages. The TSB contains an entry for each REAL_HPAGE_SIZE page. Therefore, the number of REAL_HPAGE_SIZE pages should be used to size the huge page TSB. A new compile time constant REAL_HPAGE_PER_HPAGE is used to multiply hugetlb_pte_count before sizing the TSB. Changes from V1 - Fixed build issue if hugetlb or THP not configured Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 03:48:19 +07:00
mm_rss -= saved_thp_pte_count * (HPAGE_SIZE / PAGE_SIZE);
#endif
/* copy_mm() copies over the parent's mm_struct before calling
* us, so we need to zero out the TSB pointer or else tsb_grow()
* will be confused and think there is an older TSB to free up.
*/
for (i = 0; i < MM_NUM_TSBS; i++)
mm->context.tsb_block[i].tsb = NULL;
[SPARC64]: Fix and re-enable dynamic TSB sizing. This is good for up to %50 performance improvement of some test cases. The problem has been the race conditions, and hopefully I've plugged them all up here. 1) There was a serious race in switch_mm() wrt. lazy TLB switching to and from kernel threads. We could erroneously skip a tsb_context_switch() and thus use a stale TSB across a TSB grow event. There is a big comment now in that function describing exactly how it can happen. 2) All code paths that do something with the TSB need to be guarded with the mm->context.lock spinlock. This makes page table flushing paths properly synchronize with both TSB growing and TLB context changes. 3) TSB growing events are moved to the end of successful fault processing. Previously it was in update_mmu_cache() but that is deadlock prone. At the end of do_sparc64_fault() we hold no spinlocks that could deadlock the TSB grow sequence. We also have dropped the address space semaphore. While we're here, add prefetching to the copy_tsb() routine and put it in assembler into the tsb.S file. This piece of code is quite time critical. There are some small negative side effects to this code which can be improved upon. In particular we grab the mm->context.lock even for the tsb insert done by update_mmu_cache() now and that's a bit excessive. We can get rid of that locking, and the same lock taking in flush_tsb_user(), by disabling PSTATE_IE around the whole operation including the capturing of the tsb pointer and tsb_nentries value. That would work because anyone growing the TSB won't free up the old TSB until all cpus respond to the TSB change cross call. I'm not quite so confident in that optimization to put it in right now, but eventually we might be able to and the description is here for reference. This code seems very solid now. It passes several parallel GCC bootstrap builds, and our favorite "nut cruncher" stress test which is a full "make -j8192" build of a "make allmodconfig" kernel. That puts about 256 processes on each cpu's run queue, makes lots of process cpu migrations occur, causes lots of page table and TLB flushing activity, incurs many context version number changes, and it swaps the machine real far out to disk even though there is 16GB of ram on this test system. :-) Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-16 17:02:32 +07:00
/* If this is fork, inherit the parent's TSB size. We would
* grow it to that size on the first page fault anyways.
*/
sparc64 mm: Fix more TSB sizing issues Commit af1b1a9b36b8 ("sparc64 mm: Fix base TSB sizing when hugetlb pages are used") addressed the difference between hugetlb and THP pages when computing TSB sizes. The following additional issues were also discovered while working with the code. In order to save memory, THP makes use of a huge zero page. This huge zero page does not count against a task's RSS, but it does consume TSB entries. This is similar to hugetlb pages. Therefore, count huge zero page entries in hugetlb_pte_count. Accounting of THP pages is done in the routine set_pmd_at(). Unfortunately, this does not catch the case where a THP page is split. To handle this case, decrement the count in pmdp_invalidate(). pmdp_invalidate is only called when splitting a THP. However, 'sanity checks' are added in case it is ever called for other purposes. A more general issue exists with HPAGE_SIZE accounting. hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages. This value is used to size the TSB for HPAGE_SIZE pages. However, each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages. The TSB contains an entry for each REAL_HPAGE_SIZE page. Therefore, the number of REAL_HPAGE_SIZE pages should be used to size the huge page TSB. A new compile time constant REAL_HPAGE_PER_HPAGE is used to multiply hugetlb_pte_count before sizing the TSB. Changes from V1 - Fixed build issue if hugetlb or THP not configured Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 03:48:19 +07:00
tsb_grow(mm, MM_TSB_BASE, mm_rss);
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
sparc64 mm: Fix more TSB sizing issues Commit af1b1a9b36b8 ("sparc64 mm: Fix base TSB sizing when hugetlb pages are used") addressed the difference between hugetlb and THP pages when computing TSB sizes. The following additional issues were also discovered while working with the code. In order to save memory, THP makes use of a huge zero page. This huge zero page does not count against a task's RSS, but it does consume TSB entries. This is similar to hugetlb pages. Therefore, count huge zero page entries in hugetlb_pte_count. Accounting of THP pages is done in the routine set_pmd_at(). Unfortunately, this does not catch the case where a THP page is split. To handle this case, decrement the count in pmdp_invalidate(). pmdp_invalidate is only called when splitting a THP. However, 'sanity checks' are added in case it is ever called for other purposes. A more general issue exists with HPAGE_SIZE accounting. hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages. This value is used to size the TSB for HPAGE_SIZE pages. However, each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages. The TSB contains an entry for each REAL_HPAGE_SIZE page. Therefore, the number of REAL_HPAGE_SIZE pages should be used to size the huge page TSB. A new compile time constant REAL_HPAGE_PER_HPAGE is used to multiply hugetlb_pte_count before sizing the TSB. Changes from V1 - Fixed build issue if hugetlb or THP not configured Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 03:48:19 +07:00
if (unlikely(saved_hugetlb_pte_count + saved_thp_pte_count))
tsb_grow(mm, MM_TSB_HUGE,
(saved_hugetlb_pte_count + saved_thp_pte_count) *
REAL_HPAGE_PER_HPAGE);
#endif
if (unlikely(!mm->context.tsb_block[MM_TSB_BASE].tsb))
return -ENOMEM;
return 0;
}
static void tsb_destroy_one(struct tsb_config *tp)
{
unsigned long cache_index;
if (!tp->tsb)
return;
cache_index = tp->tsb_reg_val & 0x7UL;
kmem_cache_free(tsb_caches[cache_index], tp->tsb);
tp->tsb = NULL;
tp->tsb_reg_val = 0UL;
}
void destroy_context(struct mm_struct *mm)
{
unsigned long flags, i;
for (i = 0; i < MM_NUM_TSBS; i++)
tsb_destroy_one(&mm->context.tsb_block[i]);
spin_lock_irqsave(&ctx_alloc_lock, flags);
if (CTX_VALID(mm->context)) {
unsigned long nr = CTX_NRBITS(mm->context);
mmu_context_bmap[nr>>6] &= ~(1UL << (nr & 63));
}
spin_unlock_irqrestore(&ctx_alloc_lock, flags);
}