We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
the caller. This is used for subsystem-specific allocators like skb_alloc.
To make skb_alloc node-aware we need similar routines for the node-aware slab
allocator, which this patch adds.
Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1:
[akpm@osdl.org: add module export]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When using numa=fake on non-NUMA hardware there is no benefit to having the
alien caches, and they consume much memory.
Add a kernel boot option to disable them.
Christoph sayeth "This is good to have even on large NUMA. The problem is
that the alien caches grow by the square of the size of the system in terms of
nodes."
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's an attempt towards doing away with lock_cpu_hotplug in the slab
subsystem. This approach also fixes a bug which shows up when cpus are
being offlined/onlined and slab caches are being tuned simultaneously.
http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2
The patch has been stress tested overnight on a 2 socket 4 core AMD box with
repeated cpu online and offline, while dbench and kernbench process are
running, and slab caches being tuned at the same time.
There were no lockdep warnings either. (This test on 2,6.18 as 2.6.19-rc
crashes at __drain_pages
http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )
The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
CPU_ONLINE (similar in approach as worqueue_mutex) . Slab code sensitive
to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
__cache_shrink) is already serialized with cache_chain_mutex. (This patch
lengthens cache_chain_mutex hold time at kmem_cache_destroy to cover this).
This patch also takes the cache_chain_sem at kmem_cache_shrink to protect
sanity of cpu_online_map at __cache_shrink, as viewed by slab.
(kmem_cache_shrink->__cache_shrink->drain_cpu_caches). But, really,
kmem_cache_shrink is used at just one place in the acpi subsystem! Do we
really need to keep kmem_cache_shrink at all?
Another note. Looks like a cpu hotplug event can send CPU_UP_CANCELED to
a registered subsystem even if the subsystem did not receive CPU_UP_PREPARE.
This could be due to a subsystem registered for notification earlier than
the current subsystem crapping out with NOTIFY_BAD. Badness can occur with
in the CPU_UP_CANCELED code path at slab if this happens (The same would
apply for workqueue.c as well). To overcome this, we might have to use either
a) a per subsystem flag and avoid handling of CPU_UP_CANCELED, or
b) Use a special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was
using in his experiments, or
c) Do not send CPU_UP_CANCELED to a subsystem which did not receive
CPU_UP_PREPARE.
I would prefer c).
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN, some
debug flags should be disabled which depend on BYTES_PER_WORD alignment.
The disabling of these debug flags is not properly handled when
BYTES_PER_WORD < ARCH_SLAB_MEMALIGN < cache_line_size()
This patch fixes that and also adds an alignment check to
cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.
Signed-off-by: Kevin Hilman <khilman@mvista.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.
For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.
To make this work, an extra flag is introduced into the management side of the
work_struct. This governs auto-release of the structure upon execution.
Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function. This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated.. This is a
problem if the auxiliary data is taken away (as done by the last patch).
However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems. But then the work function must itself release the
work_struct by calling work_release().
In most cases, automatic release is fine, so this is the default. Special
initiators exist for the non-auto-release case (ending in _NAR).
Signed-Off-By: David Howells <dhowells@redhat.com>
Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.
The work_struct struct is huge, and this limits it's usefulness. On a 64-bit
architecture it's nearly 100 bytes in size. This reduces that by half for the
non-delayable type of event.
Signed-Off-By: David Howells <dhowells@redhat.com>
It looks like there is a bug in init_reap_node() in slab.c that can cause
multiple oops's on certain ES7000 configurations. The variable reap_node
is defined per cpu, but only initialized on a single CPU. This causes an
oops in next_reap_node() when __get_cpu_var(reap_node) returns the wrong
value. Fix is below.
Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Cc: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The zonelist may contain zones of nodes that have not been bootstrapped and
we will oops if we try to allocate from those zones. So check if the node
information for the slab and the node have been setup before attempting an
allocation. If it has not been setup then skip that zone.
Usually we will not encounter this situation since the slab bootstrap code
avoids falling back before we have setup the respective nodes but we seem
to have a special needs for pppc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Init list is called with a list parameter that is not equal to the
cachep->nodelists entry under NUMA if more than one node exists. This is
fully legitimatei. One may want to populate the list fields before
switching nodelist pointers.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reduce the NUMA text size of mm/slab.o a little on x86 by using a local
variable to store the result of numa_node_id().
text data bss dec hex filename
16858 2584 16 19458 4c02 mm/slab.o (before)
16804 2584 16 19404 4bcc mm/slab.o (after)
[akpm@osdl.org: use better names]
[pbadari@us.ibm.com: fix that]
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/davej/configh:
Remove all inclusions of <linux/config.h>
Manually resolved trivial path conflicts due to removed files in
the sound/oss/ subdirectory.
- rename ____kmalloc to kmalloc_track_caller so that people have a chance
to guess what it does just from it's name. Add a comment describing it
for those who don't. Also move it after kmalloc in slab.h so people get
less confused when they are just looking for kmalloc - move things around
in slab.c a little to reduce the ifdef mess.
[penberg@cs.helsinki.fi: Fix up reversed #ifdef]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In cases where we detect a single bit has been flipped, we spew the usual
slab corruption message, which users instantly think is a kernel bug. In a
lot of cases, single bit errors are down to bad memory, or other hardware
failure.
This patch adds an extra line to the slab debug messages in those cases, in
the hope that users will try memtest before they report a bug.
000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.
[akpm@osdl.org: cleanups]
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch insures that the slab node lists in the NUMA case only contain
slabs that belong to that specific node. All slab allocations use
GFP_THISNODE when calling into the page allocator. If an allocation fails
then we fall back in the slab allocator according to the zonelists appropriate
for a certain context.
This allows a replication of the behavior of alloc_pages and alloc_pages node
in the slab layer.
Currently allocations requested from the page allocator may be redirected via
cpusets to other nodes. This results in remote pages on nodelists and that in
turn results in interrupt latency issues during cache draining. Plus the slab
is handing out memory as local when it is really remote.
Fallback for slab memory allocations will occur within the slab allocator and
not in the page allocator. This is necessary in order to be able to use the
existing pools of objects on the nodes that we fall back to before adding more
pages to a slab.
The fallback function insures that the nodes we fall back to obey cpuset
restrictions of the current context. We do not allocate objects from outside
of the current cpuset context like before.
Note that the implementation of locality constraints within the slab allocator
requires importing logic from the page allocator. This is a mischmash that is
not that great. Other allocators (uncached allocator, vmalloc, huge pages)
face similar problems and have similar minimal reimplementations of the basic
fallback logic of the page allocator. There is another way of implementing a
slab by avoiding per node lists (see modular slab) but this wont work within
the existing slab.
V1->V2:
- Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
- Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
#ifdef
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kmalloc_node() falls back to ___cache_alloc() under certain conditions and
at that point memory policies may be applied redirecting the allocation
away from the current node. Therefore kmalloc_node(...,numa_node_id()) or
kmalloc_node(...,-1) may not return memory from the local node.
Fix this by doing the policy check in __cache_alloc() instead of
____cache_alloc().
This version here is a cleanup of Kiran's patch.
- Tested on ia64.
- Extra material removed.
- Consolidate the exit path if alternate_node_alloc() returned an object.
[akpm@osdl.org: warning fix]
Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
un-, de-, -free, -destroy, -exit, etc functions should in general return
void. Also,
There is very little, say, filesystem driver code can do upon failed
kmem_cache_destroy(). If it will be decided to BUG in this case, BUG
should be put in generic code, instead.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the atomic counter for slab_reclaim_pages and replace the counter
and NR_SLAB with two ZVC counter that account for unreclaimable and
reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.
Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE. The
intend seems to be to check for slab pages that could be freed.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The allocpercpu functions __alloc_percpu and __free_percpu() are heavily
using the slab allocator. However, they are conceptually slab. This also
simplifies SLOB (at this point slob may be broken in mm. This should fix
it).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On High end systems (1024 or so cpus) this can potentially cause stack
overflow. Fix the stack usage.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Place the alien array cache locks of on slab malloc slab caches on a
seperate lockdep class. This avoids false positives from lockdep
[akpm@osdl.org: build fix]
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is fairly easy to get a system to oops by simply sizing a cache via
/proc in such a way that one of the chaches (shared is easiest) becomes
bigger than the maximum allowed slab allocation size. This occurs because
enable_cpucache() fails if it cannot reallocate some caches.
However, enable_cpucache() is used for multiple purposes: resizing caches,
cache creation and bootstrap.
If the slab is already up then we already have working caches. The resize
can fail without a problem. We just need to return the proper error code.
F.e. after this patch:
# echo "size-64 10000 50 1000" >/proc/slabinfo
-bash: echo: write error: Cannot allocate memory
notice no OOPS.
If we are doing a kmem_cache_create() then we also should not panic but
return -ENOMEM.
If on the other hand we do not have a fully bootstrapped slab allocator yet
then we should indeed panic since we are unable to bring up the slab to its
full functionality.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The ability to free memory allocated to a slab cache is also useful if an
error occurs during setup of a slab. So extract the function.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Also, checks if we get a valid slabp_cache for off slab slab-descriptors.
We should always get this. If we don't, then in that case we, will have to
disable off-slab descriptors for this cache and do the calculations again.
This is a rare case, so add a BUG_ON, for now, just in case.
Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As explained by Heiko, on s390 (32-bit) ARCH_KMALLOC_MINALIGN is set to
eight because their common I/O layer allocates data structures that need to
have an eight byte alignment. This does not work when CONFIG_SLAB_DEBUG is
enabled because kmem_cache_create will override alignment to BYTES_PER_WORD
which is four.
So change kmem_cache_create to ensure cache alignment is always at minimum
what the architecture or caller mandates even if slab debugging is enabled.
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch splits alloc_percpu() up into two phases. Likewise for
free_percpu(). This allows clients to limit initial allocations to online
cpu's, and to populate or depopulate per-cpu data at run time as needed:
struct my_struct *obj;
/* initial allocation for online cpu's */
obj = percpu_alloc(sizeof(struct my_struct), GFP_KERNEL);
...
/* populate per-cpu data for cpu coming online */
ptr = percpu_populate(obj, sizeof(struct my_struct), GFP_KERNEL, cpu);
...
/* access per-cpu object */
ptr = percpu_ptr(obj, smp_processor_id());
...
/* depopulate per-cpu data for cpu going offline */
percpu_depopulate(obj, cpu);
...
/* final removal */
percpu_free(obj);
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes the following needlessly global functions static:
- slab.c: kmem_find_general_cachep()
- swap.c: __page_cache_release()
- vmalloc.c: __vmalloc_node()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kmem_cache_alloc() was documented twice, but kmem_cache_zalloc() never.
Fix this obvious typo to get things right.
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Few of the callback functions and notifier blocks that are associated with cpu
notifications incorrectly have __devinit and __devinitdata. They should be
__cpuinit and __cpuinitdata instead.
It makes no functional difference but wastes text area when CONFIG_HOTPLUG is
enabled and CONFIG_HOTPLUG_CPU is not.
This patch fixes all those instances.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Chandra Seetharaman reported SLAB crashes caused by the slab.c lock
annotation patch. There is only one chunk of that patch that has a
material effect on the slab logic - this patch undoes that chunk.
This was confirmed to fix the slab problem by Chandra.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mm/slab.c uses nested locking when dealing with 'off-slab'
caches, in that case it allocates the slab header from the
(on-slab) kmalloc caches. Teach the lock validator about
this by putting all on-slab caches into a separate class.
this patch has no effect on non-lockdep kernels.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
undo existing mm/slab.c lock-validator annotations, in preparation
of a new, less intrusive annotation patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Teach special (recursive) locking code to the lock validator. Has no effect
on non-lockdep kernels.
Fix initialize-locks-via-memcpy assumptions.
Effects on non-lockdep kernels: the subclass nesting parameter is passed into
cache_free_alien() and __cache_free(), and turns one internal
kmem_cache_free() call into an open-coded __cache_free() call.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Post and discussion:
http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2
Code in __shrink_node() duplicates code in cache_reap()
Add a new function drain_freelist that removes slabs with objects that are
already free and use that in various places.
This eliminates the __node_shrink() function and provides the interrupt
holdoff reduction from slab_free to code that used to call __node_shrink.
[akpm@osdl.org: build fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Allows reclaim to access counter without looping over processor counts.
- Allows accurate statistics on how many pages are used in a zone by
the slab. This may become useful to balance slab allocations over
various zones.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Per zone counter infrastructure
The counters that we currently have for the VM are split per processor. The
processor however has not much to do with the zone these pages belong to. We
cannot tell f.e. how many ZONE_DMA pages are dirty.
So we are blind to potentially inbalances in the usage of memory in various
zones. F.e. in a NUMA system we cannot tell how many pages are dirty on a
particular node. If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system. For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.
Another example is zone reclaim. We do not know how many unmapped pages exist
per zone. So we just have to try to reclaim. If it is not working then we
pause and try again later. It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone. This patchset allows the determination
of the number of unmapped pages per zone. We can remove the zone reclaim
interval with the counters introduced here.
Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.
The counter framework here implements differential counters for each processor
in struct zone. The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.
Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array. VM functions can
access the counts by simply indexing a global or zone specific array.
The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.
Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state. The second group of
functions can be called if it is known that interrupts are disabled.
Special optimized increment and decrement functions are provided. These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.
We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state(). In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Runtime debugging functionality for rt-mutexes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add debug_check_no_locks_freed(), as a central inline to add
bad-lock-free-debugging functionality to.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make notifier_blocks associated with cpu_notifier as __cpuinitdata.
__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
band-aid solution to solve that problem. In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).
We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).
If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.
This patch reverts the notifier_call changes made in 2.6.17
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Localize poison values into one header file for better documentation and
easier/quicker debugging and so that the same values won't be used for
multiple purposes.
Use these constants in core arch., mm, driver, and fs code.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Move comments for kmalloc to right place, currently it near __do_kmalloc
- Comments for kzalloc
- More detailed comments for kmalloc
- Appearance of "kmalloc" and "kzalloc" man pages after "make mandocs"
[rdunlap@xenotime.net: simplification]
Signed-off-by: Paul Drynoff <pauldrynoff@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The SLAB bootstrap code assumes that the first two kmalloc caches created
(the INDEX_AC and INDEX_L3 kmalloc caches) wont be off-slab. But due to AC
and L3 structure size increase in lockdep, one of them ended up being
off-slab, and subsequently crashing with:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<ffffffff80267478>] kmem_cache_alloc+0x26/0x7d
The fix is to introduce a bootstrap flag and to use it to prevent off-slab
caches being created so early during bootup.
(The calculation for off-slab caches is quite complex so i didnt want to
complicate things with introducing yet another INDEX_ calculation, the flag
approach is simpler and smaller.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Passing an invalid pointer to kfree() and kmem_cache_free() is likely to
cause bad memory corruption or even take down the whole system because the
bad pointer is likely reused immediately due to the per-CPU caches. Until
now, we don't do any verification for this if CONFIG_DEBUG_SLAB is
disabled.
As suggested by Linus, add PageSlab check to page_to_cache() and
page_to_slab() to verify pointers passed to kfree(). Also, move the
stronger check from cache_free_debugcheck() to kmem_cache_free() to ensure
the passed pointer actually belongs to the cache we're about to free the
object.
For page_to_cache() and page_to_slab(), the assertions should have
virtually no extra cost (two instructions, no data cache pressure) and for
kmem_cache_free() the overhead should be minimal.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At present our slab debugging tells us that it detected a double-free or
corruption - it does not distinguish between them. Sometimes it's useful
to be able to differentiate between these two types of information.
Add double-free detection to redzone verification when freeing an object.
As explained by Manfred, when we are freeing an object, both redzones
should be RED_ACTIVE. However, if both are RED_INACTIVE, we are trying to
free an object that was already free'd.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the _entry variant everywhere to clean the code up a tiny bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The last ifdef addition hit the ugliness treshold on this functions, so:
- rename the variable i to nr_pages so it's somewhat descriptive
- remove the addr variable and do the page_address call at the very end
- instead of ifdef'ing the whole alloc_pages_node call just make the
__GFP_COMP addition to flags conditional
- rewrite the __GFP_COMP comment to make sense
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up slab allocator page mapping a bit. The memory allocated for a
slab is physically contiguous so it is okay to assume struct pages are too
so kill the long-standing comment. Furthermore, rename set_slab_attr to
slab_map_pages and add a comment explaining why its needed.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move alien object freeing to cache_free_alien() to reduce #ifdef clutter in
__cache_free().
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>