Commit Graph

2430 Commits

Author SHA1 Message Date
Andy Whitcroft
6babc32c41 mm: handle initialising compound pages at orders greater than MAX_ORDER
When we initialise a compound page we initialise the page flags and head
page pointer for all base pages spanned by that page.  When we initialise
a gigantic page (a page of order greater than or equal to MAX_ORDER) we
have to initialise more than MAX_ORDER_NR_PAGES pages.  Currently we
assume that all elements of the mem_map in this page are contigious in
memory.  However this is only guarenteed out to MAX_ORDER_NR_PAGES pages,
and with SPARSEMEM enabled they will not be contigious.  This leads us to
walk off the end of the first section and scribble on everything which
follows, BAD.

When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next
section of the mem_map.  As gigantic pages can only be maximally aligned
we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from
the start of the page.

This is a bug fix for the gigantic page support in hugetlbfs.

Credit to Mel Gorman for spotting the issue.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02 15:53:13 -07:00
Nick Piggin
4b19de6d1c mm: tiny-shmem nommu fix
The previous patch db203d53d4 ("mm:
tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock
ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU
architectures because it was using the expanding truncate to signal ramfs
to allocate a physically contiguous RAM backing the inode (otherwise it is
unusable for "memory mapping" it to userspace).

However do_truncate is what caused the lock ordering error, due to it
taking i_mutex.  In this case, we can actually just call ramfs directly to
allocate memory for the mapping, rather than go via truncate.

Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02 15:53:13 -07:00
Gerald Schaefer
6c1b7f680d memory hotplug: missing zone->lock in test_pages_isolated()
__test_page_isolated_in_pageblock() in mm/page_isolation.c has a comment
saying that the caller must hold zone->lock. But the only caller of that
function, test_pages_isolated(), does not hold zone->lock and the lock is
also not acquired anywhere before. This patch adds the missing zone->lock
to test_pages_isolated().

We reproducibly run into BUG_ON(!PageBuddy(page)) in __offline_isolated_pages()
during memory hotplug stress test, see trace below. This patch fixes that
problem, it would be good if we could have it in 2.6.27.

kernel BUG at /home/autobuild/BUILD/linux-2.6.26-20080909/mm/page_alloc.c:4561!
illegal operation: 0001 [#1] PREEMPT SMP
Modules linked in: dm_multipath sunrpc bonding qeth_l3 dm_mod qeth ccwgroup vmur
CPU: 1 Not tainted 2.6.26-29.x.20080909-s390default #1
Process memory_loop_all (pid: 10025, task: 2f444028, ksp: 2b10dd28)
Krnl PSW : 040c0000 801727ea (__offline_isolated_pages+0x18e/0x1c4)
 R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0
Krnl GPRS: 00000000 7e27fc00 00000000 7e27fc00
 00000000 00000400 00014000 7e27fc01
 00606f00 7e27fc00 00013fe0 2b10dd28
 00000005 80172662 801727b2 2b10dd28
Krnl Code: 801727de: 5810900c l %r1,12(%r9)
 801727e2: a7f4ffb3 brc 15,80172748
 801727e6: a7f40001 brc 15,801727e8
 >801727ea: a7f4ffbc brc 15,80172762
 801727ee: a7f40001 brc 15,801727f0
 801727f2: a7f4ffaf brc 15,80172750
 801727f6: 0707 bcr 0,%r7
 801727f8: 0017 unknown
Call Trace:
([<0000000000172772>] __offline_isolated_pages+0x116/0x1c4)
 [<00000000001953a2>] offline_isolated_pages_cb+0x22/0x34
 [<000000000013164c>] walk_memory_resource+0xcc/0x11c
 [<000000000019520e>] offline_pages+0x36a/0x498
 [<00000000001004d6>] remove_memory+0x36/0x44
 [<000000000028fb06>] memory_block_change_state+0x112/0x150
 [<000000000028ffb8>] store_mem_state+0x90/0xe4
 [<0000000000289c00>] sysdev_store+0x34/0x40
 [<00000000001ee048>] sysfs_write_file+0xd0/0x178
 [<000000000019b1a8>] vfs_write+0x74/0x118
 [<000000000019b9ae>] sys_write+0x46/0x7c
 [<000000000011160e>] sysc_do_restart+0x12/0x16
 [<0000000077f3e8ca>] 0x77f3e8ca

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02 15:53:13 -07:00
Balbir Singh
31a78f23ba mm owner: fix race between swapoff and exit
There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on.  The condition occurs when
try_to_unuse() runs in parallel with an exiting task.  A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.

CPU0                                    CPU1
                                        try_to_unuse
                                        looks at mm = task0->mm
                                        increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
                                        mmput(mm) decrements mm->mm_users
task0 freed
                                        dereferencing mm->owner fails

The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.

Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.

Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.

Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same.  And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.

Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-29 08:41:47 -07:00
Daisuke Nishimura
a10cebf56c memcg: check under limit at shrink_usage
Current memory cgroup(both in mainline and -mm) doesn't account swap
caches as memory(swap cache support is dropped temporarily now).

So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that
have been moved to swap cache.

But this makes mem_cgroup_shrink_usage fail easily if most of the pages
are anon/shmem, and then shmem_getpage returns -ENOMEM and the process
will be killed.

This patch adds res_counter_check_under_limit to avoid these cases.

BTW, even if swap cache support is enabled again, if a process is moved to
another cgroup, which has been just made, between precharge and
shrink_usage in shmem_getpage, shrink_usage may fail just because there is
no pages to reclaim.

So this change would make sense anyway.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-23 08:09:14 -07:00
Nick Piggin
db203d53d4 mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex
tiny-shmem calls do_truncate in shmem_file_setup.  do_truncate takes
i_mutex, and shmem_file_setup is called with mmap_sem held.  However
i_mutex nests outside mmap_sem.

Copy the code in shmem.c to avoid this problem.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reported-and-tested-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-23 08:09:14 -07:00
Salman Qazi
02b71b7012 slub: fixed uninitialized counter in struct kmem_cache_node
Initialized total objects atomic for the node in init_kmem_cache_node.  The
uninitialized value was ruining the stats in /proc/slabinfo.

Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-09-15 09:49:05 +03:00
Mel Gorman
5bead2a068 mm: mark the correct zone as full when scanning zonelists
The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
scanning zonelists to keep track of where in the zonelist it is.  The
zoneref that is returned corresponds to the the next zone that is to be
scanned, not the current one.  It was intended to be treated as an opaque
list.

When the page allocator is scanning a zonelist, it marks elements in the
zonelist corresponding to zones that are temporarily full.  As the
zonelist is being updated, it uses the cursor here;

  if (NUMA_BUILD)
        zlc_mark_zone_full(zonelist, z);

This is intended to prevent rescanning in the near future but the zoneref
cursor does not correspond to the zone that has been found to be full.
This is an easy misunderstanding to make so this patch corrects the
problem by changing zoneref cursor to be the current zone being scanned
instead of the next one.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13 14:41:52 -07:00
Tejun Heo
ce36394269 mmap: fix petty bug in anonymous shared mmap offset handling
Anonymous mappings should ignore offset but shared anonymous mapping
forgot to clear it and makes the following legit test program trigger
SIGBUS.

 #include <sys/mman.h>
 #include <stdio.h>
 #include <errno.h>

 #define PAGE_SIZE	4096

 int main(void)
 {
	 char *p;
	 int i;

	 p = mmap(NULL, 2 * PAGE_SIZE, PROT_READ|PROT_WRITE,
		  MAP_SHARED|MAP_ANONYMOUS, -1, PAGE_SIZE);
	 if (p == MAP_FAILED) {
		 perror("mmap");
		 return 1;
	 }

	 for (i = 0; i < 2; i++) {
		 printf("page %d\n", i);
		 p[i * 4096] = i;
	 }
	 return 0;
 }

Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-03 19:58:53 -07:00
KOSAKI Motohiro
b954185214 mm: size of quicklists shouldn't be proportional to the number of CPUs
Quicklists store pages for each CPU as caches.  (Each CPU can cache
node_free_pages/16 pages)

It is used for page table cache.  exit() will increase the cache size,
while fork() consumes it.

So for example if an apache-style application runs (one parent and many
child model), one CPU process will fork() while another CPU will process
the middleware work and exit().

At that time, the CPU on which the parent runs doesn't have page table
cache at all.  Others (on which children runs) have maximum caches.

	QList_max = (#ofCPUs - 1) x Free / 16
	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)

So, How much quicklist memory is used in the maximum case?

This is proposional to # of CPUs because the limit of per cpu quicklist
cache doesn't see the number of cpus.

Above calculation mean

	 Number of CPUs per node            2    4    8   16
	 ==============================  ====================
	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%

Wow! Quicklist can spend about 50% memory at worst case.

My demonstration program is here
--------------------------------------------------------------------------------
#define _GNU_SOURCE

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define BUFFSIZE 512

int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
{
  FILE *fd;
  char *ret, buffer[BUFFSIZE];
  int cpu = 1;

  fd = fopen("/proc/cpuinfo", "r");
  if (fd == NULL) {
    perror("fopen(/proc/cpuinfo)");
    exit(EXIT_FAILURE);
  }
  while (1) {
    ret = fgets(buffer, BUFFSIZE, fd);
    if (ret == NULL)
      break;
    if (!strncmp(buffer, "processor", 9))
      cpu = atoi(strchr(buffer, ':') + 2);
  }
  fclose(fd);
  return cpu;
}

void cpu_bind(int cpu)	/* bind current process to one cpu */
{
  cpu_set_t mask;
  int ret;

  CPU_ZERO(&mask);
  CPU_SET(cpu, &mask);
  ret = sched_setaffinity(0, sizeof(mask), &mask);
  if (ret == -1) {
    perror("sched_setaffinity()");
    exit(EXIT_FAILURE);
  }
  sched_yield();	/* not necessary */
}

#define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
#define FORK_INTERVAL 1	/* 1 second */

main(int argc, char *argv[])
{
  int cpu_max, nextcpu;
  long pagesize;
  pid_t pid;

  /* set max number of logical cpu */
  if (argc > 1)
    cpu_max = atoi(argv[1]) - 1;
  else
    cpu_max = max_cpu();

  /* get the page size */
  pagesize = sysconf(_SC_PAGESIZE);
  if (pagesize == -1) {
    perror("sysconf(_SC_PAGESIZE)");
    exit(EXIT_FAILURE);
  }

  /* prepare parent process */
  cpu_bind(0);
  nextcpu = cpu_max;

loop:

  /* select destination cpu for child process by round-robin rule */
  if (++nextcpu > cpu_max)
    nextcpu = 1;

  pid = fork();

  if (pid == 0) { /* child action */

    char *p;
    int i;

    /* consume page tables */
    p = mmap(0, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    i = MMAP_SIZE / pagesize;
    while (i-- > 0) {
      *p = 1;
      p += pagesize;
    }

    /* move to other cpu */
    cpu_bind(nextcpu);
/*
    printf("a child moved to cpu%d after mmap().\n", nextcpu);
    fflush(stdout);
 */

    /* back page tables to pgtable_quicklist */
    exit(0);

  } else if (pid > 0) { /* parent action */

    sleep(FORK_INTERVAL);
    waitpid(pid, NULL, WNOHANG);

  }

  goto loop;
}
----------------------------------------

When above program which does task migration runs, my 8GB box spends
800MB of memory for quicklist.  This is not memory leak but doesn't seem
good.

% cat /proc/meminfo

MemTotal:        7701568 kB
MemFree:         4724672 kB
(snip)
Quicklists:       844800 kB

because

- My machine spec is
	number of numa node: 2
	number of cpus:      8 (4CPU x2 node)
        total mem:           8GB (4GB x2 node)
        free mem:            about 5GB

- Then, 4.7GB x 16% ~= 880MB.
  So, Quicklist can use 800MB.

So, if following spec machine run that program

   CPUs: 64 (8cpu x 8node)
   Mem:  1TB (128GB x8node)

Then, quicklist can waste 300GB (= 1TB x 30%).  It is too large.

So, I don't like cache policies which is proportional to # of cpus.

My patch changes the number of caches
from:
   per-cpu-cache-amount = memory_on_node / 16
to
   per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Tested-by: David Miller <davem@davemloft.net>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:38 -07:00
Marcin Slusarz
527655835e mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data
WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
The variable contig_page_data references
the variable __initdata bootmem_node_data
If the reference is valid then annotate the
variable with __init* (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Sean MacLennan <smaclennan@pikatech.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:37 -07:00
Hisashi Hifumi
6ccfa806a9 VFS: fix dio write returning EIO when try_to_release_page fails
Dio write returns EIO when try_to_release_page fails because bh is
still referenced.

The patch

    commit 3f31fddfa2
    Author: Mingming Cao <cmm@us.ibm.com>
    Date:   Fri Jul 25 01:46:22 2008 -0700

        jbd: fix race between free buffer and commit transaction

was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.

I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.

The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race, freeing
buffer(dio) and ext3/4_ordered_writepage.

: background_writeout()
     ->write_cache_pages()
       ->ext3_ordered_writepage()
     	   walk_page_buffers() -> take a bh ref
 	   block_write_full_page() -> unlock_page
		: <- end_page_writeback
                : <- race! (dio write->try_to_release_page fails)
      	   walk_page_buffers() ->release a bh ref

ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.

To fix this race, I used the approach of falling back to buffered
writes if try_to_release_page() fails on a page.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:37 -07:00
Adam Litke
344c790e38 mm: make setup_zone_migrate_reserve() aware of overlapping nodes
I have gotten to the root cause of the hugetlb badness I reported back on
August 15th.  My system has the following memory topology (note the
overlapping node):

            Node 0 Memory: 0x8000000-0x44000000
            Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
candidates, it happily continues the scan into 0x8000000-0x44000000.  When
a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
the wrong zone.  Oops.

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:37 -07:00
David Woodhouse
0ed97ee470 Remove '#include <stddef.h>' from mm/page_isolation.c
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-09-02 09:29:01 +01:00
Linus Torvalds
41c3e45f08 Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
  [ARM] 5226/1: remove unmatched comment end.
  [ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo
  [ARM] use bcd2bin/bin2bcd
  [ARM] use the new byteorder headers
  [ARM] OMAP: Fix 2430 SMC91x ethernet IRQ
  [ARM] OMAP: Add and update OMAP default configuration files
  [ARM] OMAP: Change mailing list for OMAP in MAINTAINERS
  [ARM] S3C2443: Fix the S3C2443 clock register definitions
  [ARM] JIVE: Fix the spi bus numbering
  [ARM] S3C24XX: pwm.c: stop debugging output
  [ARM] S3C24XX: Fix sparse warnings in pwm.c
  [ARM] S3C24XX: Fix spare errors in pwm-clock driver
  [ARM] S3C24XX: Fix sparse warnings in arch/arm/plat-s3c24xx/gpiolib.c
  [ARM] S3C24XX: Fix nor-simtec driver sparse errors
  [ARM] 5225/1: zaurus: Register I2C controller for audio codecs
  [ARM] orion5x: update defconfig to v2.6.27-rc4
  [ARM] Orion: register UART1 on QNAP TS-209 and TS-409
  [ARM] Orion: activate lm75 driver on DNS-323
  [ARM] Orion: fix MAC detection on QNAP TS-209 and TS-409
  [ARM] Orion: Fix boot crash on Kurobox Pro
2008-08-28 12:34:27 -07:00
Linus Torvalds
e4268bd3b2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Disable NUMA remote node defragmentation by default
2008-08-27 14:33:06 -07:00
Mel Gorman
e80d6a2482 [ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo
Ordinarily, memory holes in flatmem still have a valid memmap and is safe
to use. However, an architecture (ARM) frees up the memmap backing memory
holes on the assumption it is never used. /proc/pagetypeinfo reads the
whole range of pages in a zone believing that the memmap is valid and that
pfn_valid will return false if it is not. On ARM, freeing the memmap breaks
the page->zone linkages even though pfn_valid() returns true and the kernel
can oops shortly afterwards due to accessing a bogus struct zone *.

This patch lets architectures say when FLATMEM can have holes in the
memmap. Rather than an expensive check for valid memory, /proc/pagetypeinfo
will confirm that the page linkages are still valid by checking page->zone
is still the expected zone. The lookup of page_zone is safe as there is a
limited range of memory that is accessed when calling page_zone.  Even if
page_zone happens to return the correct zone, the impact is that the counters
in /proc/pagetypeinfo are slightly off but fragmentation monitoring is
unlikely to be relevant on an embedded system.

Reported-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2008-08-27 20:09:28 +01:00
Nick Piggin
14bac5acfd mm: xip/ext2 fix block allocation race
XIP can call into get_xip_mem concurrently with the same file,offset with
create=1.  This usually maps down to get_block, which expects the page
lock to prevent such a situation.  This causes ext2 to explode for one
reason or another.

Serialise those calls for the moment.  For common usages today, I suspect
get_xip_mem rarely is called to create new blocks.  In future as XIP
technologies evolve we might need to look at which operations require
scalability, and rework the locking to suit.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Acked-by: Carsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:32 -07:00
Nick Piggin
538f8ea6c8 mm: xip fix fault vs sparse page invalidate race
XIP has a race between sparse pages being inserted into page tables, and
sparse pages being zapped when its time to put a non-sparse page in.

What can happen is that a process can be left with a dangling sparse page
in a MAP_SHARED mapping, while the rest of the world sees the non-sparse
version.  Ie.  data corruption.

Guard these operations with a seqlock, making fault-in-sparse-pages the
slowpath, and try-to-unmap-sparse-pages the fastpath.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Acked-by: Carsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:32 -07:00
Nick Piggin
479db0bf40 mm: dirty page tracking race fix
There is a race with dirty page accounting where a page may not properly
be accounted for.

clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.

page_mkclean walks the rmaps for that page, and for each one it cleans and
write protects the pte if it was dirty.  It uses page_check_address to
find the pte.  That function has a shortcut to avoid the ptl if the pte is
not present.  Unfortunately, the pte can be switched to not-present then
back to present by other code while holding the page table lock -- this
should not be a signal for page_mkclean to ignore that pte, because it may
be dirty.

For example, powerpc64's set_pte_at will clear a previously present pte
before setting it to the desired value.  There may also be other code in
core mm or in arch which do similar things.

The consequence of the bug is loss of data integrity due to msync, and
loss of dirty page accounting accuracy.  XIP's __xip_unmap could easily
also be unreliable (depending on the exact XIP locking scheme), which can
lead to data corruption.

Fix this by having an option to always take ptl to check the pte in
page_check_address.

It's possible to retain this optimization for page_referenced and
try_to_unmap.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Carsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:32 -07:00
Johannes Weiner
481ebd0d76 bootmem: fix aligning of node-relative indexes and offsets
Absolute alignment requirements may never be applied to node-relative
offsets.  Andreas Herrmann spotted this flaw when a bootmem allocation on
an unaligned node was itself not aligned because the combination of an
unaligned node with an aligned offset into that node is not garuanteed to
be aligned itself.

This patch introduces two helper functions that align a node-relative
index or offset with respect to the node's starting address so that the
absolute PFN or virtual address that results from combining the two
satisfies the requested alignment.

Then all the broken ALIGN()s in alloc_bootmem_core() are replaced by these
helpers.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Reported-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Debugged-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:31 -07:00
Marcin Slusarz
759f9a2df7 mm: mminit_loglevel cannot be __meminitdata anymore
mminit_loglevel is now used from mminit_verify_zonelist <- build_all_zonelists <-

1. online_pages <- memory_block_action <- memory_block_change_state <- store_mem_state (sys handler)
2. numa_zonelist_order_handler (proc handler)

so it cannot be annotated __meminit - drop it

fixes following section mismatch warning:
WARNING: vmlinux.o(.text+0x71628): Section mismatch in reference from the function mminit_verify_zonelist() to the variable .meminit.data:mminit_loglevel
The function mminit_verify_zonelist() references
the variable __meminitdata mminit_loglevel.
This is often because mminit_verify_zonelist lacks a __meminitdata
annotation or the annotation of mminit_loglevel is wrong.

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:30 -07:00
Hugh Dickins
07279cdfd9 mm: show free swap as signed
Adjust <Alt><SysRq>m show_swap_cache_info() to show "Free swap" as a
signed long: the signed format is preferable, because during swapoff
nr_swap_pages can legitimately go negative, so makes more sense thus
(it used to be shown redundantly, once as signed and once as unsigned).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:30 -07:00
Hugh Dickins
16f8c5b2e6 mm: page_remove_rmap comments on PageAnon
Add a comment to s390's page_test_dirty/page_clear_dirty/page_set_dirty
dance in page_remove_rmap(): I was wrong to think the PageSwapCache test
could be avoided, and would like a comment in there to remind me.  And
mention s390, to help us remember that this block is not really common.

Also move down the "It would be tidy to reset PageAnon" comment: it does
not belong to s390's block, and it would be unwise to reset PageAnon
before we're done with testing it.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:30 -07:00
Christoph Lameter
e2cb96b7ec slub: Disable NUMA remote node defragmentation by default
Switch remote node defragmentation off by default. The current settings can
cause excessive node local allocations with hackbench:

  SLAB:

    % cat /proc/meminfo
    MemTotal:        7701760 kB
    MemFree:         5940096 kB
    Slab:             123840 kB

  SLUB:

    % cat /proc/meminfo
    MemTotal:        7701376 kB
    MemFree:         4740928 kB
    Slab:            1591680 kB

[Note: this feature is not related to slab defragmentation.]

You can find the original discussion here:

  http://lkml.org/lkml/2008/8/4/308

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-08-20 21:50:21 +03:00
Linus Torvalds
71ef2a46fc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  security: Fix setting of PF_SUPERPRIV by __capable()
2008-08-15 15:32:13 -07:00
Mikulas Patocka
627240aaa9 bootmem allocator: alloc_bootmem_core(): page-align the end offset
This is the minimal sequence that jams the allocator:

void *p, *q, *r;
p = alloc_bootmem(PAGE_SIZE);
q = alloc_bootmem(64);
free_bootmem(p, PAGE_SIZE);
p = alloc_bootmem(PAGE_SIZE);
r = alloc_bootmem(64);

after this sequence (assuming that the allocator was empty or page-aligned
before), pointer "q" will be equal to pointer "r".

What's hapenning inside the allocator:
p = alloc_bootmem(PAGE_SIZE);
in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000...
q = alloc_bootmem(64);
in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000...
free_bootmem(p, PAGE_SIZE);
in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000...
p = alloc_bootmem(PAGE_SIZE);
in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000...
r = alloc_bootmem(64);

and now:

it finds bit "2", as a place where to allocate (sidx)

it hits the condition

if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx))
start_off = ALIGN(bdata->last_end_off, align);

-you can see that the condition is true, so it assigns start_off =
ALIGN(bdata->last_end_off, align); (that is PAGE_SIZE) and allocates
over already allocated block.

With the patch it tries to continue at the end of previous allocation only
if the previous allocation ended in the middle of the page.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Johannes Weiner <hannes@saeurebad.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15 08:35:41 -07:00
David Howells
5cd9c58fbe security: Fix setting of PF_SUPERPRIV by __capable()
Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags
the target process if that is not the current process and it is trying to
change its own flags in a different way at the same time.

__capable() is using neither atomic ops nor locking to protect t->flags.  This
patch removes __capable() and introduces has_capability() that doesn't set
PF_SUPERPRIV on the process being queried.

This patch further splits security_ptrace() in two:

 (1) security_ptrace_may_access().  This passes judgement on whether one
     process may access another only (PTRACE_MODE_ATTACH for ptrace() and
     PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
     current is the parent.

 (2) security_ptrace_traceme().  This passes judgement on PTRACE_TRACEME only,
     and takes only a pointer to the parent process.  current is the child.

     In Smack and commoncap, this uses has_capability() to determine whether
     the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
     This does not set PF_SUPERPRIV.

Two of the instances of __capable() actually only act on current, and so have
been changed to calls to capable().

Of the places that were using __capable():

 (1) The OOM killer calls __capable() thrice when weighing the killability of a
     process.  All of these now use has_capability().

 (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
     whether the parent was allowed to trace any process.  As mentioned above,
     these have been split.  For PTRACE_ATTACH and /proc, capable() is now
     used, and for PTRACE_TRACEME, has_capability() is used.

 (3) cap_safe_nice() only ever saw current, so now uses capable().

 (4) smack_setprocattr() rejected accesses to tasks other than current just
     after calling __capable(), so the order of these two tests have been
     switched and capable() is used instead.

 (5) In smack_file_send_sigiotask(), we need to allow privileged processes to
     receive SIGIO on files they're manipulating.

 (6) In smack_task_wait(), we let a process wait for a privileged process,
     whether or not the process doing the waiting is privileged.

I've tested this with the LTP SELinux and syscalls testscripts.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
2008-08-14 22:59:43 +10:00
Huang Weiyi
fc1efbdb7a mm/sparse.c: removed duplicated include
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:30 -07:00
MinChan Kim
d6bf73e434 do_migrate_pages(): remove unused variable
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:29 -07:00
Andy Whitcroft
2b26736c88 allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2
[Andrew this should replace the previous version which did not check
the returns from the region prepare for errors.  This has been tested by
us and Gerald and it looks good.

Bah, while reviewing the locking based on your previous email I spotted
that we need to check the return from the vma_needs_reservation call for
allocation errors.  Here is an updated patch to correct this.  This passes
testing here.]

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:28 -07:00
Andy Whitcroft
57303d8017 hugetlbfs: allocate structures for reservation tracking outside of spinlocks
In the normal case, hugetlbfs reserves hugepages at map time so that the
pages exist for future faults.  A struct file_region is used to track when
reservations have been consumed and where.  These file_regions are
allocated as necessary with kmalloc() which can sleep with the
mm->page_table_lock held.  This is wrong and triggers may-sleep warning
when PREEMPT is enabled.

Updates to the underlying file_region are done in two phases.  The first
phase prepares the region for the change, allocating any necessary memory,
without actually making the change.  The second phase actually commits the
change.  This patch makes use of this by checking the reservations before
the page_table_lock is taken; triggering any necessary allocations.  This
may then be safely repeated within the locks without any allocations being
required.

Credit to Mel Gorman for diagnosing this failure and initial versions of
the patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:28 -07:00
Hugh Dickins
9623e078c1 memcg: fix oops in mem_cgroup_shrink_usage
Got an oops in mem_cgroup_shrink_usage() when testing loop over tmpfs:
yes, of course, loop0 has no mm: other entry points check but this didn't.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:28 -07:00
Jan Beulich
74768ed833 page allocator: use no-panic variant of alloc_bootmem() in alloc_large_system_hash()
..  since a failed allocation is being (initially) handled gracefully, and
panic()-ed upon failure explicitly in the function if retries with smaller
sizes failed.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:27 -07:00
Gerald Schaefer
caff3a2c33 hugetlb: call arch_prepare_hugepage() for surplus pages
The s390 software large page emulation implements shared page tables by
using page->index of the first tail page from a compound large page to
store page table information.  This is set up in arch_prepare_hugepage(),
which is called from alloc_fresh_huge_page_node().

A similar call to arch_prepare_hugepage() is missing for surplus large
pages that are allocated in alloc_buddy_huge_page(), which breaks the
software emulation mode for (surplus) large pages on s390.  This patch
adds the missing call to arch_prepare_hugepage().  It will have no effect
on other architectures where arch_prepare_hugepage() is a nop.

Also, use the correct order in the error path in alloc_fresh_huge_page_node().

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:27 -07:00
Linus Torvalds
1c89ac5501 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  fix spinlock recursion in hvc_console
  stop_machine: remove unused variable
  modules: extend initcall_debug functionality to the module loader
  export virtio_rng.h
  lguest: use get_user_pages_fast() instead of get_user_pages()
  mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
  lguest: don't set MAC address for guest unless specified
2008-08-12 08:40:19 -07:00
Rusty Russell
912985dce4 mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
Out of line get_user_pages_fast fallback implementation, make it a weak
symbol, get rid of CONFIG_HAVE_GET_USER_PAGES_FAST.

Export the symbol to modules so lguest can use it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-08-12 17:52:53 +10:00
Linus Torvalds
9b4d0bab32 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: fix debug_lock_alloc
  lockdep: increase MAX_LOCKDEP_KEYS
  generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask()
  lockdep: fix overflow in the hlock shrinkage code
  lockdep: rename map_[acquire|release]() => lock_map_[acquire|release]()
  lockdep: handle chains involving classes defined in modules
  mm: fix mm_take_all_locks() locking order
  lockdep: annotate mm_take_all_locks()
  lockdep: spin_lock_nest_lock()
  lockdep: lock protection locks
  lockdep: map_acquire
  lockdep: shrink held_lock structure
  lockdep: re-annotate scheduler runqueues
  lockdep: lock_set_subclass - reset a held lock's subclass
  lockdep: change scheduler annotation
  debug_locks: set oops_in_progress if we will log messages.
  lockdep: fix combinatorial explosion in lock subgraph traversal
2008-08-11 16:45:46 -07:00
Ingo Molnar
23a0ee908c Merge branch 'core/locking' into core/urgent 2008-08-12 00:11:49 +02:00
Peter Zijlstra
7cd5a02f54 mm: fix mm_take_all_locks() locking order
Lockdep spotted:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.27-rc1 #270
-------------------------------------------------------
qemu-kvm/2033 is trying to acquire lock:
 (&inode->i_data.i_mmap_lock){----}, at: [<ffffffff802996cc>] mm_take_all_locks+0xc2/0xea

but task is already holding lock:
 (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&anon_vma->lock){----}:
       [<ffffffff8025cd37>] __lock_acquire+0x11be/0x14d2
       [<ffffffff8025d0a9>] lock_acquire+0x5e/0x7a
       [<ffffffff804c655b>] _spin_lock+0x3b/0x47
       [<ffffffff8029a2ef>] vma_adjust+0x200/0x444
       [<ffffffff8029a662>] split_vma+0x12f/0x146
       [<ffffffff8029bc60>] mprotect_fixup+0x13c/0x536
       [<ffffffff8029c203>] sys_mprotect+0x1a9/0x21e
       [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (&inode->i_data.i_mmap_lock){----}:
       [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
       [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
       [<ffffffff8025d515>] lock_release+0x127/0x14a
       [<ffffffff804c6403>] _spin_unlock+0x1e/0x50
       [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
       [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
       [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
       [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
       [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
       [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
       [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
       [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

5 locks held by qemu-kvm/2033:
 #0:  (&mm->mmap_sem){----}, at: [<ffffffff802a95d0>] do_mmu_notifier_register+0x55/0x112
 #1:  (mm_all_locks_mutex){--..}, at: [<ffffffff8029963e>] mm_take_all_locks+0x34/0xea
 #2:  (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
 #3:  (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
 #4:  (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea

stack backtrace:
Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

Call Trace:
 [<ffffffff8025b7c7>] print_circular_bug_tail+0xb8/0xc3
 [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
 [<ffffffff80259bb1>] ? add_lock_to_list+0x7e/0xad
 [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
 [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
 [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
 [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
 [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
 [<ffffffff8025b202>] ? trace_hardirqs_on_caller+0x4d/0x115
 [<ffffffff802995d9>] ? mm_drop_all_locks+0x7f/0xb0
 [<ffffffff8025d515>] lock_release+0x127/0x14a
 [<ffffffff804c6403>] _spin_unlock+0x1e/0x50
 [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
 [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
 [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
 [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
 [<ffffffff8033f9f2>] ? file_has_perm+0x83/0x8e
 [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
 [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
 [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
 [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b

Which the locking hierarchy in mm/rmap.c confirms as valid.

Fix this by first taking all the mapping->i_mmap_lock instances and then
take all anon_vma->lock instances.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11 09:30:25 +02:00
Peter Zijlstra
454ed842d5 lockdep: annotate mm_take_all_locks()
The nesting is correct due to holding mmap_sem, use the new annotation
to annotate this.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11 09:30:25 +02:00
Linus Torvalds
4fbb71597a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  SLUB: dynamic per-cache MIN_PARTIAL
  mm: unexport ksize
2008-08-09 16:21:33 -07:00
Linus Torvalds
d6606683a5 Revert duplicate "mm/hugetlb.c must #include <asm/io.h>"
This reverts commit 7cb9318162, since we
did that patch twice, and the problem was already fixed earlier by
78a34ae29b.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-06 12:04:54 -07:00
Benny Halevy
dfe195fb79 mm: fix uninitialized variables for find_vma_prepare callers
gcc 4.3.0 correctly emits the following warnings.
When a vma covering addr is found, find_vma_prepare indeed returns without
setting pprev, rb_link, and rb_parent.

  mm/mmap.c: In function `insert_vm_struct':
  mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
  mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
  mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
  mm/mmap.c: In function `copy_vma':
  mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
  mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
  mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
  mm/mmap.c: In function `do_brk':
  mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
  mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
  mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
  mm/mmap.c: In function `mmap_region':
  mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
  mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
  mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function

Hugh adds: in fact, none of find_vma_prepare's callers use those values
when a vma is found to be already covering addr, it's either an error or
an occasion to munmap and repeat.  Okay, let's quieten the compiler (but I
would prefer it if pprev, rb_link and rb_parent were meaningful in that
case, rather than whatever's in them from descending the tree).

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: "Ryan Hope" <rmh3093@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-05 14:33:50 -07:00
Andrew Morton
5c9ffc9c3d mm_init.c: avoid ifdef-inside-macro-expansion
gcc-3.2:

  mm/mm_init.c:77:1: directives may not be used inside a macro argument
  mm/mm_init.c:76:47: unterminated argument list invoking macro "mminit_dprintk"
  mm/mm_init.c: In function `mminit_verify_pageflags_layout':
  mm/mm_init.c:80: `mminit_dprintk' undeclared (first use in this function)
  mm/mm_init.c:80: (Each undeclared identifier is reported only once
  mm/mm_init.c:80: for each function it appears in.)
  mm/mm_init.c:80: syntax error before numeric constant

Also fix a typo in a comment.

Reported-by: Adrian Bunk <bunk@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-05 14:33:46 -07:00
Pekka Enberg
5595cffc82 SLUB: dynamic per-cache MIN_PARTIAL
This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial
value that is calculated from object size. The bigger the object size, the more
pages we keep on the partial list.

I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example
script of the fio benchmarking tool. The script stresses the networking
subsystem which should also give a fairly good beating of kmalloc() et al.

To run the test yourself, first clone the fio repository:

  git clone git://git.kernel.dk/fio.git

and then run the following command n times on your machine:

  time ./fio examples/netio

The results on my 2-way 64-bit x86 machine are as follows:

  [ the minimum, maximum, and average are captured from 50 individual runs ]

                 real time (seconds)
                 min      max      avg      sd
  SLAB           22.76    23.38    22.98    0.17
  SLUB           22.80    25.78    23.46    0.72
  SLUB (dynamic) 22.74    23.54    23.00    0.20

                 sys time (seconds)
                 min      max      avg      sd
  SLAB           6.90     8.28     7.70     0.28
  SLUB           7.42     16.95    8.89     2.28
  SLUB (dynamic) 7.17     8.64     7.73     0.29

                 user time (seconds)
                 min      max      avg      sd
  SLAB           36.89    38.11    37.50    0.29
  SLUB           30.85    37.99    37.06    1.67
  SLUB (dynamic) 36.75    38.07    37.59    0.32

As you can see from the above numbers, this patch brings SLUB to the same level
as SLAB for this particular workload fixing a ~2% regression. I'd expect this
change to help similar workloads that allocate a lot of objects that are close
to the size of a page.

Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-08-05 09:28:47 +03:00
Nick Piggin
529ae9aaa0 mm: rename page trylock
Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-04 21:31:34 -07:00
Linus Torvalds
2e1e9212ed Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (29 commits)
  sh: enable maple_keyb in dreamcast_defconfig.
  SH2(A) cache update
  nommu: Provide vmalloc_exec().
  add addrespace definition for sh2a.
  sh: Kill off ARCH_SUPPORTS_AOUT and remnants of a.out support.
  sh: define GENERIC_HARDIRQS_NO__DO_IRQ.
  sh: define GENERIC_LOCKBREAK.
  sh: Save NUMA node data in vmcore for crash dumps.
  sh: module_alloc() should be using vmalloc_exec().
  sh: Fix up __bug_table handling in module loader.
  sh: Add documentation and integrate into docbook build.
  sh: Fix up broken kerneldoc comments.
  maple: Kill useless private_data pointer.
  maple: Clean up maple_driver_register/unregister routines.
  input: Clean up maple keyboard driver
  maple: allow removal and reinsertion of keyboard driver module
  sh: /proc/asids depends on MMU.
  arch/sh/boards/mach-se/7343/irq.c: removed duplicated #include
  arch/sh/boards/board-ap325rxa.c: removed duplicated #include
  sh/boards/Makefile typo fix
  ...
2008-08-04 17:26:15 -07:00
KOSAKI Motohiro
a477097d9c mlock() fix return values
Halesh says:

Please find the below testcase provide to test mlock.

Test Case :
===========================

#include <sys/resource.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>

int main(void)
{
  int fd,ret, i = 0;
  char *addr, *addr1 = NULL;
  unsigned int page_size;
  struct rlimit rlim;

  if (0 != geteuid())
  {
   printf("Execute this pgm as root\n");
   exit(1);
  }

  /* create a file */
  if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
  {
   printf("cant create test file\n");
   exit(1);
  }

  page_size = sysconf(_SC_PAGE_SIZE);

  /* set the MEMLOCK limit */
  rlim.rlim_cur = 2000;
  rlim.rlim_max = 2000;

  if ((ret = setrlimit(RLIMIT_MEMLOCK,&rlim)) != 0)
  {
   printf("Cant change limit values\n");
   exit(1);
  }

  addr = 0;
  while (1)
  {
  /* map a page into memory each time*/
  if ((addr = (char *) mmap(addr,page_size, PROT_READ |
PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
  {
   printf("cant do mmap on file\n");
   exit(1);
  }

  if (0 == i)
    addr1 = addr;
  i++;
  errno = 0;
  /* lock the mapped memory pagewise*/
  if ((ret = mlock((char *)addr, 1500)) == -1)
  {
   printf("errno value is %d\n", errno);
   printf("cant lock maped region\n");
   exit(1);
  }
  addr = addr + page_size;
 }
}
======================================================

This testcase results in an mlock() failure with errno 14 that is EFAULT,
but it has nowhere been specified that mlock() will return EFAULT.  When I
tested the same on older kernels like 2.6.18, I got the correct result i.e
errno 12 (ENOMEM).

I think in source code mlock(2), setting errno ENOMEM has been missed in
do_mlock() , on mlock_fixup() failure.

SUSv3 requires the following behavior frmo mlock(2).

[ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

[EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

This rule isn't so nice and slighly strange.  but many people think
POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com>
Tested-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-04 16:58:45 -07:00
Paul Mundt
1af446edfe nommu: Provide vmalloc_exec().
Now that SH has switched to vmalloc_exec() for PAGE_KERNEL_EXEC usage,
it's apparent that nommu has no vmalloc_exec() definition of its own.
Stub in the one from mm/vmalloc.c.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-08-04 16:01:47 +09:00