hugepage_activelist will be used to track currently used HugeTLB pages.
We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
On cgroup removal we update the page's HugeTLB cgroup to point to parent
cgroup.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we migrate only one hugepage, don't use linked list for passing the
page around. Directly pass the page that need to be migrated as argument.
This also removes the usage of page->lru in the migrate path.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a mmu_gather instead of a temporary linked list for accumulating pages
when we unmap a hugepage range
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an inline helper and use it in the code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
VM_FAULT_* values from MAX_ERRNO.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset implements a cgroup resource controller for HugeTLB pages.
The controller allows to limit the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that,
the application will get SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how
much HugeTLB pages it would require for its use.
The goal is to control how many HugeTLB pages a group of task can
allocate. It can be looked at as an extension of the existing quota
interface which limits the number of HugeTLB pages per hugetlbfs
superblock. HPC job scheduler requires jobs to specify their resource
requirements in the job file. Once their requirements can be met, job
schedulers like (SLURM) will schedule the job. We need to make sure that
the jobs won't consume more resources than requested. If they do we
should either error out or kill the application.
This patch:
Rename max_hstate to hugetlb_max_hstate. We will be using this from other
subsystems like hugetlb controller in later patches.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more. But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.
For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.
Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, function should_fail() has "bool" for its return value, so it's
reasonable to change the return value of function should_fail_alloc_page()
into "bool" as well.
The patch does cleanup on function should_fail_alloc_page() to have "bool"
for its return value.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.
Even for mprotect_fixup(), we can get the right result in the end.
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix of the documentation of /proc/sys/vm/page-cluster to match the
behavior of the code and add some comments about what the tunable will
change in that behavior.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap readahead works fine, but the I/O to disk is almost always done in
page size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.
On older kernels the old per device plugging behavior might have captured
this and merged the requests, but currently all comes down to much more
I/Os than required.
On a single device this might not be an issue, but as soon as a server
runs on shared san resources savin I/Os not only improves swapin
throughput but also provides a lower resource utilization.
With a load running KVM in a lot of memory overcommitment (the hot memory
is 1.5 times the host memory) swapping throughput improves significantly
and the lead feels more responsive as well as achieves more throughput.
In a test setup with 16 swap disks running blocktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
IO unplugs: 149,614 Timer unplugs: 2,940
With the patch:
Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
IO unplugs: 337,130 Timer unplugs: 11,184
I got ~10% to ~40% more throughput in my cases and at the same time much
lower cpu consumption when broken down per transferred kilobyte (the
majority of that due to saved interrupts and better cache handling). In a
shared SAN others might get an additional benefit as well, because this
now causes less protocol overhead.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are no users since commit b24028572f ("memcg: remove PCG_CACHE").
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, in memcg, 2 "MAPPED" enum/macro are found
MEM_CGROUP_CHARGE_TYPE_MAPPED
MEM_CGROUP_STAT_FILE_MAPPED
Thier names looks similar to each other but the former is used for
accounting anonymous memory. rename it as TYPE_ANON.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MEM_CGROUP_STAT_SWAPOUT represents the usage of swap rather than
the number of swap-out events. Rename it to be MEM_CGROUP_STAT_SWAP.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate
0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT
and interesting stuff happens. So make debugging such problems easier and
warn about 0-size allocation.
[akpm@linux-foundation.org: use WARN_ON-return-value feature]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a walk by repeating rb_next to find a suitable hole. Could be
simply replaced by walk on the sorted vmap_area_list. More simpler and
efficient.
Mutation of the list and tree only happens in pair within
__insert_vmap_area and __free_vmap_area, under protection of
vmap_area_lock. The patch code is also under vmap_area_lock, so the list
walk is safe, and consistent with the tree walk.
Tested on SMP by repeating batch of vmalloc anf vfree for random sizes and
rounds for hours.
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix zillions of these:
drivers/media/video/v4l2-ioctl.c:1848: error: unknown field 'func' specified in initializer
drivers/media/video/v4l2-ioctl.c:1848: warning: missing braces around initializer
drivers/media/video/v4l2-ioctl.c:1848: warning: (near initialization for 'v4l2_ioctls[0].<anonymous>')
drivers/media/video/v4l2-ioctl.c:1848: warning: initialization makes integer from pointer without a cast
drivers/media/video/v4l2-ioctl.c:1848: error: initializer element is not computable at load time
drivers/media/video/v4l2-ioctl.c:1848: error: (near initialization for 'v4l2_ioctls[0].<anonymous>.offset')
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will fix build errors:
block/blk-cgroup.c:609:2: error: unknown type name 'atomic64_t'
block/blk-cgroup.c:609:2: error: implicit declaration of function 'ATOMIC64_INIT' [-Werror=implicit-function-declaration]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"fault-injection: add tool to run command with failslab or
fail_page_alloc" added tools/testing/fault-injection/failcmd.sh to make it
easier to inject slab/page allocation failures by fault injection.
failcmd.sh prints the following warning when running with arguments
for command.
# ./failcmd.sh echo aaa
failcmd.sh: line 209: [: echo: binary operator expected
aaa
This warning is caused by an improper check whether at least one
parameter is left after parsing command options.
Fix it by testing the length of $1 instead of $@
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
all pretty straightforward, except one thing.
One of our patches added thermal support for power supply class, but
thermal/ subsystem changed under our feet. We (well, Stephen, that is)
caught the issue and it was decided[1] that I'd just delay the battery
pull request, and then will fix it up by merging upstream back into
battery tree at the specific commit.
That's not all though: another[2] small fixup for thermal subsystem was
needed to get rid of a warning in power supply subsystem (the warning
was not drivers/power's "fault", the thermal registration function just
needed a proper const annotation, which is also done by a small commit
on top of the merge.
So, to sum this up:
- The 'master' branch of the battery tree was in the -next tree for
weeks, was never rebased, altered etc. It should be all OK;
- Although, for-v3.6 tag contains the 'master' branch + merge + the
warning fix.
[1] http://lkml.org/lkml/2012/6/19/23
[2] http://lkml.org/lkml/2012/6/18/28
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQF9V8AAoJEGgI9fZJve1bLvkP/j/Nt1fBud2w5Q/NJr310hYJ
NWIMSJwFbMhPoNd7sESznogXH8eHQ6YJP+CmkA5Gxr0t8pjEEJHEEyEcf1eNv6/c
YDZfDB3TIaeYzulvRUkXMQ1f7hiA5Bq2t13yXeMM19+r9DzNZ51jZ3TXETLkpWZG
BfZPg5wmP0xssXB3fjJMWuW5hVEc503WLpLFXkWfWKMU3PGdy/8DckV/YLvf2l7K
1fkBLZry0gtruKqFbwcXhanP1JQ8FFFO8n1tSVLJhXXoym5twn/5GAgcpcKSFfJg
mkGXAQLLuXKfERBIda7qbQl74HmTzYadCcueeXy1hTpom+VwfOpG+by2t/FrXT/M
aJW6hfSLMgicG8FIuSYqbkutvijU9srU/YI00zrSGDBgi4sGKChRMf4sKQXnHO7X
Lb7csQ7hEWsfG5gkgjRkmgZdhqWYoIxxe5Gv2Z9MKF27mHoSM03KHiZGlDJMrmNs
w4KcU5H9tA62dT/UFszuu7NZenmsVS/ktiHWe5k+EXElZMZRrxKDJk2cvLPkRz/E
VkXGvAmTDYPasZm29yzTnPKcuo6pfeOVJnUWybHOYxhkqAhJu1QzHCatqapgfy8E
F2ODI5FoWtQES96B9t3tY4lTzq/0yUHcbiJt4BsyRcCGP+ggEQjCq0HGqOca12XX
gxE20O3l+YQTCQIYKH+S
=1NaS
-----END PGP SIGNATURE-----
Merge tag 'for-v3.6' of git://git.infradead.org/battery-2.6
Pull battery updates from Anton Vorontsov:
"The tag contains just a few battery-related changes for v3.6. It's is
all pretty straightforward, except one thing.
One of our patches added thermal support for power supply class, but
thermal/ subsystem changed under our feet. We (well, Stephen, that
is) caught the issue and it was decided[1] that I'd just delay the
battery pull request, and then will fix it up by merging upstream back
into battery tree at the specific commit.
That's not all though: another[2] small fixup for thermal subsystem
was needed to get rid of a warning in power supply subsystem (the
warning was not drivers/power's "fault", the thermal registration
function just needed a proper const annotation, which is also done by
a small commit on top of the merge.
So, to sum this up:
- The 'master' branch of the battery tree was in the -next tree for
weeks, was never rebased, altered etc. It should be all OK;
- Although, for-v3.6 tag contains the 'master' branch + merge + the
warning fix.
[1] http://lkml.org/lkml/2012/6/19/23
[2] http://lkml.org/lkml/2012/6/18/28"
* tag 'for-v3.6' of git://git.infradead.org/battery-2.6: (23 commits)
thermal: Constify 'type' argument for the registration routine
olpc-battery: update CHARGE_FULL_DESIGN property for BYD LiFe batteries
olpc-battery: Add VOLTAGE_MAX_DESIGN property
charger-manager: Fix build break related to EXTCON
lp8727_charger: Move header file into platform_data directory
power_supply: Add min/max alert properties for CAPACITY, TEMP, TEMP_AMBIENT
bq27x00_battery: Add support for BQ27425 chip
charger-manager: Set current limit of regulator for over current protection
charger-manager: Use EXTCON Subsystem to detect charger cables for charging
test_power: Add VOLTAGE_NOW and BATTERY_TEMP properties
test_power: Add support for USB AC source
gpio-charger: Use cansleep version of gpio_set_value
bq27x00_battery: Add support for power average and health properties
sbs-battery: Don't trigger false supply_changed event
twl4030_charger: Allow charger to control the regulator that feeds it
twl4030_charger: Add backup-battery charging
twl4030_charger: Fix some typos
max17042_battery: Support CHARGE_COUNTER power supply attribute
smb347-charger: Add constant charge and current properties
power_supply: Add constant charge_current and charge_voltage properties
...
Pull perf updates from Ingo Molnar:
"The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
fixes) now that Arnaldo is back from vacation."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
uprobes: __replace_page() needs munlock_vma_page()
uprobes: Rename vma_address() and make it return "unsigned long"
uprobes: Fix register_for_each_vma()->vma_address() check
uprobes: Introduce vaddr_to_offset(vma, vaddr)
uprobes: Teach build_probe_list() to consider the range
uprobes: Remove insert_vm_struct()->uprobe_mmap()
uprobes: Remove copy_vma()->uprobe_mmap()
uprobes: Fix overflow in vma_address()/find_active_uprobe()
uprobes: Suppress uprobe_munmap() from mmput()
uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
uprobes: Clean up and document write_opcode()->lock_page(old_page)
uprobes: Kill write_opcode()->lock_page(new_page)
uprobes: __replace_page() should not use page_address_in_vma()
uprobes: Don't recheck vma/f_mapping in write_opcode()
perf/x86: Fix missing struct before structure name
perf/x86: Fix format definition of SNB-EP uncore QPI box
perf/x86: Make bitfield unsigned
perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
perf/x86: Add Intel Nehalem-EX uncore support
perf/x86: Fix typo in format definition of uncore PCU filter
...
Pull powerpc updates from Benjamin Herrenschmidt:
"Kumar sent me a handful of Freescale related fixes and I added another
regression fix to the pile.
PS. I -will- eventually learn about that signed tag business :-)"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/kvm/book3s_32: Fix MTMSR_EERI macro
powerpc/85xx: p1022ds: fix DIU/LBC switching with NAND enabled
powerpc/85xx: p1022ds: disable the NAND flash node if video is enabled
powerpc/85xx: Fix sram_offset parameter type
powerpc/85xx: P3041DS - change espi input-clock from 40MHz to 35MHz
powerpc/85xx: Fix pci base address error for p2020rdb-pc in dts
Pull s390 updates from Martin Schwidefsky:
"This it the second batch of s390 patches for the 3.6 merge window.
Included is enablement for two common code changes, killable page
faults and sorted exception tables. And the regular set of cleanup
and bug fix patches."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: make use of user_mode() macro where possible
s390/mm: rename user_mode variable to addressing_mode
s390/mm: fix fault handling for page table walk case
s390/mm: make page faults killable
s390: update defconfig
s390/mm: downgrade page table after fork of a 31 bit process
s390/ipl: Use diagnose 8 command separation
s390/linker script: use RO_DATA_SECTION
s390/exceptions: sort exception table at build time
s390/debug: remove module_exit function / move EXPORT_SYMBOLs
When a device is unregistered, we have to purge all of the
references to it that may exist in the entire system.
If a route is uncached, we currently have no way of accomplishing
this.
So create a global list that is scanned when a network device goes
down. This mirrors the logic in net/core/dst.c's dst_ifdown().
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull nfsd changes from J. Bruce Fields:
"This has been an unusually quiet cycle--mostly bugfixes and cleanup.
The one large piece is Stanislav's work to containerize the server's
grace period--but that in itself is just one more step in a
not-yet-complete project to allow fully containerized nfs service.
There are a number of outstanding delegation, container, v4 state, and
gss patches that aren't quite ready yet; 3.7 may be wilder."
* 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits)
NFSd: make boot_time variable per network namespace
NFSd: make grace end flag per network namespace
Lockd: move grace period management from lockd() to per-net functions
LockD: pass actual network namespace to grace period management functions
LockD: manage grace list per network namespace
SUNRPC: service request network namespace helper introduced
NFSd: make nfsd4_manager allocated per network namespace context.
LockD: make lockd manager allocated per network namespace
LockD: manage grace period per network namespace
Lockd: add more debug to host shutdown functions
Lockd: host complaining function introduced
LockD: manage used host count per networks namespace
LockD: manage garbage collection timeout per networks namespace
LockD: make garbage collector network namespace aware.
LockD: mark host per network namespace on garbage collect
nfsd4: fix missing fault_inject.h include
locks: move lease-specific code out of locks_delete_lock
locks: prevent side-effects of locks_release_private before file_lock is initialized
NFSd: set nfsd_serv to NULL after service destruction
NFSd: introduce nfsd_destroy() helper
...
Input path is mostly run under RCU and doesnt touch dst refcnt
But output path on forwarding or UDP workloads hits
badly dst refcount, and we have lot of false sharing, for example
in ipv4_mtu() when reading rt->rt_pmtu
Using a percpu cache for nh_rth_output gives a nice performance
increase at a small cost.
24 udpflood test on my 24 cpu machine (dummy0 output device)
(each process sends 1.000.000 udp frames, 24 processes are started)
before : 5.24 s
after : 2.06 s
For reference, time on linux-3.5 : 6.60 s
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 404e0a8b6a (net: ipv4: fix RCU races on dst refcounts) tried
to solve a race but added a problem at device/fib dismantle time :
We really want to call dst_free() as soon as possible, even if sockets
still have dst in their cache.
dst_release() calls in free_fib_info_rcu() are not welcomed.
Root of the problem was that now we also cache output routes (in
nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
rt_free(), because output route lookups are done in process context.
Based on feedback and initial patch from David Miller (adding another
call_rcu_bh() call in fib, but it appears it was not the right fix)
I left the inet_sk_rx_dst_set() helper and added __rcu attributes
to nh_rth_output and nh_rth_input to better document what is going on in
this code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull Ceph changes from Sage Weil:
"Lots of stuff this time around:
- lots of cleanup and refactoring in the libceph messenger code, and
many hard to hit races and bugs closed as a result.
- lots of cleanup and refactoring in the rbd code from Alex Elder,
mostly in preparation for the layering functionality that will be
coming in 3.7.
- some misc rbd cleanups from Josh Durgin that are finally going
upstream
- support for CRUSH tunables (used by newer clusters to improve the
data placement)
- some cleanup in our use of d_parent that Al brought up a while back
- a random collection of fixes across the tree
There is another patch coming that fixes up our ->atomic_open()
behavior, but I'm going to hammer on it a bit more before sending it."
Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
rbd: create rbd_refresh_helper()
rbd: return obj version in __rbd_refresh_header()
rbd: fixes in rbd_header_from_disk()
rbd: always pass ops array to rbd_req_sync_op()
rbd: pass null version pointer in add_snap()
rbd: make rbd_create_rw_ops() return a pointer
rbd: have __rbd_add_snap_dev() return a pointer
libceph: recheck con state after allocating incoming message
libceph: change ceph_con_in_msg_alloc convention to be less weird
libceph: avoid dropping con mutex before fault
libceph: verify state after retaking con lock after dispatch
libceph: revoke mon_client messages on session restart
libceph: fix handling of immediate socket connect failure
ceph: update MAINTAINERS file
libceph: be less chatty about stray replies
libceph: clear all flags on con_close
libceph: clean up con flags
libceph: replace connection state bits with states
libceph: drop unnecessary CLOSED check in socket state change callback
libceph: close socket directly from ceph_con_close()
...
drivers/built-in.o: In function `radio_tea5777_set_freq':
radio-tea5777.c:(.text+0x4d8704): undefined reference to `__udivdi3'
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Hans de Goede <hdegoede@redhat.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
We have no mechanism to emulate LOCK_MAND locks on NFSv4, so explicitly
return -EINVAL if someone requests it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
By default a sunrpc service is limited to (N+3)*20 connections
where N is the number of threads. This is 80 when N==1.
If this number is exceeded a warning is printed suggesting that
the number of threads be increased. However with services which
run a single thread, this is impossible.
For such services there is a ->sv_maxconn setting that can be
used to forcibly increase the limit, and silence the message.
This is used by lockd.
The nfs client uses a sunrpc service to handle callbacks and
it too is single-threaded, so to avoid the useless messages,
and to allow a reasonable number of concurrent connections,
we need to set ->sv_maxconn. 1024 seems like a good number.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add PCI device support for VFIO. PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device. PCI config access is virtualized in the kernel,
allowing us to ensure the integrity of the system, by preventing
various accesses while reducing duplicate support across various
userspace drivers. I/O port supports read/write access while
MMIO also supports mmap of sufficiently sized regions. Support
for INTx, MSI, and MSI-X interrupts are provided using eventfds to
userspace.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This VFIO IOMMU backend is designed primarily for AMD-Vi and Intel
VT-d hardware, but is potentially usable by anything supporting
similar mapping functionality. We arbitrarily call this a Type1
backend for lack of a better name. This backend has no IOVA
or host memory mapping restrictions for the user and is optimized
for relatively static mappings. Mapped areas are pinned into system
memory.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO is a secure user level driver for use with both virtual machines
and user level drivers. VFIO makes use of IOMMU groups to ensure the
isolation of devices in use, allowing unprivileged user access. It's
intended that VFIO will replace KVM device assignment and UIO drivers
(in cases where the target platform includes a sufficiently capable
IOMMU).
New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface. We now go back to a model more similar to
original VFIO with UIOMMU support where the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, avoiding the previous privilege issues with this type
of model. IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms. VFIO
users are able to query and initialize the IOMMU model of their
choice.
Please see the follow-on Documentation commit for further description
and usage example.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
thermal_zone_device_register() does not modify 'type' argument, so it is
safe to declare it as const. Otherwise, if we pass a const string, we are
getting the ugly warning:
CC drivers/power/power_supply_core.o
drivers/power/power_supply_core.c: In function 'psy_register_thermal':
drivers/power/power_supply_core.c:204:6: warning: passing argument 1 of 'thermal_zone_device_register' discards 'const' qualifier from pointer target type [enabled by default]
include/linux/thermal.h:140:29: note: expected 'char *' but argument is of type 'const char *'
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Jean Delvare <khali@linux-fr.org>
This merge is performed to take commit c56f5c0342 ("Thermal: Make
Thermal trip points writeable") out of Linus' tree and then fixup power
supply class. This is needed since thermal stuff added a new argument:
CC drivers/power/power_supply_core.o
drivers/power/power_supply_core.c: In function ‘psy_register_thermal’:
drivers/power/power_supply_core.c:204:6: warning: passing argument 3 of ‘thermal_zone_device_register’ makes integer from pointer without a cast [enabled by default]
include/linux/thermal.h:154:29: note: expected ‘int’ but argument is of type ‘struct power_supply *’
drivers/power/power_supply_core.c:204:6: error: too few arguments to function ‘thermal_zone_device_register’
include/linux/thermal.h:154:29: note: declared here
make[1]: *** [drivers/power/power_supply_core.o] Error 1
make: *** [drivers/power/] Error 2
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
commit 13f30fc893e4610f67dd7a8b0b67aec02eac1775
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Sat Apr 21 22:41:10 2012 +0100
mmc: omap: remove private DMA API implementation
removed the private DMA API implementation from the OMAP mmc host to exclusively use the DMA engine API.
Unfortunately OMAP MMC and High Speed MMC host drivers don't support poll mode and only works with DMA.
Since omap2plus_defconfig doesn't enable this feature by default, the
following error is happens on an IGEPv2 Rev.C (and probably on most OMAP boards with MMC support):
[ 2.199981] omap_hsmmc omap_hsmmc.1: unable to obtain RX DMA engine channel 48
[ 2.215087] omap_hsmmc omap_hsmmc.0: unable to obtain RX DMA engine channel 62
Signed-off-by: Javier Martinez Canillas <javier@dowhile0.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
If dma_request_channel() fails (e.g. because DMA enine is not built
into the kernel), the return value from probe is zero causing the
driver to be bound to the device even though probe failed.
To fix, ensure that probe returns an error value when a DMA channel
request fail.
Signed-off-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Remove the private DMA API implementation from nand/omap2.c
making it use entirely the DMA engine API.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Add DMA engine support to the OMAP2 NAND driver. This supplements the
private DMA API implementation contained within this driver, and the
driver can be independently switched at build time between using DMA
engine and the private DMA API.
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Remove the private DMA API implementation from spi-omap2-mcspi.c,
making it use entirely the DMA engine API.
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Add DMA engine support to the OMAP SPI driver. This supplements the
private DMA API implementation contained within this driver, and the
driver can be independently switched at build time between using DMA
engine and the private DMA API for the transmit and receive sides.
Tested-by: Shubhrajyoti <shubhrajyoti@ti.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
DMAengine uses the DMA engine device structure when mapping/unmapping
memory for DMA, so the MMC devices do not need their DMA masks
initialized (this reflects hardware: the MMC device is not the device
doing DMA.)
Tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Remove the private DMA API implementation from omap, making it use
entirely the DMA engine API.
Tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Add DMA engine support to the OMAP driver. This supplements the
private DMA API implementation contained within this driver, and the
driver can be switched at build time between using DMA engine and the
private DMA API.
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Remove the private DMA API implementation from omap_hsmmc, making it
use entirely the DMA engine API.
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Venkatraman S <svenkatr@ti.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>