This avoids duplicating the function in every arch gup_fast.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().
He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().
So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).
While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).
The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages. The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it. A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.
This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://github.com/rustyrussell/linux:
virtio-blk: use ida to allocate disk index
virtio: Add platform bus driver for memory mapped virtio device
virtio: Dont add "config" to list for !per_vq_vector
virtio: console: wait for first console port for early console output
virtio: console: add port stats for bytes received, sent and discarded
virtio: console: make discard_port_data() use get_inbuf()
virtio: console: rename variable
virtio: console: make get_inbuf() return port->inbuf if present
virtio: console: Fix return type for get_inbuf()
virtio: console: Use wait_event_freezable instead of _interruptible
virtio: console: Ignore port name update request if name already set
virtio: console: Fix indentation
virtio: modify vring_init and vring_size to take account of the layout containing *_event_idx
virtio.h: correct comment for struct virtio_driver
virtio-net: Use virtio_config_val() for retrieving config
virtio_config: Add virtio_config_val_len()
virtio-console: Use virtio_config_val() for retrieving config
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
jbd2: Unify log messages in jbd2 code
jbd/jbd2: validate sb->s_first in journal_get_superblock()
ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
ext4: fix a typo in struct ext4_allocation_context
ext4: Don't normalize an falloc request if it can fit in 1 extent.
ext4: remove comments about extent mount option in ext4_new_inode()
ext4: let ext4_discard_partial_buffers handle unaligned range correctly
ext4: return ENOMEM if find_or_create_pages fails
ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
ext4: optimize locking for end_io extent conversion
ext4: remove unnecessary call to waitqueue_active()
ext4: Use correct locking for ext4_end_io_nolock()
ext4: fix race in xattr block allocation path
ext4: trace punch_hole correctly in ext4_ext_map_blocks
ext4: clean up AGGRESSIVE_TEST code
ext4: move variables to their scope
ext4: fix quota accounting during migration
ext4: migrate cleanup
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Cleanup metadata flags handling
udf: Skip mirror metadata FE loading when metadata FE is ok
ext3: Allow quota file use root reservation
udf: Remove web reference from UDF MAINTAINERS entry
quota: Drop path reference on error exit from quotactl
udf: Neaten udf_debug uses
udf: Neaten logging output, use vsprintf extension %pV
udf: Convert printks to pr_<level>
udf: Rename udf_warning to udf_warn
udf: Rename udf_error to udf_err
udf: Promote some debugging messages to udf_error
ext3: Remove the obsolete broken EXT3_IOC32_WAIT_FOR_READONLY.
udf: Add readpages support for udf.
ext3/balloc.c: local functions should be static
ext2: fix the outdated comment in ext2_nfs_get_inode()
ext3: remove deprecated oldalloc
fs/ext3/balloc.c: delete useless initialization
fs/ext2/balloc.c: delete useless initialization
ext3: fix message in ext3_remount for rw-remount case
ext3: Remove i_mutex from ext3_sync_file()
Fix up trivial (printf format cleanup) conflicts in fs/udf/udfdecl.h
I missed to include a patch adding the new silicon revision define
CUT3P3 and remove the retired CUT0 versions of AB8500. Also delete
the reference to the retired AB3550 from the header.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This adds a d_prune dentry operation that is called by the VFS prior to
pruning (i.e. unhashing and killing) a hashed dentry from the dcache.
Wrap dentry_lru_del() and use the new _prune() helper in the cases where we
are about to unhash and kill the dentry.
This will be used by Ceph to maintain a flag indicating whether the
complete contents of a directory are contained in the dcache, allowing it
to satisfy lookups and readdir without addition server communication.
Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other
DCACHE_OP_ bits.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Prevent direct modification of i_nlink by making it const and adding a
non-const __i_nlink alias.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Since the commit below which added O_PATH support to the *at() calls, the
error return for readlink/readlinkat for the empty pathname has switched
from ENOENT to EINVAL:
commit 65cfc67223
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Sun Mar 13 15:56:26 2011 -0400
readlinkat(), fchownat() and fstatat() with empty relative pathnames
This is both unexpected for userspace and makes readlink/readlinkat
inconsistant with all other interfaces; and inconsistant with our stated
return for these pathnames.
As the readlinkat call does not have a flags parameter we cannot use the
AT_EMPTY_PATH approach used in the other calls. Therefore expose whether
the original path is infact entry via a new user_path_at_empty() path
lookup function. Use this to determine whether to default to EINVAL or
ENOENT for failures.
Addresses http://bugs.launchpad.net/bugs/817187
[akpm@linux-foundation.org: remove unused getname_flags()]
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
* 'next/dt' of git://git.linaro.org/people/arnd/arm-soc:
ARM: gic: use module.h instead of export.h
ARM: gic: fix irq_alloc_descs handling for sparse irq
ARM: gic: add OF based initialization
ARM: gic: add irq_domain support
irq: support domains with non-zero hwirq base
of/irq: introduce of_irq_init
ARM: at91: add at91sam9g20 and Calao USB A9G20 DT support
ARM: at91: dt: at91sam9g45 family and board device tree files
arm/mx5: add device tree support for imx51 babbage
arm/mx5: add device tree support for imx53 boards
ARM: msm: Add devicetree support for msm8660-surf
msm_serial: Add devicetree support
msm_serial: Use relative resources for iomem
Fix up conflicts in arch/arm/mach-at91/{at91sam9260.c,at91sam9g45.c}
* 'next/cleanup2' of git://git.linaro.org/people/arnd/arm-soc: (31 commits)
ARM: OMAP: Warn if omap_ioremap is called before SoC detection
ARM: OMAP: Move set_globals initialization to happen in init_early
ARM: OMAP: Map SRAM later on with ioremap_exec()
ARM: OMAP: Remove calls to SRAM allocations for framebuffer
ARM: OMAP: Avoid cpu_is_omapxxxx usage until map_io is done
ARM: OMAP1: Use generic map_io, init_early and init_irq
arm/dts: OMAP3+: Add mpu, dsp and iva nodes
arm/dts: OMAP4: Add a main ocp entry bound to l3-noc driver
ARM: OMAP2+: l3-noc: Add support for device-tree
ARM: OMAP2+: board-generic: Add i2c static init
ARM: OMAP2+: board-generic: Add DT support to generic board
arm/dts: Add support for OMAP3 Beagle board
arm/dts: Add initial device tree support for OMAP3 SoC
arm/dts: Add support for OMAP4 SDP board
arm/dts: Add support for OMAP4 PandaBoard
arm/dts: Add initial device tree support for OMAP4 SoC
ARM: OMAP: omap_device: Add a method to build an omap_device from a DT node
ARM: OMAP: omap_device: Add omap_device_[alloc|delete] for DT integration
of: Add helpers to get one string in multiple strings property
ARM: OMAP2+: devices: Remove all omap_device_pm_latency structures
...
Fix up trivial header file conflicts in arch/arm/mach-omap2/board-generic.c
* 'next/pm' of git://git.linaro.org/people/arnd/arm-soc: (66 commits)
ARM: CSR: PM: use outer_resume to resume L2 cache
ARM: CSR: call l2x0_of_init to init L2 cache of SiRFprimaII
ARM: OMAP: voltage: voltage layer present, even when CONFIG_PM=n
ARM: CSR: PM: add sleep entry for SiRFprimaII
ARM: CSR: PM: save/restore irq status in suspend cycle
ARM: CSR: PM: save/restore timer status in suspend cycle
OMAP4: PM: TWL6030: add cmd register
OMAP4: PM: TWL6030: fix ON/RET/OFF voltages
OMAP4: PM: TWL6030: address 0V conversions
OMAP4: PM: TWL6030: fix uv to voltage for >0x39
OMAP4: PM: TWL6030: fix voltage conversion formula
omap: voltage: add a stub header file for external/regulator use
OMAP2+: VC: more registers are per-channel starting with OMAP5
OMAP3+: voltage: update nominal voltage in voltdm_scale() not VC post-scale
OMAP3+: voltage: rename omap_voltage_get_nom_volt -> voltdm_get_voltage
OMAP3+: voltdm: final removal of omap_vdd_info
OMAP3+: voltage: move/rename curr_volt from vdd_info into struct voltagedomain
OMAP3+: voltage: rename scale and reset functions using voltdm_ prefix
OMAP3+: VP: combine setting init voltage into common function
OMAP3+: VP: remove unused omap_vp_get_curr_volt()
...
Fix up trivial conflict in arch/arm/mach-prima2/l2x0.c (code removal vs
edit)
This patch, based on virtio PCI driver, adds support for memory
mapped (platform) virtio device. This should allow environments
like qemu to use virtio-based block & network devices even on
platforms without PCI support.
One can define and register a platform device which resources
will describe memory mapped control registers and "mailbox"
interrupt. Such device can be also instantiated using the Device
Tree node with compatible property equal "virtio,mmio".
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Michael S.Tsirkin <mst@redhat.com>
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Based on the layout description in the comments, take account of
the *_event_idx in functions vring_init and vring_size.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch adds virtio_config_val_len() which allows retrieving variable
length data from the virtio config space only if a specific feature is on.
Cc: Amit Shah <amit.shah@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'for-linus/i2c-3.2' of git://git.fluff.org/bjdooks/linux: (47 commits)
i2c-s3c2410: Add device tree support
i2c-s3c2410: Keep a copy of platform data and use it
i2c-nomadik: cosmetic coding style corrections
i2c-au1550: dev_pm_ops conversion
i2c-au1550: increase timeout waiting for master done
i2c-au1550: remove unused ack_timeout
i2c-au1550: remove usage of volatile keyword
i2c-tegra: __iomem annotation fix
i2c-eg20t: Add initialize processing in case i2c-error occurs
i2c-eg20t: Fix flag setting issue
i2c-eg20t: add stop sequence in case wait-event timeout occurs
i2c-eg20t: Separate error processing
i2c-eg20t: Fix 10bit access issue
i2c-eg20t: Modify returned value s32 to long
i2c-eg20t: Fix bus-idle waiting issue
i2c-designware: Fix PCI core warning on suspend/resume
i2c-designware: Add runtime power management support
i2c-designware: Add support for Designware core behind PCI devices.
i2c-designware: Push all register reads/writes into the core code.
i2c-designware: Support multiple cores using same ISR
...
* 'for-linus' of git://opensource.wolfsonmicro.com/regulator: (22 commits)
regulator: Constify constraints name
regulator: Fix possible nullpointer dereference in regulator_enable()
regulator: gpio-regulator add dependency on GENERIC_GPIO
regulator: Add module.h include to gpio-regulator
regulator: Add driver for gpio-controlled regulators
regulator: remove duplicate REG_CTRL2 defines in tps65023
regulator: Clarify documentation for regulator-regulator supplies
regulator: Fix some bitrot in the machine driver documentation
regulator: tps65023: Added support for the similiar TPS65020 chip
regulator: tps65023: Setting correct core regulator for tps65021
regulator: tps65023: Set missing bit for update core-voltage
regulator: tps65023: Fixes i2c configuration issues
regulator: Add debugfs file showing the supply map table
regulator: tps6586x: add SMx slew rate setting
regulator: tps65023: Fixes i2c configuration issues
regulator: tps6507x: Remove num_voltages array
regulator: max8952: removed unused mutex.
regulator: fix regulator/consumer.h kernel-doc warning
regulator: Ensure enough enable time for max8649
regulator: 88pm8607: Fix off-by-one value range checking in the case of no id is matched
...
* 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore: make pstore write function return normal success/fail value
pstore: change mutex locking to spin_locks
pstore: defer inserting OOPS entries into pstore
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (62 commits)
mlx4_core: Deprecate log_num_vlan module param
IB/mlx4: Don't set VLAN in IBoE WQEs' control segment
IB/mlx4: Enable 4K mtu for IBoE
RDMA/cxgb4: Mark QP in error before disabling the queue in firmware
RDMA/cxgb4: Serialize calls to CQ's comp_handler
RDMA/cxgb3: Serialize calls to CQ's comp_handler
IB/qib: Fix issue with link states and QSFP cables
IB/mlx4: Configure extended active speeds
mlx4_core: Add extended port capabilities support
IB/qib: Hold links until tuning data is available
IB/qib: Clean up checkpatch issue
IB/qib: Remove s_lock around header validation
IB/qib: Precompute timeout jiffies to optimize latency
IB/qib: Use RCU for qpn lookup
IB/qib: Eliminate divide/mod in converting idx to egr buf pointer
IB/qib: Decode path MTU optimization
IB/qib: Optimize RC/UC code by IB operation
IPoIB: Use the right function to do DMA unmap pages
RDMA/cxgb4: Use correct QID in insert_recv_cqe()
RDMA/cxgb4: Make sure flush CQ entries are collected on connection close
...
* git://github.com/herbertx/crypto: (48 commits)
crypto: user - Depend on NET instead of selecting it
crypto: user - Add dependency on NET
crypto: talitos - handle descriptor not found in error path
crypto: user - Initialise match in crypto_alg_match
crypto: testmgr - add twofish tests
crypto: testmgr - add blowfish test-vectors
crypto: Make hifn_795x build depend on !ARCH_DMA_ADDR_T_64BIT
crypto: twofish-x86_64-3way - fix ctr blocksize to 1
crypto: blowfish-x86_64 - fix ctr blocksize to 1
crypto: whirlpool - count rounds from 0
crypto: Add userspace report for compress type algorithms
crypto: Add userspace report for cipher type algorithms
crypto: Add userspace report for rng type algorithms
crypto: Add userspace report for pcompress type algorithms
crypto: Add userspace report for nivaead type algorithms
crypto: Add userspace report for aead type algorithms
crypto: Add userspace report for givcipher type algorithms
crypto: Add userspace report for ablkcipher type algorithms
crypto: Add userspace report for blkcipher type algorithms
crypto: Add userspace report for ahash type algorithms
...
Remove edac_mce pieces and use the normal MCE decoder notifier chain by
retaining the same functionality with considerably less code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This patch exports several definitions that used to live under
include/net/netfilter/nf_nat.h. These definitions, although not
exported, have been used by iptables and other userspace
applications like miniupnpd since long time. Basically, these
userspace tools included some internal definition of the required
structures and they assume no changes in the binary representation
(which is OK indeed).
To resolve this situation, this patch makes public the required
structure and install them in INSTALL_HDR_PATH.
See: https://bugs.gentoo.org/376873, for more information.
This patch is heavily based on the initial patch sent by:
Anthony G. Basile <blueness@gentoo.org>
Which was entitled:
netfilter: export sanitized nf_nat.h to INSTALL_HDR_PATH
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes parameter name of skb_frag_dmamap function to silence warning on
make htmldocs.
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoth Andrew:
- Most of MM. Still waiting for the poweroc guys to get off their
butts and review some threaded hugepages patches.
- alpha
- vfs bits
- drivers/misc
- a few core kerenl tweaks
- printk() features
- MAINTAINERS updates
- backlight merge
- leds merge
- various lib/ updates
- checkpatch updates
* akpm: (127 commits)
epoll: fix spurious lockdep warnings
checkpatch: add a --strict check for utf-8 in commit logs
kernel.h/checkpatch: mark strict_strto<foo> and simple_strto<foo> as obsolete
llist-return-whether-list-is-empty-before-adding-in-llist_add-fix
wireless: at76c50x: follow rename pack_hex_byte to hex_byte_pack
fat: follow rename pack_hex_byte() to hex_byte_pack()
security: follow rename pack_hex_byte() to hex_byte_pack()
kgdb: follow rename pack_hex_byte() to hex_byte_pack()
lib: rename pack_hex_byte() to hex_byte_pack()
lib/string.c: fix strim() semantics for strings that have only blanks
lib/idr.c: fix comment for ida_get_new_above()
lib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef
lib/bitmap.c: quiet sparse noise about address space
lib/spinlock_debug.c: print owner on spinlock lockup
lib/kstrtox: common code between kstrto*() and simple_strto*() functions
drivers/leds/leds-lp5521.c: check if reset is successful
leds: turn the blink_timer off before starting to blink
leds: save the delay values after a successful call to blink_set()
drivers/leds/leds-gpio.c: use gpio_get_value_cansleep() when initializing
drivers/leds/leds-lm3530.c: add __devexit_p where needed
...
Mark obsolete/deprecated strict_strto<foo> and simple_strto<foo> functions
and macros as obsolete.
Update checkpatch to warn about their use.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As suggested by Andrew Morton in [1] there is better to have most
significant part first in the function name.
[1] https://lkml.org/lkml/2011/9/20/22
There is no functional change.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Mimi Zohar <zohar@us.ibm.com>
Cc: James Morris <jmorris@namei.org>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: "John W. Linville" <linville@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the platform_data include directory for the TPU LED driver, as
suggested by Paul Mundt.
Signed-off-by: Magnus Damm <damm@opensource.se>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add V2 of the LED driver for a single timer channel for the TPU hardware
block commonly found in Renesas SoCs.
The driver has been written with optimal Power Management in mind, so to
save power the LED is driven as a regular GPIO pin in case of maximum
brightness and power off which allows the TPU hardware to be idle and
which in turn allows the clocks to be stopped and the power domain to be
turned off transparently.
Any other brightness level requires use of the TPU hardware in PWM mode.
TPU hardware device clocks and power are managed through Runtime PM.
System suspend and resume is known to be working - during suspend the LED
is set to off by the generic LED code.
The TPU hardware timer is equipeed with a 16-bit counter together with an
up-to-divide-by-64 prescaler which makes the hardware suitable for
brightness control. Hardware blink is unsupported.
The LED PWM waveform has been verified with a Fluke 123 Scope meter on a
sh7372 Mackerel board. Tested with experimental sh7372 A3SP power domain
patches. Platform device bind/unbind tested ok.
V2 has been tested on the DS2 LED of the sh73a0-based AG5EVM.
[axel.lin@gmail.com: include linux/module.h]
Signed-off-by: Magnus Damm <damm@opensource.se>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The regulator support in the l4f00242t03 is very non-idiomatic. Rather
than requesting the regulators based on the device name and the supply
names used by the device the driver requires boards to pass system
specific supply names around through platform data. The driver also
conditionally requests the regulators based on this platform data, adding
unneeded conditional code to the driver.
Fix this by removing the platform data and converting to the standard
idiom, also updating all in tree users of the driver. As no datasheet
appears to be available for the LCD I'm guessing the names for the
supplies based on the existing users and I've no ability to do anything
more than compile test.
The use of regulator_set_voltage() in the driver is also problematic,
since fixed voltages are required the expectation would be that the
voltages would be fixed in the constraints set by the machines rather than
manually configured by the driver, but is less problematic.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The regulator API contains a range of features for stubbing itself out
when not in use and for transparently restricting the actual effect of
regulator API calls where they can't be supported on a particular system
so that drivers don't need to individually implement this. Simplify the
driver slightly by making use of this idiom.
The only in tree user is ecovec24 which does not use the regulator API.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no compact_zone_order() user outside file scope, so make it static.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add __attribute__((format (printf...) to the function to validate format
and arguments. Use vsprintf extension %pV to avoid any possible message
interleaving. Coalesce format string. Convert printks/pr_warning to
pr_warn.
[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a proggy like this:
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (5UL*1024*1024*1024)
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
gettimeofday(&oldstamp, NULL);
p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
gettimeofday(&newstamp, NULL);
diffsec = newstamp.tv_sec - oldstamp.tv_sec;
diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
printf("usec %ld\n", diffsec);
if (p == MAP_FAILED || p4 != p3)
//if (p == MAP_FAILED)
perror("mremap"), exit(1);
if (memcmp(p4, p2, SIZE))
printf("mremap bug\n"), exit(1);
printf("ok\n");
return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPI on large SMP it'd
create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the
results.
The only problem for native 2M mremap like it happens above both the
source and destination address must be 2M aligned or the hugepmd can't be
moved without a split but that is an hardware limitation.
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SPARC32 require access to the start address. Add a new helper
memblock_start_of_DRAM() to give access to the address of the first
memblock - which contains the lowest address.
The awkward name was chosen to match the already present
memblock_end_of_DRAM().
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct. It, however, may access pages
field of vm_struct where a page was not allocated. This results in a null
pointer access and leads to a kernel panic.
Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node(). In other words, it is added to vmlist before it is
fully initialized. At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info(). Thus, a null pointer access happens.
The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized. So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.
The function name and prototype are stolen from logfs and the
implementation is from SLUB.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When direct reclaim encounters a dirty page, it gets recycled around the
LRU for another cycle. This patch marks the page PageReclaim similar to
deactivate_page() so that the page gets reclaimed almost immediately after
the page gets cleaned. This is to avoid reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU that might
have been something like a use-once page.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd. Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low). That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;
It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;
xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request
As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied. In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.
The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.
A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.
There is a potential new problem as reclaim has less control over how long
before a page in a particularly zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.
Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written.
Patch 2 removes dead code in lumpy reclaim as it is no longer able
to synchronously write pages. This hurts lumpy reclaim but
there is an expectation that compaction is used for hugepage
allocations these days and lumpy reclaim's days are numbered.
Patches 3-4 add warnings to XFS and ext4 if called from
direct reclaim. With patch 1, this "never happens" and is
intended to catch regressions in this logic in the future.
Patch 5 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.
Patch 6 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.
Patch 7 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them
I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.
I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings. The command line for fs_mark when
botted with 512M looked something like
./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760
The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM. For multiple threads,
the -d switch is specified multiple times.
The test machine is x86-64 with an older generation of AMD processor with
4 cores. The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available. Swap is on a
separate disk. Dirty ratio was tuned to 40% instead of the default of
20%.
Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.
Here is a summary of results based on testing XFS.
512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%)
512M1P-xfs Elapsed Time fsmark 51.41 48.29
512M1P-xfs Elapsed Time simple-wb 114.09 108.61
512M1P-xfs Elapsed Time mmap-strm 113.46 109.34
512M1P-xfs Kswapd efficiency fsmark 62% 63%
512M1P-xfs Kswapd efficiency simple-wb 56% 61%
512M1P-xfs Kswapd efficiency mmap-strm 44% 42%
512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%)
512M-xfs Elapsed Time fsmark 56.08 48.90
512M-xfs Elapsed Time simple-wb 112.22 98.13
512M-xfs Elapsed Time mmap-strm 219.15 196.67
512M-xfs Kswapd efficiency fsmark 54% 56%
512M-xfs Kswapd efficiency simple-wb 54% 55%
512M-xfs Kswapd efficiency mmap-strm 45% 44%
512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%)
512M-4X-xfs Elapsed Time fsmark 63.26 55.88
512M-4X-xfs Elapsed Time simple-wb 100.90 90.25
512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38
512M-4X-xfs Kswapd efficiency fsmark 49% 50%
512M-4X-xfs Kswapd efficiency simple-wb 54% 56%
512M-4X-xfs Kswapd efficiency mmap-strm 37% 36%
512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%)
512M-16X-xfs Elapsed Time fsmark 67.47 58.25
512M-16X-xfs Elapsed Time simple-wb 103.22 90.89
512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82
512M-16X-xfs Kswapd efficiency fsmark 45% 46%
512M-16X-xfs Kswapd efficiency simple-wb 53% 55%
512M-16X-xfs Kswapd efficiency mmap-strm 33% 33%
Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely. The time to completion for all tests is improved which is an
important result. In general, kswapd efficiency is not affected by
skipping dirty pages.
1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%)
1024M1P-xfs Elapsed Time fsmark 84.14 80.41
1024M1P-xfs Elapsed Time simple-wb 210.77 184.78
1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34
1024M1P-xfs Kswapd efficiency fsmark 69% 75%
1024M1P-xfs Kswapd efficiency simple-wb 71% 77%
1024M1P-xfs Kswapd efficiency mmap-strm 43% 44%
1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%)
1024M-xfs Elapsed Time fsmark 94.59 91.00
1024M-xfs Elapsed Time simple-wb 229.84 195.08
1024M-xfs Elapsed Time mmap-strm 405.38 440.29
1024M-xfs Kswapd efficiency fsmark 79% 71%
1024M-xfs Kswapd efficiency simple-wb 74% 74%
1024M-xfs Kswapd efficiency mmap-strm 39% 42%
1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%)
1024M-4X-xfs Elapsed Time fsmark 103.33 97.74
1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57
1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88
1024M-4X-xfs Kswapd efficiency fsmark 81% 70%
1024M-4X-xfs Kswapd efficiency simple-wb 73% 72%
1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38%
1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%)
1024M-16X-xfs Elapsed Time fsmark 103.11 99.11
1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24
1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82
1024M-16X-xfs Kswapd efficiency fsmark 84% 69%
1024M-16X-xfs Kswapd efficiency simple-wb 74% 73%
1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40%
All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases
In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped. The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim. In other words, roughly the same number of pages were reclaimed
but swapping was slower. As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.
4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%)
4608M1P-xfs Elapsed Time fsmark 512.01 492.15
4608M1P-xfs Elapsed Time simple-wb 618.18 566.24
4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07
4608M1P-xfs Kswapd efficiency fsmark 93% 86%
4608M1P-xfs Kswapd efficiency simple-wb 88% 84%
4608M1P-xfs Kswapd efficiency mmap-strm 46% 45%
4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%)
4608M-xfs Elapsed Time fsmark 555.96 532.34
4608M-xfs Elapsed Time simple-wb 659.72 571.85
4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38
4608M-xfs Kswapd efficiency fsmark 89% 91%
4608M-xfs Kswapd efficiency simple-wb 88% 82%
4608M-xfs Kswapd efficiency mmap-strm 48% 46%
4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%)
4608M-4X-xfs Elapsed Time fsmark 592.91 564.00
4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07
4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53
4608M-4X-xfs Kswapd efficiency fsmark 90% 94%
4608M-4X-xfs Kswapd efficiency simple-wb 87% 82%
4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43%
4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%)
4608M-16X-xfs Elapsed Time fsmark 602.69 585.78
4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81
4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86
4608M-16X-xfs Kswapd efficiency fsmark 98% 98%
4608M-16X-xfs Kswapd efficiency simple-wb 88% 82%
4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42%
Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.
There are other indications that this is an improvement as well. For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced. KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable
In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher. However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.
On a laptop, I plugged in a USB stick and ran a similar tests of tests
using it as backing storage. A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.
1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
1024M-xfs Elapsed Time fsmark 2053.52 1641.03
1024M-xfs Elapsed Time simple-wb 1229.53 768.05
1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
1024M-xfs Kswapd efficiency fsmark 84% 85%
1024M-xfs Kswapd efficiency simple-wb 92% 81%
1024M-xfs Kswapd efficiency mmap-strm 60% 51%
1024M-xfs Avg wait ms fsmark 5404.53 4473.87
1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53
The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory. On the postive side, firefox launched
marginally faster with the patches applied. Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive. It was
also the case that "Avg wait ms" on the root filesystem was lower. I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.
This patch: do not writeback filesystem pages in direct reclaim.
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.
This causes two problems. First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex. Some
filesystems ignore write requests from direct reclaim as a result. The
second is that a single-page flush is inefficient in terms of IO. While
there is an expectation that the elevator will merge requests, this does
not always happen. Quoting Christoph Hellwig;
The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists. There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add comments to explain the page statistics field in the mm_struct.
[akpm@linux-foundation.org: add missing ;]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".
The difference between mlocking and pinning is:
A. mlocked pages are marked with PG_mlocked and are exempt from
swapping. Page migration may move them around though.
They are kept on a special LRU list.
B. Pinned pages cannot be moved because something needs to
directly access physical memory. They may not be on any
LRU list.
I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:
Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.
This patch introduces a separate counter for pinned pages and
accounts them seperately.
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_set_oom_score_adj() was introduced in 72788c3856 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.
Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated. That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.
To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86. It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy. The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.
The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing. The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:
- it is possible that the OOM_DISABLE thread sharing memory with the
victim is waiting on that thread to exit and will actually cause
future memory freeing, and
- there is no guarantee that a thread is disabled from oom killing just
because another thread sharing its mm is oom disabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless,
we have isolated mapped page and re-add it into LRU's head. It's
unnecessary CPU overhead and makes LRU churning.
Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped. So it could be
migrated. But race is rare and although it happens, it's no big deal.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In async mode, compaction doesn't migrate dirty or writeback pages. So,
it's meaningless to pick the page and re-add it to lru list.
Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback. So it could be migrated. But it's very unlikely as
isolate and migration cycle is much faster than writeout.
So, this patch helps cpu overhead and prevent unnecessary LRU churning.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.
Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags. INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."
This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.
The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)
As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.
There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.
Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.
There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:
http://marc.info/?l=linux-mm&m=130105930902915&w=2
There is a basic man page for the proposed interface here:
http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:
http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86_64 allnoconfig:
In file included from arch/x86/kernel/pci-dma.c:3:
include/linux/dmar.h:248: warning: 'struct acpi_dmar_header' declared inside parameter list
include/linux/dmar.h:248: warning: its scope is only this definition or declaration, which is probably not what you want
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recent commit "irq: Track the owner of irq descriptor" in
commit ID b6873807a7 placed module.h into linux/irq.h
but we are trying to limit module.h inclusion to just C files
that really need it, due to its size and number of children
includes. This targets just reversing that include.
Add in the basic "struct module" since that is all we really need
to ensure things compile. In theory, b687380 should have added the
module.h include to the irqdesc.h header as well, but the implicit
module.h everywhere presence masked this from showing up. So give
it the "struct module" as well.
As for the C files, irqdesc.c is only using THIS_MODULE, so it
does not need module.h - give it export.h instead. The C file
irq/manage.c is now (as of b687380) using try_module_get and
module_put and so it needs module.h (which it already has).
Also convert the irq_alloc_descs variants to macros, since all
they really do is is call the __irq_alloc_descs primitive.
This avoids including export.h and no debug info is lost.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The <linux/module.h> pretty much brings in the kitchen sink along
with it, so it should be avoided wherever reasonably possible in
terms of being included from other commonly used <linux/something.h>
files, as it results in a measureable increase on compile times.
The worst culprit was probably device.h since it is used everywhere.
This file also had an implicit dependency/usage of mutex.h which was
masked by module.h, and is also fixed here at the same time.
There are over a dozen other headers that simply declare the
struct instead of pulling in the whole file, so follow their lead
and simply make it a few more.
Most of the implicit dependencies on module.h being present by
these headers pulling it in have been now weeded out, so we can
finally make this change with hopefully minimal breakage.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The original implementations reference THIS_MODULE in an inline.
We could include <linux/export.h>, but it is better to avoid chaining.
Fortunately someone else already thought of this, and made a similar
inline into a #define in <linux/device.h> for device_schedule_callback(),
[see commit 523ded71de] so follow that precedent here.
Also bubble up any __must_check that were used on the prev. wrapper inline
functions up one to the real __register functions, to preserve any prev.
sanity checks that were used in those instances.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The <linux/crypto.h> (which is in turn in common headers
like tcp.h) wants to use module_name() in an inline fcn.
But having all of <linux/module.h> along for the ride is
overkill and slows down compiles by a measureable amount,
since it in turn includes lots of headers.
Since the inline is never used anywhere in the kernel[1],
we can just remove it, and then also remove the module.h
include as well.
In all the many crypto modules, there were some relying on
crypto.h including module.h -- for them we now explicitly
call out module.h for inclusion.
[1] git grep shows some staging drivers also define the same
static inline, but they also never ever use it.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Once we clean up the implicit presence of module.h (and all its
sub-includes), we'll see an implicit dependency on page.h for
the PAGE_SIZE define. So fix it in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This file was getting notifier.h via device.h --> module.h but
the module.h inclusion is going away, so add notifier.h directly.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The implicit presence of module.h and all its sub-includes was
masking these implicit header usages:
include/linux/dmaengine.h:684: warning: 'struct page' declared inside parameter list
include/linux/dmaengine.h:684: warning: its scope is only this definition or declaration, which is probably not what you want
include/linux/dmaengine.h:687: warning: 'struct page' declared inside parameter list
include/linux/dmaengine.h:736:2: error: implicit declaration of function 'bitmap_zero'
With input from Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
By removing the implicit presence of module.h from this file, we
will see things like:
In file included from fs/dlm/user.c:9:
include/linux/miscdevice.h:50: error: field ‘list’ has incomplete type
include/linux/miscdevice.h:54: error: expected specifier-qualifier-list before ‘mode_t’
Call out lists.h and types.h for inclusion to fix each of the
above respectively.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This will show up on MIPS when we fix all the implicit header presences
that are because of module.h being everywhere.
In file included from kernel/trace/ftrace.c:16:
include/linux/stop_machine.h: In function 'stop_one_cpu':
include/linux/stop_machine.h:50: error: implicit declaration of function 'smp_processor_id'
include/linux/stop_machine.h: In function 'stop_cpus':
include/linux/stop_machine.h:80: error: implicit declaration of function 'raw_smp_processor_id'
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
It shows up as a build failure on MIPS, as it is used in
three of_property function stubs.
include/linux/of.h:275: error: 'ENOSYS' undeclared (first use in this function)
include/linux/of.h:282: error: 'ENOSYS' undeclared (first use in this function)
include/linux/of.h:295: error: 'ENOSYS' undeclared (first use in this function)
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
There is nothing modular in this file, and no reason to drag
in all the 357 headers that module.h brings with it, since
it just slows down compiles.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This file has a define MODULE_ALIAS_MISCDEV which in turn will
use the MODULE_ALIAS define, but only if the former is explicitly
used by modular device driver code (and such code should be
already including module.h).
Delete the include, since module.h is such a giant thing that we
don't want it implicitly sneaking into compiles where it isn't
specifically required.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
There is nothing modular in this file, and no reason to drag
in all the extra headers that module.h brings with it, since
it just slows down compiles.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The <linux/module.h> pretty much brings in the kitchen sink along
with it, so it should be avoided wherever reasonably possible in
terms of being included from other commonly used <linux/something.h>
files, as it results in a measureable increase on compile times.
There doesn't appear to be any module specifics in this file.
The obvious people who were relying on the presence of
the vast amount of stuff module.h sucked in have been fixed.
If other files are implicitly relying on it, then lets see who
they are and fix them too.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This file consists of nothing other than things like:
#ifdef CONFIG_FOO
#define ....
There is no reason for it to require module.h
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This was implicitly appearing by way of module.h -- but when
we fix that, we'll get this:
In file included from drivers/regulator/dummy.c:21:
include/linux/regulator/driver.h:197: error: field 'notifier' has incomplete type
make[3]: *** [drivers/regulator/dummy.o] Error 1
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (348 commits)
[media] pctv452e: Remove bogus code
[media] adv7175: Make use of media bus pixel codes
[media] media: vb2: fix incorrect return value
[media] em28xx: implement VIDIOC_ENUM_FRAMESIZES
[media] cx23885: Stop the risc video fifo before reconfiguring it
[media] cx23885: Avoid incorrect error handling and reporting
[media] cx23885: Avoid stopping the risc engine during buffer timeout
[media] cx23885: Removed a spurious function cx23885_set_scale()
[media] cx23885: v4l2 api compliance, set the audioset field correctly
[media] cx23885: hook the audio selection functions into the main driver
[media] cx23885: add generic functions for dealing with audio input selection
[media] cx23885: fixes related to maximum number of inputs and range checking
[media] cx23885: Initial support for the MPX-885 mini-card
[media] cx25840: Ensure AUDIO6 and AUDIO7 trigger line-in baseband use
[media] cx23885: Enable audio line in support from the back panel
[media] cx23885: Allow the audio mux config to be specified on a per input basis
[media] cx25840: Enable support for non-tuner LR1/LR2 audio inputs
[media] cx23885: Name an internal i2c part and declare a bitfield by name
[media] cx23885: Ensure VBI buffers timeout quickly - bugfix for vbi hangs during streaming
[media] cx23885: remove channel dump diagnostics when a vbi buffer times out
...
Fix up trivial conflicts in drivers/misc/altera-stapl/altera.c (header
file rename vs add)
Allow userspace dm log implementations to register their log device so it
is no longer missing from the list of device dependencies.
When device mapper targets use a device they normally call dm_get_device
which includes it in the device list returned to userspace applications
such as LVM through the DM_TABLE_DEPS ioctl. Userspace log devices
don't use dm_get_device as userspace opens them so they are missing from
the list of dependencies.
This patch extends the DM_ULOG_CTR operation to allow userspace to
respond with the name of the log device (if appropriate) to be
registered via 'dm_get_device'. DM_ULOG_REQUEST_VERSION is incremented.
This is backwards compatible. If the kernel and userspace log server
have both been updated, the new information will be passed down to the
kernel and the device will be registered. If the kernel is new, but
the log server is old, the log server will not pass down any device
information and the kernel will simply bypass the device registration
as before. If the kernel is old but the log server is new, the log
server will see the old version number and not pass the device info.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
with any other target type, and once loaded into a device, it cannot be
replaced with a table containing a different type.
The thin provisioning pool device will use this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target
does not support read-only mode.
The initial implementation of the thin provisioning target uses this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce the concept of a singleton table which contains exactly one target.
If a target type sets the DM_TARGET_SINGLETON feature bit device-mapper
will ensure that any table that includes that target contains no others.
The thin provisioning pool target uses this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch introduces dm_kcopyd_zero() to make it easy to use
kcopyd to write zeros into the requested areas instead
instead of copying. It is implemented by passing a NULL
copying source to dm_kcopyd_copy().
The forthcoming thin provisioning target uses this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
printk_ratelimit() shares global ratelimiting state with all
other subsystems, so its usage is discouraged. Instead,
define and use dm's local state.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
As we'll need to use those structs for trace functions, they should
be on a more public place. So, move struct mem_ctl_info & friends
to edac.h.
No functional changes on this patch.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Current pnfs_layoutcommit_inode can not handle parallel layoutcommit.
And as Trond suggested , there is no need for client to optimize for
parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to
mark inflight layoutcommit and serialize lalyoutcommit with it.
Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue
layoutcommit.
Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There are files which use module_param and MODULE_PARM_DESC
back to back. They only include moduleparam.h which makes sense,
but the implicit presence of module.h everywhere hid the fact
that MODULE_PARM_DESC wasn't in moduleparam.h at all. Relocate
the macro to moduleparam.h so that the moduleparam infrastructure
can be used independently of module.h
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
A lot of files pull in module.h when all they are really
looking for is the basic EXPORT_SYMBOL functionality. The
recent data from Ingo[1] shows that this is one of several
instances that has a significant impact on compile times,
and it should be targeted for factoring out (as done here).
Note that several commonly used header files in include/*
directly include <linux/module.h> themselves (some 34 of them!)
The most commonly used ones of these will have to be made
independent of module.h before the full benefit of this change
can be realized.
We also transition THIS_MODULE from module.h to export.h,
since there are lots of files with subsystem structs that
in turn will have a struct module *owner and only be doing:
.owner = THIS_MODULE;
and absolutely nothing else modular. So, we also want to have
the THIS_MODULE definition present in the lightweight header.
[1] https://lkml.org/lkml/2011/5/23/76
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Interrupt controllers can have non-zero starting value for h/w irq numbers.
Adding support in irq_domain allows the domain hwirq numbering to match
the interrupt controllers' numbering.
As this makes looping over irqs for a domain more complicated, add loop
iterators to iterate over all hwirqs and irqs for a domain.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Reviewed-by: Jamie Iles <jamie@jamieiles.com>
Tested-by: Thomas Abraham <thomas.abraham@linaro.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
of_irq_init will scan the devicetree for matching interrupt controller
nodes. Then it calls an initialization function for each found controller
in the proper order with parent nodes initialized before child nodes.
Based on initial pseudo code from Grant Likely.
Changes in v4:
- Drop unnecessary empty list check
- Be more verbose on errors
- Simplify "if (!desc) WARN_ON(1)" to "if (WARN_ON(!desc))"
Changes in v3:
- add missing kfree's found by Jamie
- Implement Grant's comments to simplify the init loop
- fix function comments
Changes in v2:
- Complete re-write of list searching code from Grant Likely
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Reviewed-by: Jamie Iles <jamie@jamieiles.com>
Tested-by: Thomas Abraham <thomas.abraham@linaro.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
In the OCZ RevoDrive3/zDrive R4 series, the "OCZ SuperScale Storage
Controller" with "Virtualized Controller Architecture 2.0" really seems
to be a Marvell 88SE9485 part, with OCZ firmware/BIOS.
Developed and tested on OCZ RevoDrive3 120GB [PCI 1b85:1021]
Should work on:
- OCZ RevoDrive3 (2x SandForce 2281)
- OCZ RevoDrive3 X2 (4x SandForce 2281)
- OCZ zDrive R4 CM84 (4x SandForce 2281)
- OCZ zDrive R4 CM88 (8x SandForce 2281)
- OCZ zDrive R4 RM84 (4x SandForce 2582)
- OCZ zDrive R4 RM88 (8x SandForce 2582)
All of this because a friend recently bought a OCZ RevoDrive3 and was
bitten by the lack of Linux support.
Notes from testing:
-------------------
- SMART works.
- VPD Device Identification is "OCZ-REVODRIVE3"
- Thin provisioning/TRIM seems to be implemented as WRITE SAME UNMAP,
with deterministic (non-zero) read after TRIM, but I'm not sure if it
works 100% in my testing.
- Some of the tuning in the firmware seems to ensure much better
performance when in a RAID0 setup than using the two devices
seperately.
I have not tested booting from the SSD, because all of this was
developed and tested remotely from the actual hardware.
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org>
Thanks-To: Gordon Pritchard <gordp@sfu.ca>
Acked-by: Xiangliang Yu <yuxiangl@marvell.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
i2c: Functions for byte-swapped smbus_write/read_word_data
i2c-algo-pca: Return standard fault codes
i2c-algo-bit: Return standard fault codes
i2c-algo-bit: Be verbose on bus testing failure
i2c-algo-bit: Let user test buses without failing
i2c/scx200_acb: Fix section mismatch warning in scx200_pci_drv
i2c: I2C_ELEKTOR should depend on HAS_IOPORT
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
iommu/core: Remove global iommu_ops and register_iommu
iommu/msm: Use bus_set_iommu instead of register_iommu
iommu/omap: Use bus_set_iommu instead of register_iommu
iommu/vt-d: Use bus_set_iommu instead of register_iommu
iommu/amd: Use bus_set_iommu instead of register_iommu
iommu/core: Use bus->iommu_ops in the iommu-api
iommu/core: Convert iommu_found to iommu_present
iommu/core: Add bus_type parameter to iommu_domain_alloc
Driver core: Add iommu_ops to bus_type
iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
iommu/amd: Fix wrong shift direction
iommu/omap: always provide iommu debug code
iommu/core: let drivers know if an iommu fault handler isn't installed
iommu/core: export iommu_set_fault_handler()
iommu/omap: Fix build error with !IOMMU_SUPPORT
iommu/omap: Migrate to the generic fault report mechanism
iommu/core: Add fault reporting mechanism
iommu/core: Use PAGE_SIZE instead of hard-coded value
iommu/core: use the existing IS_ALIGNED macro
iommu/msm: ->unmap() should return order of unmapped page
...
Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
options" just happened to touch lines next to each other.
* 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (75 commits)
KVM: SVM: Keep intercepting task switching with NPT enabled
KVM: s390: implement sigp external call
KVM: s390: fix register setting
KVM: s390: fix return value of kvm_arch_init_vm
KVM: s390: check cpu_id prior to using it
KVM: emulate lapic tsc deadline timer for guest
x86: TSC deadline definitions
KVM: Fix simultaneous NMIs
KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode
KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
KVM: x86 emulator: streamline decode of segment registers
KVM: x86 emulator: simplify OpMem64 decode
KVM: x86 emulator: switch src decode to decode_operand()
KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
KVM: x86 emulator: free up some flag bits near src, dst
KVM: x86 emulator: switch src2 to generic decode_operand()
KVM: x86 emulator: expand decode flags to 64 bits
KVM: x86 emulator: split dst decode to a generic decode_operand()
KVM: x86 emulator: move memop, memopp into emulation context
...
* 'fbdev-next' of git://github.com/schandinat/linux-2.6: (270 commits)
video: platinumfb: Add __devexit_p at necessary place
drivers/video: fsl-diu-fb: merge diu_pool into fsl_diu_data
drivers/video: fsl-diu-fb: merge diu_hw into fsl_diu_data
drivers/video: fsl-diu-fb: only DIU modes 0 and 1 are supported
drivers/video: fsl-diu-fb: remove unused panel operating mode support
drivers/video: fsl-diu-fb: use an enum for the AOI index
drivers/video: fsl-diu-fb: add several new video modes
drivers/video: fsl-diu-fb: remove broken screen blanking support
drivers/video: fsl-diu-fb: move some definitions out of the header file
drivers/video: fsl-diu-fb: fix some ioctls
video: da8xx-fb: Increased resolution configuration of revised LCDC IP
OMAPDSS: picodlp: add missing #include <linux/module.h>
fb: fix au1100fb bitrot.
mx3fb: fix NULL pointer dereference in screen blanking.
video: irq: Remove IRQF_DISABLED
smscufx: change edid data to u8 instead of char
OMAPDSS: DISPC: zorder support for DSS overlays
OMAPDSS: DISPC: VIDEO3 pipeline support
OMAPDSS/OMAP_VOUT: Fix incorrect OMAP3-alpha compatibility setting
video/omap: fix build dependencies
...
Fix up conflicts in:
- drivers/staging/xgifb/XGI_main_26.c
Changes to XGIfb_pan_var()
- drivers/video/omap/{lcd_apollon.c,lcd_ldp.c,lcd_overo.c}
Removed (or in the case of apollon.c, merged into the generic
DSS panel in drivers/video/omap2/displays/panel-generic-dpi.c)
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity. A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.
The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.
And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
For a ERESTARTNOHAND/ERESTARTSYS/ERESTARTNOINTR restarting system call
do_signal will prepare the restart of the system call with a rewind of
the PSW before calling get_signal_to_deliver (where the debugger might
take control). For A ERESTART_RESTARTBLOCK restarting system call
do_signal will set -EINTR as return code.
There are two issues with this approach:
1) strace never sees ERESTARTNOHAND, ERESTARTSYS, ERESTARTNOINTR or
ERESTART_RESTARTBLOCK as the rewinding already took place or the
return code has been changed to -EINTR
2) if get_signal_to_deliver does not return with a signal to deliver
the restart via the repeat of the svc instruction is left in place.
This opens a race if another signal is made pending before the
system call instruction can be reexecuted. The original system call
will be restarted even if the second signal would have ended the
system call with -EINTR.
These two issues can be solved by dropping the early rewind of the
system call before get_signal_to_deliver has been called and by using
the TIF_RESTART_SVC magic to do the restart if no signal has to be
delivered. The only situation where the system call restart via the
repeat of the svc instruction is appropriate is when a SA_RESTART
signal is delivered to user space.
Unfortunately this breaks inferior calls by the debugger again. The
system call number and the length of the system call instruction is
lost over the inferior call and user space will see ERESTARTNOHAND/
ERESTARTSYS/ERESTARTNOINTR/ERESTART_RESTARTBLOCK. To correct this a
new ptrace interface is added to save/restore the system call number
and system call instruction length.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch introduces a mechanism that allows architecture backends to
remove page tables for the crashkernel memory. This can protect the loaded
kdump kernel from being overwritten by broken kernel code. Two new
functions crash_map_reserved_pages() and crash_unmap_reserved_pages() are
added that can be implemented by architecture code. The
crash_map_reserved_pages() function is called before and
crash_unmap_reserved_pages() after the crashkernel segments are loaded. The
functions are also called in crash_shrink_memory() to create/remove page
tables when the crashkernel memory size is reduced.
To support architectures that have large pages this patch also introduces
a new define KEXEC_CRASH_MEM_ALIGN. The crashkernel start and size must
always be aligned with KEXEC_CRASH_MEM_ALIGN.
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently only the address of the pre-allocated ELF header is passed with
the elfcorehdr= kernel parameter. In order to reserve memory for the header
in the 2nd kernel also the size is required. Current kdump architecture
backends use different methods to do that, e.g. x86 uses the memmap= kernel
parameter. On s390 there is no easy way to transfer this information.
Therefore the elfcorehdr kernel parameter is extended to also pass the size.
This now can also be used as standard mechanism by all future kdump
architecture backends.
The syntax of the kernel parameter is extended as follows:
elfcorehdr=[size[KMG]@]offset[KMG]
This change is backward compatible because elfcorehdr=size is still allowed.
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On s390 there is a different KEXEC_CONTROL_MEMORY_LIMIT for the normal and
the kdump kexec case. Therefore this patch introduces a new macro
KEXEC_CRASH_CONTROL_MEMORY_LIMIT. This is set to
KEXEC_CONTROL_MEMORY_LIMIT for all architectures that do not define
KEXEC_CRASH_CONTROL_MEMORY_LIMIT.
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reimplemented at least 17 times discounting error mangling cases
where it could be used.
Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Implement sigp external call, which might be required for guests that
issue an external call instead of an emergency signal for IPI.
This fixes an issue with "KVM: unknown SIGP: 0x02" when booting
such an SMP guest.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>