Merge IPC cleanup and scalability patches from Andrew Morton.
This cleans up many of the oddities in the IPC code, uses the list
iterator helpers, splits out locking and adds per-semaphore locks for
greater scalability of the IPC semaphore code.
Most normal user-level locking by now uses futexes (ie pthreads, but
also a lot of specialized locks), but SysV IPC semaphores are apparently
still used in some big applications, either for portability reasons, or
because they offer tracking and undo (and you don't need to have a
special shared memory area for them).
Our IPC semaphore scalability was pitiful. We used to lock much too big
ranges, and we used to have a single ipc lock per ipc semaphore array.
Most loads never cared, but some do. There are some numbers in the
individual commits.
* ipc-scalability:
ipc: sysv shared memory limited to 8TiB
ipc/msg.c: use list_for_each_entry_[safe] for list traversing
ipc,sem: fine grained locking for semtimedop
ipc,sem: have only one list in struct sem_queue
ipc,sem: open code and rename sem_lock
ipc,sem: do not hold ipc lock more than necessary
ipc: introduce lockless pre_down ipcctl
ipc: introduce obtaining a lockless ipc object
ipc: remove bogus lock comment for ipc_checkid
ipc/msgutil.c: use linux/uaccess.h
ipc: refactor msg list search into separate function
ipc: simplify msg list search
ipc: implement MSG_COPY as a new receive mode
ipc: remove msg handling from queue scan
ipc: set EFAULT as default error in load_msg()
ipc: tighten msg copy loops
ipc: separate msg allocation from userspace copy
ipc: clamp with min()
Trying to run an application which was trying to put data into half of
memory using shmget(), we found that having a shmall value below 8EiB-8TiB
would prevent us from using anything more than 8TiB. By setting
kernel.shmall greater than 8EiB-8TiB would make the job work.
In the newseg() function, ns->shm_tot which, at 8TiB is INT_MAX.
ipc/shm.c:
458 static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
459 {
...
465 int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
...
474 if (ns->shm_tot + numpages > ns->shm_ctlall)
475 return -ENOSPC;
[akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
Signed-off-by: Robin Holt <holt@sgi.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ipc/msg.c code does its list operations by hand and it open-codes the
accesses, instead of using for_each_entry_[safe].
Signed-off-by: Nikola Pajkovsky <npajkovs@redhat.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce finer grained locking for semtimedop, to handle the common case
of a program wanting to manipulate one semaphore from an array with
multiple semaphores.
If the call is a semop manipulating just one semaphore in an array with
multiple semaphores, only take the lock for that semaphore itself.
If the call needs to manipulate multiple semaphores, or another caller is
in a transaction that manipulates multiple semaphores, the sem_array lock
is taken, as well as all the locks for the individual semaphores.
On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:
vanilla Davidlohr's Davidlohr's + Davidlohr's +
threads patches rwlock patches v3 patches
10 610652 726325 1783589 2142206
20 341570 365699 1520453 1977878
30 288102 307037 1498167 2037995
40 290714 305955 1612665 2256484
50 288620 312890 1733453 2650292
60 289987 306043 1649360 2388008
70 291298 306347 1723167 2717486
80 290948 305662 1729545 2763582
90 290996 306680 1736021 2757524
100 292243 306700 1773700 3059159
[davidlohr.bueso@hp.com: do not call sem_lock when bogus sma]
[davidlohr.bueso@hp.com: make refcounter atomic]
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Emmanuel Benisty <benisty.e@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Having only one list in struct sem_queue, and only queueing simple
semaphore operations on the list for the semaphore involved, allows us to
introduce finer grained locking for semtimedop.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock()
later that only locks the sem_array and does nothing else.
Open code the locking from ipc_lock() in sem_obtain_lock() so we can
introduce finer grained locking for the sem_array in the next patch.
[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and
check permissions, mostly for IPC_RMID and IPC_SET commands.
Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
The locking version is retained, yet modified to call the nolock version
without affecting its semantics, thus transparent to all ipc callers.
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Through ipc_lock() and therefore ipc_lock_check() we currently return the
locked ipc object. This is not necessary for all situations and can,
therefore, cause unnecessary ipc lock contention.
Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
functions that only lookup and return the ipc object.
Both these functions must be called within the RCU read critical section.
[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series makes the sysv semaphore code more scalable, by reducing the
time the semaphore lock is held, and making the locking more scalable for
semaphore arrays with multiple semaphores.
The first four patches were written by Davidlohr Buesso, and reduce the
hold time of the semaphore lock.
The last three patches change the sysv semaphore code locking to be more
fine grained, providing a performance boost when multiple semaphores in a
semaphore array are being manipulated simultaneously.
On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:
vanilla Davidlohr's Davidlohr's + Davidlohr's +
threads patches rwlock patches v3 patches
10 610652 726325 1783589 2142206
20 341570 365699 1520453 1977878
30 288102 307037 1498167 2037995
40 290714 305955 1612665 2256484
50 288620 312890 1733453 2650292
60 289987 306043 1649360 2388008
70 291298 306347 1723167 2717486
80 290948 305662 1729545 2763582
90 290996 306680 1736021 2757524
100 292243 306700 1773700 3059159
This patch:
There is no reason to be holding the ipc lock while reading ipcp->seq,
hence remove misleading comment.
Also simplify the return value for the function.
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Teach the helper routines about MSG_COPY so that msgtyp is preserved as
the message number to copy.
The security functions affected by this change were audited and no
additional changes are necessary.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for refactoring the queue scan into a separate
function, relocate msg copying.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
feature this merge window is a new ioctl EXT4_IOC_SWAP_BOOT which
allows installation of a hidden inode designed for boot loaders.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJRfpDwAAoJENNvdpvBGATwsjwP/17V3AN6XTEhZK80p3/qN5YD
N2QIHeyYIqCGpczLs2TQEkxWX6nqpDggAPXY956wvgeEMQV+pQ+DLO4Ol9+p5WD2
hrklleYhtOjFQ3Xh4lqrEi5FzKVzWagVDLqgUjALJ+D+hkDB7ZQT/fm2sH45rzot
xBp3aVqANU8GqAAbEW4/Ng9ZGMx0dpANiU2svbjM71sv2dCLFmWAkz+GgZsMbuJZ
vnKIZP6I6plwP3LuZzEbVCA7F2PzC4ywEOJKjIEvgHpX6uMDR3FX8pD5Dlo/o6e2
eP+KLnD43mJMxBmTn22x5Sm0N6DUzJCEELRJWB9wCZoLdEvbEWRxT3qsPXfLWelG
2jj4bImXF2CqYEsJww5FV2WdXXdnuM57pZym5vMZGAFyKPSCJobA4Y3XRdXkBfXf
Gq/cFoPYv2EcBIhz3zrRj+tbY8esbO9wOnF6+x+AF10BspD2V7nuoVdWVhOf0A3v
i9ifGPwLk3e3xHr9oXheo7IWn52oviZeyD77d7D7MLhgn+xU4LaVhW3R63Q+mI4D
0TXG25R1CVcE7wyFy3gqSVXSCDO0JcQBL5LgcL+wAGXcHPAXqBpN2DFTPo+9fJH2
g3YMwr+wMbci1XRVQ2vdTt/nBZYjOCh6PgRmg3KjTz11Ra5EsjQvYjKWYwqf2RGn
QhCgbzd/qtZfNJztLvr7
=GCT2
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Mostly performance and bug fixes, plus some cleanups. The one new
feature this merge window is a new ioctl EXT4_IOC_SWAP_BOOT which
allows installation of a hidden inode designed for boot loaders."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
ext4: fix type-widening bug in inode table readahead code
ext4: add check for inodes_count overflow in new resize ioctl
ext4: fix Kconfig documentation for CONFIG_EXT4_DEBUG
ext4: fix online resizing for ext3-compat file systems
jbd2: trace when lock_buffer in do_get_write_access takes a long time
ext4: mark metadata blocks using bh flags
buffer: add BH_Prio and BH_Meta flags
ext4: mark all metadata I/O with REQ_META
ext4: fix readdir error in case inline_data+^dir_index.
ext4: fix readdir error in the case of inline_data+dir_index
jbd2: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
ext4: mext_insert_extents should update extent block checksum
ext4: move quota initialization out of inode allocation transaction
ext4: reserve xattr index for Rich ACL support
jbd2: reduce journal_head size
ext4: clear buffer_uninit flag when submitting IO
ext4: use io_end for multiple bios
ext4: make ext4_bio_write_page() use BH_Async_Write flags
ext4: Use kstrtoul() instead of parse_strtoul()
ext4: defragmentation code cleanup
...
Pull Hexagon fixes from Richard Kuo:
"Changes for the Hexagon architecture (and one touching OpenRISC).
They include various fixes to make use of additional arch features and
cleanups. The largest functional change is a cleanup of the signal
and event return paths"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel: (32 commits)
Hexagon: add v4 CS regs to core copyout macro
Hexagon: use correct translation for VMALLOC_START
Hexagon: use correct translations for DMA mappings
Hexagon: fix return value for notify_resume case in do_work_pending
Hexagon: fix signal number for user mem faults
Hexagon: remove two Kconfig entries
arch: remove CONFIG_GENERIC_FIND_NEXT_BIT again
Hexagon: update copyright dates
Hexagon: add translation types for __vmnewmap
Hexagon: fix signal.c compile error
Hexagon: break up user fn/arg register setting
Hexagon: use generic sys_fork, sys_vfork, and sys_clone
Hexagon: fix psp/sp macro
Hexagon: fix up int enable/disable at ret_from_fork
Hexagon: add IOMEM and _relaxed IO macros
Hexagon: switch to using the device type for IO mappings
Hexagon: don't print info for offline CPU's
Hexagon: add support for single-stepping (v4+)
Hexagon: use correct work mask when checking for more work
Hexagon: add support for additional exceptions
...
We first tried to avoid updating atime/mtime entirely (commit
b0de59b573: "TTY: do not update atime/mtime on read/write"), and then
limited it to only update it occasionally (commit 37b7f3c765: "TTY:
fix atime/mtime regression"), but it turns out that this was both
insufficient and overkill.
It was insufficient because we let people attach to the shared ptmx node
to see activity without even reading atime/mtime, and it was overkill
because the "only once a minute" means that you can't really tell an
idle person from an active one with 'w'.
So this tries to fix the problem properly. It marks the shared ptmx
node as un-notifiable, and it lowers the "only once a minute" to a few
seconds instead - still long enough that you can't time individual
keystrokes, but short enough that you can tell whether somebody is
active or not.
Reported-by: Simon Kirby <sim@hostway.ca>
Acked-by: Jiri Slaby <jslaby@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions
Add debugfs support to make it easier to print debug information
about the dma-buf buffers.
Cc: Dave Airlie <airlied@redhat.com>
[minor fixes on init and warning fix]
Cc: Dan Carpenter <dan.carpenter@oracle.com>
[remove double unlock in fail case]
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
For debugging purposes, it is useful to have a name-string added
while exporting buffers. Hence, dma_buf_export() is replaced with
dma_buf_export_named(), which additionally takes 'exp_name' as a
parameter.
For backward compatibility, and for lazy exporters who don't wish to
name themselves, a #define dma_buf_export() is also made available,
which adds a __FILE__ instead of 'exp_name'.
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
[Thanks for the idea!]
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
The Kconfig entries for HEXAGON_VM and HEXAGON_ANGEL_TRAPS were added,
together with the configuration and makefiles for the Hexagon
architecture, in v3.2. They have never been used. They can safely be
removed.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
[rkuo@codeaurora.org: adjust for line changes in Kconfig]
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
CONFIG_GENERIC_FIND_NEXT_BIT was removed in v3.0, but reappeared in two
architectures. Remove it again.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Acked-by: Jonas Bonn <jonas@southpole.se>
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
Based on feedback from Al Viro; previous-stack-pointer and
user reg for same should always be kept consistent.
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
Check return coming out of check_work_pending, and if copy_thread
passed us a function in r24, call it. Based on feedback from Al
Viro.
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
Uncached on our architecture can still have side effects
such as canceled/replayed transactions; device type prevents
this.
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
Use the exact number of pages needed to be mapped pre-VA-jump,
then map 896MB afterwards, which the arch mem init will fix up.
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
This fixes the return value of sigreturn and moves the work pending check
into a c routine for readability and fixes the loop for multiple pending
signals. Based on feedback from Al Viro.
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>