linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-11-29 23:16:39 +07:00

Author	SHA1	Message	Date
Sachin Prabhu	466bd31bbd	cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache When reading a single page with cifs_readpage(), we make a call to fscache_read_or_alloc_page() which once done, asynchronously calls the completion function cifs_readpage_from_fscache_complete(). This completion function unlocks the page once it has been populated from cache. The module then attempts to unlock the page a second time in cifs_readpage() which leads to warning messages. In case of a successful call to fscache_read_or_alloc_page() we should skip the second unlock_page() since this will be called by the cifs_readpage_from_fscache_complete() once the page has been populated by fscache. With the modifications to cifs_readpage_worker(), we will need to re-grab the page lock in cifs_write_begin(). The problem was first noticed when testing new fscache patches for cifs. https://bugzilla.redhat.com/show_bug.cgi?id=1005737 Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-13 16:24:49 -05:00
Sachin Prabhu	a9e9b7bc15	cifs: Do not take a reference to the page in cifs_readpage_worker() We do not need to take a reference to the pagecache in cifs_readpage_worker() since the calling function will have already taken one before passing the pointer to the page as an argument to the function. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-13 16:24:43 -05:00
Linus Torvalds	bdbdfdef57	Some more low risk cleanup patches: Remove unnecessary pci_set_drvdata in k10temp driver from Jingoo Han Fix return values in several drivers from Sachin Kamat Remove redundant break in amc6821 driver from Sachin Kamat -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJSMeiEAAoJEMsfJm/On5mBi78P/jBVis7/4KDVFV1KAy1NhzAD EzxRN7X2sM3Iihj1wjMflcW6k9JSzIbC/Q82ouvVNzlgOZu85yoi926p0PIH7A/g E/X4hlictBzTsc9SjTGio9E6f59wFroBSDnquEo6JRCgdX9DqeaGzh8W/0PR5wSh SkZgXGoXxAK1fH26UZ1sXf2k4y4d6xNGVAv/HIb3jHu1EhDTb0cHzg+5m+qvdSuD 5RBfVJMEiJITCHAhP9IdiWP6k8iePLnaBaR77J6vAxWrSBD6a3D4loDkd/FguqGJ svkyldm5CB0bhHCMEP6IwGNeyr1W/D2OcueYmMDc99HIY+zAOjhAnge1skBIr6A/ hmNf7UT3xH0AQMp4I1ALgdu1Ie95lq3COGXwKy1ir/BVCYAeWC1MWb7qtmRcxqyt pwc9DHaplIRLoKM+CJEwV6o+gaP9L6+BiBn15g5Rv0RA7vvpl0d/Yrjx4ChWL06K paQcxvML/WUjq5uBIFzJmpfXhsUYcMC+YlPRwYhmEql0nTAFWnLejQrIMeSiH0VW ADWlN6DFJz75B7HnVnP9jk4H0ljMS9FJBhC2IkH1pDBrtLZs8ChCGXc4md0LwT7T 1yB2AQGUhb/izDXlw/0I7Q7mJYgYfggaXg1LkRWrV9zjetvzwnArMQ/K4AOiV6yo Kxx5tUAkMaFzobqowgvQ =Lcfb -----END PGP SIGNATURE----- Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon fixes from Guenter Roeck: "Some more low risk cleanup patches: - Remove unnecessary pci_set_drvdata in k10temp driver from Jingoo Han - Fix return values in several drivers from Sachin Kamat - Remove redundant break in amc6821 driver from Sachin Kamat" * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: (k10temp) remove unnecessary pci_set_drvdata() hwmon: (tmp421) Fix return value hwmon: (amc6821) Remove redundant break hwmon: (amc6821) Fix return value hwmon: (ibmaem) Fix return value hwmon: (emc2103) Fix return value	2013-09-13 10:58:41 -07:00
Linus Torvalds	6700215140	Xtensa patchset for v3.12 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJSMiG+AAoJEI9vqH3mFV2sLDwP/04Zt2Wvurdwd3tAW2fgJ3c/ RJ2nQwt1w/YhMVUz/52QOtFiOrYF8fldpS3+51FphRiBPZa9oWPafCGWLotMnnfB myVeJ9xArscVpzLCAMONaBazE39/HoHGHtsRSjn8WGymbDByIH8PDwA6zSX9zPTu HuwdAH+x40wEEN6zFxQJyS4tdnHszOrJfHozwYZkSuXApHUkfBxxRQ5teV5u7ozF PSRfuNjiKs9BfDobhU7olIGx+ccUspYY695B9i+ChTNkgVZDSz+HKymYAzCMuPUr z++1qJCp5jR08/48X2UMwevpZr9NuR7Xf1hFGZ/tplCx0DBaTYi4sotviKPINp8R GuVH7SMkVdR4SdarigfoRpKSB/RZ7PvyfAP5bFfFTc8gQR8R8VLLQDp+9D2j2aeU BKxUVFGgXj65hEQaiTJrXXNrSciGE7I64CBGgmmvOGjo5pD8m9hcRaD2HhHontdr N/aM6ryRxadssoFoeo3KXVhnm0X7AxuIjYWexnc7BR3w7lG2VA6hh0DSuI5B1h05 E/oWIsZWLseSojCuPIpPTpgFidx5lG4KYBA/irz5wi2bsFwVkVzGTNFzKe4Vaki2 R4FxBVan7NuxEcS2gjBhkonPKlyCiTWLFGQcrzNY75sDIASmzpBQWjVe8J12Z+T7 V3z8DwIcJuVdZcoyKRth =O8Zr -----END PGP SIGNATURE----- Merge tag 'xtensa-next-20130912' of git://github.com/czankel/xtensa-linux Pull Xtensa updates from Chris Zankel. * tag 'xtensa-next-20130912' of git://github.com/czankel/xtensa-linux: xtensa: Fix broken allmodconfig build xtensa: remove CCOUNT_PER_JIFFY xtensa: fix !CONFIG_XTENSA_CALIBRATE_CCOUNT build failure xtensa: don't use echo -e needlessly xtensa: new fast_alloca handler xtensa: keep a3 and excsave1 on entry to exception handlers xtensa: enable kernel preemption xtensa: check thread flags atomically on return from user exception	2013-09-13 10:57:48 -07:00
Linus Torvalds	9bf12df31f	Merge git://git.kvack.org/~bcrl/aio-next Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed" * git://git.kvack.org/~bcrl/aio-next: (22 commits) aio: rcu_read_lock protection for new rcu_dereference calls aio: fix race in ring buffer page lookup introduced by page migration support aio: fix rcu sparse warnings introduced by ioctx table lookup patch aio: remove unnecessary debugging from aio_free_ring() aio: table lookup: verify ctx pointer staging/lustre: kiocb->ki_left is removed aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" aio: be defensive to ensure request batching is non-zero instead of BUG_ON() aio: convert the ioctx list to table lookup v3 aio: double aio_max_nr in calculations aio: Kill ki_dtor aio: Kill ki_users aio: Kill unneeded kiocb members aio: Kill aio_rw_vect_retry() aio: Don't use ctx->tail unnecessarily aio: io_cancel() no longer returns the io_event aio: percpu ioctx refcount aio: percpu reqs_available aio: reqs_active -> reqs_available aio: fix build when migration is disabled ...	2013-09-13 10:55:58 -07:00
Linus Torvalds	399a946edb	Merge branch 'genirq' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull generic hardirq option removal from Martin Schwidefsky: "All architectures now use generic hardirqs, s390 has been last to switch. With that the code under !CONFIG_GENERIC_HARDIRQS and the related HAVE_GENERIC_HARDIRQS and GENERIC_HARDIRQS config options can be removed. Yay!" * 'genirq' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: Remove GENERIC_HARDIRQ config option	2013-09-13 07:31:38 -07:00
Linus Torvalds	183c420323	Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig fix from Michal Marek: "This is a fix for a regression caused by my previous pull request. A sed command in scripts/config that used colons as separator was accidentally changed to use slashes, which fails when you use slashes in a value. Changing it back to colons is of course not a proper fix, but at least it will be broken in the same way it had been for four years. A proper fix is pending" * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: scripts/config: fix variable substitution command	2013-09-13 07:30:17 -07:00
Linus Torvalds	951a730af4	blackfin updates for Linux 3.12 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJSMoABAAoJEJommM3PjknHeK0QAMCWvahMa3bHu1lPbvYoUKZ8 lFYWye3RLpXQQgLcKTfVH/qoci1I4ssr5MmChZ88TmAZgojRlPk95rBu0pX08+dI 6Ro6rfAYCd6LT06YFb1hOYzBkqmz2SCY1R+MKLPu2kzTC06lnd+iF2sEpBpyBSTq YyW42ZRgOcGf+/iqiBUo112ZbP2V00jQIeQNyvgwF7GKy+lx86SYxzsIukteSCgI W1pNNUnshXT9sH8tGLLtHEkAkYzSkL0mDLdpztkYKiVqXUaSZAz2jfz6CDqfTiNj i+wqTG02NN8lMSH8no8Eko9svzuGAmQVYQiSCL2y0Xesy4P2HW2B5uzVD0oKQXp3 dKAmzUlhoSakAcq/6Rf11HYNYfeCN0T1VqnDt4U/OrIIq/WbK0qPsRtawYw0A5tZ 4uOCZqxfcUW4y0y6TXBRToFb4Fa0vmGX3WGi6DnW6wU1PaEnW14tkqqvCOA9EUHr SZePAkPldRqLxCaJkhqS5eh6SlkObO81Nxtq7D0a1KkT6e2pYToq7QPKECitfs4U q5Q6PxKbxrLBRQRIYmU62dQYGsMinwaqjkOsyU0+e89iA4NGu+vq/SVRgYWiSxWF BHFW+OBZdYvhyTVYRF5ruSPdMxR6uUVSgMwYPYUUKaXgJ8qV9zaRhrFtdsYmweVc AVH7Iay5w9WSO7WghTTS =Shbz -----END PGP SIGNATURE----- Merge tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux Pull blackfin updates from Steven Miao. * tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux: blackfin: Ignore generated uImages blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x. bf609: adv7343: add S-Video and Component output support bf609: add adv7343 video encoder support clock: add stmmac clock for ethernet driver blackfin: scb: Add SCB1 to SCB9 config options and data. blackfin: scb: Add system crossbar init code.	2013-09-13 07:23:49 -07:00
Linus Torvalds	0898d2aa9d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes a 7+ year race condition in the crypto API that causes sporadic crashes when multiple threads load the same algorithm. It also fixes the crct10dif algorithm again to prevent boot failures on systems where the initramfs tool ignores module softdeps" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: crct10dif - Add fallback for broken initrds crypto: api - Fix race condition in larval lookup	2013-09-13 07:11:14 -07:00
Markos Chandras	1b4676330a	MIPS: kernel: vpe: Make vpe_attrs an array of pointers. Commit `567b21e973` "mips: convert vpe_class to use dev_groups" broke the build on MIPS since vpe_attrs should be an array of 'struct device_attribute' pointers. Fixes the following build problem: arch/mips/kernel/vpe.c:1372:2: error: missing braces around initializer [-Werror=missing-braces] arch/mips/kernel/vpe.c:1372:2: error: (near initialization for 'vpe_attrs[0]') [-Werror=missing-braces] Cc: Ralf Baechle <ralf@linux-mips.org> Cc: John Crispin <blogic@openwrt.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5819/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2013-09-13 15:12:48 +02:00
Martin Schwidefsky	0244ad004a	Remove GENERIC_HARDIRQ config option After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2013-09-13 15:09:52 +02:00
Clement Chauplannaz	86eb781889	scripts/config: fix variable substitution command Commit 229455bc02b87f7128f190c4491b4ceffff38648 accidentally changed the separator between sed `s' command and its parameters from ':' to '/'. Revert this change. Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Clement Chauplannaz <chauplac@gmail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>	2013-09-13 13:06:59 +02:00
Leonid Yegoshin	670bac3a8c	MIPS: Fix SMP core calculations when using MT support. The TCBIND register is only available if the core has MT support. It should not be read otherwise. Secondly, the number of TCs (siblings) are calculated differently depending on if the kernel is configured as SMVP or SMTC. Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5822/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2013-09-13 11:59:51 +02:00
Maciej W. Rozycki	5359b938c0	MIPS: DECstation I/O ASIC DMA interrupt handling fix This change complements commit d0da7c002f7b2a93582187a9e3f73891a01d8ee4 and brings clear_ioasic_irq back, renaming it to clear_ioasic_dma_irq at the same time, to make I/O ASIC DMA interrupts functional. Unlike ordinary I/O ASIC interrupts DMA interrupts need to be deasserted by software by writing 0 to the respective bit in I/O ASIC's System Interrupt Register (SIR), similarly to how CP0.Cause.IP0 and CP0.Cause.IP1 bits are handled in the CPU (the difference is SIR DMA interrupt bits are R/W0C so there's no need for an RMW cycle). Otherwise the handler is reentered over and over again. The only current user is the DEC LANCE Ethernet driver and its extremely uncommon DMA memory error handler that does not care when exactly the interrupt is cleared. Anticipating the use of DMA interrupts by the Zilog SCC driver this change however exports clear_ioasic_dma_irq for device drivers to choose the right application-specific sequence to clear the request explicitly rather than calling it implicitly in the .irq_eoi handler of `struct irq_chip'. Previously these interrupts were cleared in the .end handler of the said structure, before it was removed. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5826/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2013-09-13 11:57:40 +02:00
Maciej W. Rozycki	daed1285c3	MIPS: DECstation HRT initialization rearrangement Not all I/O ASIC versions have the free-running counter implemented, an early revision used in the 5000/1xx models aka 3MIN and 4MIN did not have it. Therefore we cannot unconditionally use it as a clock source. Fortunately if not implemented its register slot has a fixed value so it is enough if we check for the value at the end of the calibration period being the same as at the beginning. This also means we need to look for another high-precision clock source on the systems affected. The 5000/1xx can have an R4000SC processor installed where the CP0 Count register can be used as a clock source. Unfortunately all the R4k DECstations suffer from the missed timer interrupt on CP0 Count reads erratum, so we cannot use the CP0 timer as a clock source and a clock event both at a time. However we never need an R4k clock event device because all DECstations have a DS1287A RTC chip whose periodic interrupt can be used as a clock source. This gives us the following four configuration possibilities for I/O ASIC DECstations: 1. No I/O ASIC counter and no CP0 timer, e.g. R3k 5000/1xx (3MIN). 2. No I/O ASIC counter but the CP0 timer, i.e. R4k 5000/150 (4MIN). 3. The I/O ASIC counter but no CP0 timer, e.g. R3k 5000/240 (3MAX+). 4. The I/O ASIC counter and the CP0 timer, e.g. R4k 5000/260 (4MAX+). For #1 and #2 this change stops the I/O ASIC free-running counter from being installed as a clock source of a 0Hz frequency. For #2 it also arranges for the CP0 timer to be used as a clock source rather than a clock event device, because having an accurate wall clock is more important than a high-precision interval timer. For #3 there is no change. For #4 the change makes the I/O ASIC free-running counter installed as a clock source so that the CP0 timer can be used as a clock event device. Unfortunately the use of the CP0 timer as a clock event device relies on a succesful completion of c0_compare_interrupt. That never happens, because while waiting for a CP0 Compare interrupt to happen the function spins in a loop reading the CP0 Count register. This makes the CP0 Count erratum trigger reliably causing the interrupt waited for to be lost in all cases. As a result #4 resorts to using the CP0 timer as a clock source as well, just as #2. However we want to keep this separate arrangement in case (hope) c0_compare_interrupt is eventually rewritten such that it avoids the erratum. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5825/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2013-09-13 11:56:13 +02:00
Mark Brown	08b67faa23	blackfin: Ignore generated uImages We have the build infrastructure to generate uImages so we should ignore the resulting generated files. Signed-off-by: Mark Brown <broonie@linaro.org> Acked-by: Mike Frysinger <vapier@gentoo.org>	2013-09-13 10:42:39 +08:00
Sonic Zhang	1d899fd652	blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x. - Enable GMAC - Set propler DMA PBL - Disable DMA store and forward mode - Select PTP input clock from MII clock. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Steven Miao <realmz6@gmail.com>	2013-09-13 10:42:38 +08:00
Scott Jiang	e57860929c	bf609: adv7343: add S-Video and Component output support Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com>	2013-09-13 10:42:36 +08:00
Scott Jiang	4940c53d26	bf609: add adv7343 video encoder support Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com>	2013-09-13 10:42:34 +08:00
Steven Miao	3036dccf2c	clock: add stmmac clock for ethernet driver Signed-off-by: Steven Miao <realmz6@gmail.com>	2013-09-13 10:42:32 +08:00
Sonic Zhang	206f060c21	blackfin: scb: Add SCB1 to SCB9 config options and data. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>	2013-09-13 10:42:31 +08:00
Steven Miao	24a70cf2b2	blackfin: scb: Add system crossbar init code. If SCB exists in select blackfin cpu, developer can change the SCB priority in kernel configuration. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Steven Miao <realmz6@gmail.com>	2013-09-13 10:42:27 +08:00
Linus Torvalds	5a7d8a2808	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus Pull MIPS updates from Ralf Baechle: "This has been sitting in -next for a while with no objections and all MIPS defconfigs except one are building fine; that one platform got broken by another patch in your tree and I'm going to submit a patch separately. - a handful of fixes that didn't make 3.11 - a few bits of Octeon 3 support with more to come for a later release - platform enhancements for Octeon, ath79, Lantiq, Netlogic and Ralink SOCs - a GPIO driver for the Octeon - some dusting off of the DECstation code - the usual dose of cleanups" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (65 commits) MIPS: DMA: Fix BUG due to smp_processor_id() in preemptible code MIPS: kexec: Fix random crashes while loading crashkernel MIPS: kdump: Skip walking indirection page for crashkernels MIPS: DECstation HRT calibration bug fixes MIPS: Export copy_from_user_page() (needed by lustre) MIPS: Add driver for the built-in PCI controller of the RT3883 SoC MIPS: DMA: For BMIPS5000 cores flush region just like non-coherent R10000 MIPS: ralink: Add support for reset-controller API MIPS: ralink: mt7620: Add cpu-feature-override header MIPS: ralink: mt7620: Add spi clock definition MIPS: ralink: mt7620: Add wdt clock definition MIPS: ralink: mt7620: Improve clock frequency detection MIPS: ralink: mt7620: This SoC has EHCI and OHCI hosts MIPS: ralink: mt7620: Add verbose ram info MIPS: ralink: Probe clocksources from OF MIPS: ralink: Add support for systick timer found on newer ralink SoC MIPS: ralink: Add support for periodic timer irq MIPS: Netlogic: Built-in DTB for XLP2xx SoC boards MIPS: Netlogic: Add support for USB on XLP2xx MIPS: Netlogic: XLP2xx update for I2C controller ...	2013-09-12 16:14:49 -07:00
Linus Torvalds	e0ea4045bc	xfs: update #2 for v3.12-rc1 Here we have defrag support for v5 superblock, a number of bugfixes and a cleanup or two. - defrag support for CRC filesystems - fix endian worning in xlog_recover_get_buf_lsn - fixes for sparse warnings - fix for assert in xfs_dir3_leaf_hdr_from_disk - fix for log recovery of remote symlinks - fix for log recovery of btree root splits - fixes formemory allocation failures with ACLs - fix for assert in xfs_buf_item_relse - fix for assert in xfs_inode_buf_verify - fix an assignment in an assert that should be a test in xfs_bmbt_change_owner - remove dead code in xlog_recover_inode_pass2 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJSMjQUAAoJENaLyazVq6ZOu2IP/1OHZYy+Bkmj0tO9pdsdEa4s w4FEBPsQePMJPjwdN693rKpW1exZue5sUmPMErH3ENzc2DPAwpUAlc9XAIohtdFx rTqrz2q+qTfZTq8oYBIA/RCOifJ2cHWN8tDYZPJpp5wceV7CRGYQeR1foiudE3ZH QDIPXioy8P9IkfGaXCtr/iWf9kycMO2lgNTNfdL6qtwX99HCqHZanTlsWx1BIYGQ Fa5TaOsXis6idPMCFMuEC15iEwA+YXc0HmXuHkMFLj+9mwFc4h/Aq65bwUkYZLmy +T1Wo/uQ/21rl6im/rWqgCh6fFS8NJQp8NIJeCIyihUEHbarfPyJIJRJjoP457YO cv8OkixCkt4zX6CkTxaL5ZFEBW9FYbRb13Gg96J6hb4WfdAFMtQg7FAjThSU/+Qr HwjaAso3GXimEaZD1C3c0TtZEQ0x9E6pENVI7/ewB1I0p92p7GJBMq4C7CTAYThV 5zhdcOnViSrJTJvVQxm+gfOYzubkWWiVmbVku3RCO6//kvPBOvJ9juSYsl0mKeRu v2DZZB3AYJE/qnbYfZBlktX9obE6k+keKF6w8Eiufr2IqwJaqfaM4h9eogzAwTJA vyXKeLxUEmgHuqivFSZjw3sEK6sY654GCMMTP+2IpD19vlAIioYXdgp0ZbkkdiE3 6twrzdFZAr1zy80xlM8W =2Uq6 -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs Pull xfs update #2 from Ben Myers: "Here we have defrag support for v5 superblock, a number of bugfixes and a cleanup or two. - defrag support for CRC filesystems - fix endian worning in xlog_recover_get_buf_lsn - fixes for sparse warnings - fix for assert in xfs_dir3_leaf_hdr_from_disk - fix for log recovery of remote symlinks - fix for log recovery of btree root splits - fixes formemory allocation failures with ACLs - fix for assert in xfs_buf_item_relse - fix for assert in xfs_inode_buf_verify - fix an assignment in an assert that should be a test in xfs_bmbt_change_owner - remove dead code in xlog_recover_inode_pass2" * tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: remove dead code from xlog_recover_inode_pass2 xfs: = vs == typo in ASSERT() xfs: don't assert fail on bad inode numbers xfs: aborted buf items can be in the AIL. xfs: factor all the kmalloc-or-vmalloc fallback allocations xfs: fix memory allocation failures with ACLs xfs: ensure we copy buffer type in da btree root splits xfs: set remote symlink buffer type for recovery xfs: recovery of swap extents operations for CRC filesystems xfs: swap extents operations for CRC filesystems xfs: check magic numbers in dir3 leaf verifier first xfs: fix some minor sparse warnings xfs: fix endian warning in xlog_recover_get_buf_lsn()	2013-09-12 16:13:41 -07:00
Linus Torvalds	48efe453e6	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "Lots of activity again this round for I/O performance optimizations (per-cpu IDA pre-allocation for vhost + iscsi/target), and the addition of new fabric independent features to target-core (COMPARE_AND_WRITE + EXTENDED_COPY). The main highlights include: - Support for iscsi-target login multiplexing across individual network portals - Generic Per-cpu IDA logic (kent + akpm + clameter) - Conversion of vhost to use per-cpu IDA pre-allocation for descriptors, SGLs and userspace page pointer list - Conversion of iscsi-target + iser-target to use per-cpu IDA pre-allocation for descriptors - Add support for generic COMPARE_AND_WRITE (AtomicTestandSet) emulation for virtual backend drivers - Add support for generic EXTENDED_COPY (CopyOffload) emulation for virtual backend drivers. - Add support for fast memory registration mode to iser-target (Vu) The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of particular significance, which make us the first and only open source target to support the full set of VAAI primitives. Currently Linux clients are lacking upstream support to actually utilize these primitives. However, with server side support now in place for folks like MKP + ZAB working on the client, this logic once reserved for the highest end of storage arrays, can now be run in VMs on their laptops" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits) target/iscsi: Bump versions to v4.1.0 target: Update copyright ownership/year information to 2013 iscsi-target: Bump default TCP listen backlog to 256 target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out iscsi-target; Bump default CmdSN Depth to 64 iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set iscsi-target: Add thread_set->ts_activate_sem + use common deallocate iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE target: remove unused including <linux/version.h> iser-target: introduce fast memory registration mode (FRWR) iser-target: generalize rdma memory registration and cleanup iser-target: move rdma wr processing to a shared function target: Enable global EXTENDED_COPY setup/release target: Add Third Party Copy (3PC) bit in INQUIRY response target: Enable EXTENDED_COPY setup in spc_parse_cdb target: Add support for EXTENDED_COPY copy offload emulation target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check target: Add global device list for EXTENDED_COPY target: Make helpers non static for EXTENDED_COPY command setup target: Make spc_parse_naa_6h_vendor_specific non static ...	2013-09-12 16:11:45 -07:00
Linus Torvalds	ac4de9543a	Merge branch 'akpm' (patches from Andrew Morton) Merge more patches from Andrew Morton: "The rest of MM. Plus one misc cleanup" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits) mm/Kconfig: add MMU dependency for MIGRATION. kernel: replace strict_strto() with kstrto() mm, thp: count thp_fault_fallback anytime thp fault fails thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() thp: do_huge_pmd_anonymous_page() cleanup thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() mm: cleanup add_to_page_cache_locked() thp: account anon transparent huge pages into NR_ANON_PAGES truncate: drop 'oldsize' truncate_pagecache() parameter mm: make lru_add_drain_all() selective memcg: document cgroup dirty/writeback memory statistics memcg: add per cgroup writeback pages accounting memcg: check for proper lock held in mem_cgroup_update_page_stat memcg: remove MEMCG_NR_FILE_MAPPED memcg: reduce function dereference memcg: avoid overflow caused by PAGE_ALIGN memcg: rename RESOURCE_MAX to RES_COUNTER_MAX memcg: correct RESOURCE_MAX to ULLONG_MAX mm: memcg: do not trap chargers with full callstack on OOM mm: memcg: rework and document OOM waiting and wakeup ...	2013-09-12 15:44:27 -07:00
Chen Gang	de32a8177f	mm/Kconfig: add MMU dependency for MIGRATION. MIGRATION must depend on MMU, or allmodconfig for the nommu sh architecture fails to build: CC mm/migrate.o mm/migrate.c: In function 'remove_migration_pte': mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration] if (pmd_trans_huge(*pmd)) ^ mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration] if (!is_swap_pte(pte)) ^ ... Also let CMA depend on MMU, or when NOMMU, if we select CMA, it will select MIGRATION by force. Signed-off-by: Chen Gang <gang.chen@asianux.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
Jingoo Han	6072ddc852	kernel: replace strict_strto() with kstrto() The usage of strict_strto() is not preferred, because strict_strto() is obsolete. Thus, kstrto*() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
David Rientjes	17766dde36	mm, thp: count thp_fault_fallback anytime thp fault fails Currently, thp_fault_fallback in vmstat only gets incremented if a hugepage allocation fails. If current's memcg hits its limit or the page fault handler returns an error, it is incorrectly accounted as a successful thp_fault_alloc. Count thp_fault_fallback anytime the page fault handler falls back to using regular pages and only count thp_fault_alloc when a hugepage has actually been faulted. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
Kirill A. Shutemov	c02925540c	thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault() to handle fallback path. Let's consolidate code back by introducing VM_FAULT_FALLBACK return code. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
Kirill A. Shutemov	128ec037ba	thp: do_huge_pmd_anonymous_page() cleanup Minor cleanup: unindent most code of the fucntion by inverting one condition. It's preparation for the next patch. No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
Kirill A. Shutemov	3122359a64	thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() It's confusing that mk_huge_pmd() has semantics different from mk_pte() or mk_pmd(). I spent some time on debugging issue cased by this inconsistency. Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust prototype to match mk_pte(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
Kirill A. Shutemov	66a0c8ee3d	mm: cleanup add_to_page_cache_locked() Make add_to_page_cache_locked() cleaner: - unindent most code of the function by inverting one condition; - streamline code no-error path; - move insert error path outside normal code path; - call radix_tree_preload_end() earlier; No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
Kirill A. Shutemov	3cd14fcd3f	thp: account anon transparent huge pages into NR_ANON_PAGES We use NR_ANON_PAGES as base for reporting AnonPages to user. There's not much sense in not accounting transparent huge pages there, but add them on printing to user. Let's account transparent huge pages in NR_ANON_PAGES in the first place. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Ning Qu <quning@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:03 -07:00
Kirill A. Shutemov	7caef26767	truncate: drop 'oldsize' truncate_pagecache() parameter truncate_pagecache() doesn't care about old size since commit `cedabed49b` ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Chris Metcalf	5fbc461636	mm: make lru_add_drain_all() selective make lru_add_drain_all() only selectively interrupt the cpus that have per-cpu free pages that can be drained. This is important in nohz mode where calling mlockall(), for example, otherwise will interrupt every core unnecessarily. This is important on workloads where nohz cores are handling 10 Gb traffic in userspace. Those CPUs do not enter the kernel and place pages into LRU pagevecs and they really, really don't want to be interrupted, or they drop packets on the floor. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	9cb2dc1c95	memcg: document cgroup dirty/writeback memory statistics Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	3ea67d06e4	memcg: add per cgroup writeback pages accounting Add memcg routines to count writeback pages, later dirty pages will also be accounted. After Kame's commit `89c06bd52f` ("memcg: use new logic for page stat accounting"), we can use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg has a feature to move a page from a cgroup to another one and may have race between "move" and "page stat accounting". So in order to avoid the race we have designed a new lock: mem_cgroup_begin_update_page_stat() modify page information -->(a) mem_cgroup_update_page_stat() -->(b) mem_cgroup_end_update_page_stat() It requires both (a) and (b)(writeback pages accounting) to be pretected in mem_cgroup_{begin/end}_update_page_stat(). It's full no-op for !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu read lock in the most cases (no task is moving), and spin_lock_irqsave on top in the slow path. There're two writeback interfaces to modify: test_{clear/set}_page_writeback(). And the lock order is: --> memcg->move_lock --> mapping->tree_lock Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	658b72c5a7	memcg: check for proper lock held in mem_cgroup_update_page_stat We should call mem_cgroup_begin_update_page_stat() before mem_cgroup_update_page_stat() to get proper locks, however the latter doesn't do any checking that we use proper locking, which would be hard. Suggested by Michal Hock we could at least test for rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled(). Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	68b4876d99	memcg: remove MEMCG_NR_FILE_MAPPED While accounting memcg page stat, it's not worth to use MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the complexity and presumed performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	1a36e59d48	memcg: reduce function dereference This function dereferences res far too often, so optimize it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	3af3351676	memcg: avoid overflow caused by PAGE_ALIGN Since PAGE_ALIGN is aligning up(the next page boundary), so after PAGE_ALIGN, the value might be overflow, such as write the MAX value to *.limit_in_bytes. $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes bash: echo: write error: Invalid argument Some user programs might depend on such behaviours(like libcg, we read the value in snapshot, then use the value to reset cgroup later), and that will cause confusion. So we need to fix it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	6de5a8bfca	memcg: rename RESOURCE_MAX to RES_COUNTER_MAX RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Sha Zhengju	34ff8dc089	memcg: correct RESOURCE_MAX to ULLONG_MAX Current RESOURCE_MAX is ULONG_MAX, but the value we used to set resource limit is unsigned long long, so we can set bigger value than that which is strange. The XXX_MAX should be reasonable max value, bigger than that should be overflow. Notice that this change will affect user output of default .limit_in_bytes: before change: $ cat /cgroup/memory/memory.limit_in_bytes 9223372036854775807 after change: $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 But it doesn't alter the API in term of input - we can still use "echo -1 > .limit_in_bytes" to reset the numbers to "unlimited". Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Johannes Weiner	3812c8c8f3	mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Johannes Weiner	fb2a6fc56b	mm: memcg: rework and document OOM waiting and wakeup The memcg OOM handler open-codes a sleeping lock for OOM serialization (trylock, wait, repeat) because the required locking is so specific to memcg hierarchies. However, it would be nice if this construct would be clearly recognizable and not be as obfuscated as it is right now. Clean up as follows: 1. Remove the return value of mem_cgroup_oom_unlock() 2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock(). 3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This makes it more obvious that the task has to be on the waitqueue before attempting to OOM-trylock the hierarchy, to not miss any wakeups before going to sleep. It just didn't matter until now because it was all lumped together into the global memcg_oom_lock spinlock section. 4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope. It is proctected by the hierarchical OOM-lock. 5. The memcg_oom_lock spinlock is only required to propagate the OOM lock in any given hierarchy atomically. Restrict its scope to mem_cgroup_oom_(trylock\|unlock). 6. Do not wake up the waitqueue unconditionally at the end of the function. Only the lockholder has to wake up the next in line after releasing the lock. Note that the lockholder kicks off the OOM-killer, which in turn leads to wakeups from the uncharges of the exiting task. But a contender is not guaranteed to see them if it enters the OOM path after the OOM kills but before the lockholder releases the lock. Thus there has to be an explicit wakeup after releasing the lock. 7. Put the OOM task on the waitqueue before marking the hierarchy as under OOM as that is the point where we start to receive wakeups. No point in listening before being on the waitqueue. 8. Likewise, unmark the hierarchy before finishing the sleep, for symmetry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:01 -07:00
Johannes Weiner	519e52473e	mm: memcg: enable memcg OOM killer only for user faults System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:01 -07:00
Johannes Weiner	3a13c4d761	x86: finish user fault error path with fatal signal The x86 fault handler bails in the middle of error handling when the task has a fatal signal pending. For a subsequent patch this is a problem in OOM situations because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper per-task OOM state unwinding. Shortcutting the fault like this is a rather minor optimization that saves a few instructions in rare cases. Just remove it for user-triggered faults. Use the opportunity to split the fault retry handling from actual fault errors and add locking documentation that reads suprisingly similar to ARM's. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:01 -07:00
Johannes Weiner	759496ba64	arch: mm: pass userspace fault flag to generic fault handler Unlike global OOM handling, memory cgroup code will invoke the OOM killer in any OOM situation because it has no way of telling faults occuring in kernel context - which could be handled more gracefully - from user-triggered faults. Pass a flag that identifies faults originating in user space from the architecture-specific fault handlers to generic code so that memcg OOM handling can be improved. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:01 -07:00
Johannes Weiner	871341023c	arch: mm: do not invoke OOM killer on kernel fault OOM Kernel faults are expected to handle OOM conditions gracefully (gup, uaccess etc.), so they should never invoke the OOM killer. Reserve this for faults triggered in user context when it is the only option. Most architectures already do this, fix up the remaining few. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:01 -07:00

1 2 3 4 5 ...

399574 Commits