Commit Graph

704762 Commits

Author SHA1 Message Date
Bjorn Helgaas
0dd9636f97 Merge branch 'pci/host-exynos' into next
* pci/host-exynos:
  PCI: exynos: Fix platform_get_irq() error handling
2017-09-07 13:23:56 -05:00
Bjorn Helgaas
51386202a5 Merge branch 'pci/host-dra7xx' into next
* pci/host-dra7xx:
  PCI: dra7xx: Fix platform_get_irq() error handling
  PCI: dra7xx: Propagate platform_get_irq() errors in dra7xx_pcie_probe()
  PCI: dra7xx: Use PCI_NUM_INTX
2017-09-07 13:23:55 -05:00
Bjorn Helgaas
ee75520eb2 Merge branch 'pci/host-designware' into next
* pci/host-designware:
  PCI: dwc: Clear MSI interrupt status after it is handled, not before
  PCI: qcom: Allow ->post_init() to fail
  PCI: qcom: Don't unroll init if ->init() fails
  PCI: dwc: designware: Handle ->host_init() failures
  PCI: dwc: designware: Test PCIE_ATU_ENABLE bit specifically
  PCI: dwc: designware: Make dw_pcie_prog_*_atu_unroll() static
2017-09-07 13:23:55 -05:00
Bjorn Helgaas
199a0253e3 Merge branch 'pci/host-artpec6' into next
* pci/host-artpec6:
  PCI: artpec6: Fix platform_get_irq() error handling
2017-09-07 13:23:54 -05:00
Bjorn Helgaas
9627804be4 Merge branch 'pci/host-armada' into next
* pci/host-armada:
  PCI: armada8k: Fix platform_get_irq() error handling
  PCI: armada8k: Check the return value from clk_prepare_enable()
2017-09-07 13:23:53 -05:00
Bjorn Helgaas
a89d7e43e1 Merge branch 'pci/host-altera' into next
* pci/host-altera:
  PCI: altera: Fix platform_get_irq() error handling
  PCI: altera: Use size=4 IRQ domain for legacy INTx
  PCI: altera: Remove unused num_of_vectors variable
2017-09-07 13:23:53 -05:00
Bjorn Helgaas
741e2237be Merge branch 'pci/host-aardvark' into next
* pci/host-aardvark:
  PCI: aardvark: Use PCI_NUM_INTX
2017-09-07 13:23:52 -05:00
Bjorn Helgaas
37deba4525 Merge branch 'pci/irq-intx' into next
* pci/irq-intx:
  PCI: Add pci_irqd_intx_xlate()
  PCI: Move enum pci_interrupt_pin to linux/pci.h
2017-09-07 13:23:51 -05:00
Linus Torvalds
3ee31b89d9 xen: fixes and features for 4.14
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABAgAGBQJZsBbYAAoJELDendYovxMv4hoH/39psrSeHw2hPX78KJ6orq4v
 mTVEP2gLA/qxaM03EnFljXfd88J8NcJsxv7vVjh/U4xRwntvAMovCkygkkO1aw93
 nZEhUq6IGupr8KzmqQi5U7WtiWAXFwDbGSasnOKEj/lLa7E0/9MsYYQ01FS6oFkc
 c9CHONaCWepdz0Xpt7s6BKyzo74ZbJeCc5rUZU81oH40XphaZEoy8E9NOgDdfz3l
 VvPSaxZvebynT8JKDe4KxrMPpBjhr7mwgLcXk/Zy2EzOzxFSxXLsDAnwjtCW1gTh
 lPLD4TkgtziDfPfZXxFH3J34IUe1tZ2M+7Cz157FBu6BKX/g9ETQT24DXWDzFuI=
 =cgfV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.14b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

 - the new pvcalls backend for routing socket calls from a guest to dom0

 - some cleanups of Xen code

 - a fix for wrong usage of {get,put}_cpu()

* tag 'for-linus-4.14b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (27 commits)
  xen/mmu: set MMU_NORMAL_PT_UPDATE in remap_area_mfn_pte_fn
  xen: Don't try to call xen_alloc_p2m_entry() on autotranslating guests
  xen/events: events_fifo: Don't use {get,put}_cpu() in xen_evtchn_fifo_init()
  xen/pvcalls: use WARN_ON(1) instead of __WARN()
  xen: remove not used trace functions
  xen: remove unused function xen_set_domain_pte()
  xen: remove tests for pvh mode in pure pv paths
  xen-platform: constify pci_device_id.
  xen: cleanup xen.h
  xen: introduce a Kconfig option to enable the pvcalls backend
  xen/pvcalls: implement write
  xen/pvcalls: implement read
  xen/pvcalls: implement the ioworker functions
  xen/pvcalls: disconnect and module_exit
  xen/pvcalls: implement release command
  xen/pvcalls: implement poll command
  xen/pvcalls: implement accept command
  xen/pvcalls: implement listen command
  xen/pvcalls: implement bind command
  xen/pvcalls: implement connect command
  ...
2017-09-07 10:24:21 -07:00
Linus Torvalds
bac65d9d87 powerpc updates for 4.14
Nothing really major this release, despite quite a lot of activity. Just lots of
 things all over the place.
 
 Some things of note include:
 
  - Access via perf to a new type of PMU (IMC) on Power9, which can count both
    core events as well as nest unit events (Memory controller etc).
 
  - Optimisations to the radix MMU TLB flushing, mostly to avoid unnecessary Page
    Walk Cache (PWC) flushes when the structure of the tree is not changing.
 
  - Reworks/cleanups of do_page_fault() to modernise it and bring it closer to
    other architectures where possible.
 
  - Rework of our page table walking so that THP updates only need to send IPIs
    to CPUs where the affected mm has run, rather than all CPUs.
 
  - The size of our vmalloc area is increased to 56T on 64-bit hash MMU systems.
    This avoids problems with the percpu allocator on systems with very sparse
    NUMA layouts.
 
  - STRICT_KERNEL_RWX support on PPC32.
 
  - A new sched domain topology for Power9, to capture the fact that pairs of
    cores may share an L2 cache.
 
  - Power9 support for VAS, which is a new mechanism for accessing coprocessors,
    and initial support for using it with the NX compression accelerator.
 
  - Major work on the instruction emulation support, adding support for many new
    instructions, and reworking it so it can be used to implement the emulation
    needed to fixup alignment faults.
 
  - Support for guests under PowerVM to use the Power9 XIVE interrupt controller.
 
 And probably that many things again that are almost as interesting, but I had to
 keep the list short. Plus the usual fixes and cleanups as always.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Aneesh Kumar K.V, Anju
   T Sudhakar, Arvind Yadav, Balbir Singh, Benjamin Herrenschmidt, Bhumika Goyal,
   Breno Leitao, Bryant G. Ly, Christophe Leroy, Cédric Le Goater, Dan Carpenter,
   Dou Liyang, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand,
   Hannes Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall, LABBE
   Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring, Masahiro
   Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo, Nathan Fontenot,
   Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Rashmica
   Gupta, Rob Herring, Rui Teng, Sam Bobroff, Santosh Sivaraj, Scott Wood,
   Shilpasri G Bhat, Sukadev Bhattiprolu, Suraj Jitindar Singh, Tobin C. Harding,
   Victor Aoqui.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZr83SAAoJEFHr6jzI4aWA6pUP/3CEaj2bSxNzWIwidqyYjuoS
 O1moEsP0oYH7eBEWVHalYxvo0QPIIAhbFPaFyrOrgtfDH01Szwu9LcCALGb8orC5
 Hg3IY8mpNG3Q1T8wEtTa56Ik4b5ZFty35S5+X9qLNSFoDUqSvGlSsLzhPNN7f2tl
 XFm2hWqd8wXCwDsuVSFBCF61M3SAm+g6NMVNJ+VL2KIDCwBrOZLhKDPRoxLTAuMa
 jjSdjVIozWyXjUrBFi8HVcoOWLxcT1HsNF0tRs51LwY/+Mlj2jAtFtsx+a06HZa6
 f2p/Kcp/MEispSTk064Ap9cC1seXWI18zwZKpCUFqu0Ec2yTAiGdjOWDyYQldIp+
 ttVPSHQ01YrVKwDFTtM9CiA0EET6fVPhWgAPkPfvH5TvtKwGkNdy0b+nQLuWrYip
 BUmOXmjdIG3nujCzA9sv6/uNNhjhj2y+HWwuV7Qo002VFkhgZFL67u2SSUQLpYPj
 PxdkY8pPVq+O+in94oDV3c36dYFF6+g6A6505Vn6eKUm/TLpszRFGkS3bKKA5vtn
 74FR+guV/5RwYJcdZbfm04DgAocl7AfUDxpwRxibt6KtAK2VZKQuw4ugUTgYEd7W
 mL2+AMmPKuajWXAMTHjCZPbUp9gFNyYyBQTFfGVX/XLiM8erKBnGfoa1/KzUJkhr
 fVZLYIO/gzl34PiTIfgD
 =UJtt
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Nothing really major this release, despite quite a lot of activity.
  Just lots of things all over the place.

  Some things of note include:

   - Access via perf to a new type of PMU (IMC) on Power9, which can
     count both core events as well as nest unit events (Memory
     controller etc).

   - Optimisations to the radix MMU TLB flushing, mostly to avoid
     unnecessary Page Walk Cache (PWC) flushes when the structure of the
     tree is not changing.

   - Reworks/cleanups of do_page_fault() to modernise it and bring it
     closer to other architectures where possible.

   - Rework of our page table walking so that THP updates only need to
     send IPIs to CPUs where the affected mm has run, rather than all
     CPUs.

   - The size of our vmalloc area is increased to 56T on 64-bit hash MMU
     systems. This avoids problems with the percpu allocator on systems
     with very sparse NUMA layouts.

   - STRICT_KERNEL_RWX support on PPC32.

   - A new sched domain topology for Power9, to capture the fact that
     pairs of cores may share an L2 cache.

   - Power9 support for VAS, which is a new mechanism for accessing
     coprocessors, and initial support for using it with the NX
     compression accelerator.

   - Major work on the instruction emulation support, adding support for
     many new instructions, and reworking it so it can be used to
     implement the emulation needed to fixup alignment faults.

   - Support for guests under PowerVM to use the Power9 XIVE interrupt
     controller.

  And probably that many things again that are almost as interesting,
  but I had to keep the list short. Plus the usual fixes and cleanups as
  always.

  Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
  Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
  Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
  Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
  Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
  Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
  LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
  Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
  Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
  Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
  Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
  Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"

* tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
  powerpc/xive: Fix section __init warning
  powerpc: Fix kernel crash in emulation of vector loads and stores
  powerpc/xive: improve debugging macros
  powerpc/xive: add XIVE Exploitation Mode to CAS
  powerpc/xive: introduce H_INT_ESB hcall
  powerpc/xive: add the HW IRQ number under xive_irq_data
  powerpc/xive: introduce xive_esb_write()
  powerpc/xive: rename xive_poke_esb() in xive_esb_read()
  powerpc/xive: guest exploitation of the XIVE interrupt controller
  powerpc/xive: introduce a common routine xive_queue_page_alloc()
  powerpc/sstep: Avoid used uninitialized error
  axonram: Return directly after a failed kzalloc() in axon_ram_probe()
  axonram: Improve a size determination in axon_ram_probe()
  axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
  powerpc/powernv/npu: Move tlb flush before launching ATSD
  powerpc/macintosh: constify wf_sensor_ops structures
  powerpc/iommu: Use permission-specific DEVICE_ATTR variants
  powerpc/eeh: Delete an error out of memory message at init time
  powerpc/mm: Use seq_putc() in two functions
  macintosh: Convert to using %pOF instead of full_name
  ...
2017-09-07 10:15:40 -07:00
Linus Torvalds
f92e3da18b Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Transparently fall back to other poweroff method(s) if EFI poweroff
     fails (and returns)

   - Use separate PE/COFF section headers for the RX and RW parts of the
     ARM stub loader so that the firmware can use strict mapping
     permissions

   - Add support for requesting the firmware to wipe RAM at warm reboot

   - Increase the size of the random seed obtained from UEFI so CRNG
     fast init can complete earlier

   - Update the EFI framebuffer address if it points to a BAR that gets
     moved by the PCI resource allocation code

   - Enable "reset attack mitigation" of TPM environments: this is
     enabled if the kernel is configured with
     CONFIG_RESET_ATTACK_MITIGATION=y.

   - Clang related fixes

   - Misc cleanups, constification, refactoring, etc"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/bgrt: Use efi_mem_type()
  efi: Move efi_mem_type() to common code
  efi/reboot: Make function pointer orig_pm_power_off static
  efi/random: Increase size of firmware supplied randomness
  efi/libstub: Enable reset attack mitigation
  firmware/efi/esrt: Constify attribute_group structures
  firmware/efi: Constify attribute_group structures
  firmware/dcdbas: Constify attribute_group structures
  arm/efi: Split zImage code and data into separate PE/COFF sections
  arm/efi: Replace open coded constants with symbolic ones
  arm/efi: Remove pointless dummy .reloc section
  arm/efi: Remove forbidden values from the PE/COFF header
  drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it
  efi/reboot: Fall back to original power-off method if EFI_RESET_SHUTDOWN returns
  efi/arm/arm64: Add missing assignment of efi.config_table
  efi/libstub/arm64: Set -fpie when building the EFI stub
  efi/libstub/arm64: Force 'hidden' visibility for section markers
  efi/libstub/arm64: Use hidden attribute for struct screen_info reference
  efi/arm: Don't mark ACPI reclaim memory as MEMBLOCK_NOMAP
2017-09-07 09:42:35 -07:00
David S. Miller
0f2be423f1 Back from a long absence, so we have a number of things:
* a remain-on-channel fix from Avi
  * hwsim TX power fix from Beni
  * null-PTR dereference with iTXQ in some rare configurations (Chunho)
  * 40 MHz custom regdomain fixes (Emmanuel)
  * look at right place in HT/VHT capability parsing (Igor)
  * complete A-MPDU teardown properly (Ilan)
  * Mesh ID Element ordering fix (Liad)
  * avoid tracing warning in ht_dbg() (Sharon)
  * fix print of assoc/reassoc (Simon)
  * fix encrypted VLAN with iTXQ (myself)
  * fix calling context of TX queue wake (myself)
  * fix a deadlock with ath10k aggregation (myself)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlmw78wACgkQa3t4Rpy0
 AB1viw/+K2xrwzsKqrNoNM1sV4bPItUTjay64dPVD5CjJ/pAwou6HCu0gCJCh4kt
 mXhLWHds7Q4sBY+DlN9eIagQLJUaw897FWV+tHHirDGKMsE4tBaIct7PLBpM7r5O
 H03T5qT9+nDGRAJq6ucLG8v91cTAlBNfEIV73Au9Oi5B0Rq4cs+Tz8xS24EHjfTB
 zRcLMaE8qoQjIfrwQsYNQBdvYHY5G+Ui5sbPh3HPLDPzAfKAsc75nbikI2QE//s0
 cMv5ro39vy0DGyQmdTqNzzzuWWzYvhUD7EiIr7Dfm9ilhljCiVqZg6y7ZVMB/QNq
 +HRD7ShbTnNMx1fx8w5WO6gKGVSeo0Ga6KKEauTGiWJQTfZQLuIBLylSMVclfvBN
 4zOv3vC9EUP5qqPt0cby7VV2D+1Z4Lw2GYZZKHF5numMkgHAoDJ+tJHbBFmz1CEX
 co/79RFhGLKvZE+8lN40hqvPoYA5NOUO6jyOq384ZbnC190nVqOXvIxi9jmFKBHp
 rGBE/8e0VPYlc48m6NUFwAvc0HOeN3/ZVaUnoo6SY8fCbru3yhRYzC3pmcepTEbA
 OVBHirgYtntI2mk4FWd2dkTC6aOfP1o11dwm3deaaEtkwaiKlxI2xfnkbsGaMaOh
 RW787Y10g0k785ABD/GxynOeqfiXnIxIjMKZiQliR33zxdv4cAI=
 =QYS4
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Back from a long absence, so we have a number of things:
 * a remain-on-channel fix from Avi
 * hwsim TX power fix from Beni
 * null-PTR dereference with iTXQ in some rare configurations (Chunho)
 * 40 MHz custom regdomain fixes (Emmanuel)
 * look at right place in HT/VHT capability parsing (Igor)
 * complete A-MPDU teardown properly (Ilan)
 * Mesh ID Element ordering fix (Liad)
 * avoid tracing warning in ht_dbg() (Sharon)
 * fix print of assoc/reassoc (Simon)
 * fix encrypted VLAN with iTXQ (myself)
 * fix calling context of TX queue wake (myself)
 * fix a deadlock with ath10k aggregation (myself)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-07 09:40:58 -07:00
Luca Coelho
2eabc84d2f iwlwifi: mvm: only send LEDS_CMD when the FW supports it
The LEDS_CMD command is only supported in some newer FW versions
(e.g. iwlwifi-8000C-31.ucode), so we can't send it to older versions
(such as iwlwifi-8000C-27.ucode).

To fix this, check for a new bit in the FW capabilities TLV that tells
when the command is supported.

Note that the current version of -31.ucode in linux-firmware.git
(31.532993.0) does not have this capability bit set, so the LED won't
work, even though this version should support it.  But we will update
this firmware soon, so it won't be a problem anymore.

Fixes: 7089ae634c ("iwlwifi: mvm: use firmware LED command where applicable")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-09-07 19:40:09 +03:00
Radim Krčmář
78809a6849 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
KVM/PPC update for 4.14

There are various minor fixes and cleanups.  The only new feature is
that we now export information about storage key support to userspace,
so it can advertise it to the guest.

I have pulled in Michael Ellerman's topic/ppc-kvm branch from the
powerpc tree to get a couple of fixes that touch both KVM PPC code and
other PPC code.  That's why there is some arch/powerpc stuff in the
diffstat that isn't arch/powerpc/kvm.
2017-09-07 18:29:01 +02:00
Mauricio Faria de Oliveira
2a8a98673c fs: aio: fix the increment of aio-nr and counting against aio-max-nr
Currently, aio-nr is incremented in steps of 'num_possible_cpus() * 8'
for io_setup(nr_events, ..) with 'nr_events < num_possible_cpus() * 4':

    ioctx_alloc()
    ...
        nr_events = max(nr_events, num_possible_cpus() * 4);
        nr_events *= 2;
    ...
        ctx->max_reqs = nr_events;
    ...
        aio_nr += ctx->max_reqs;
    ....

This limits the number of aio contexts actually available to much less
than aio-max-nr, and is increasingly worse with greater number of CPUs.

For example, with 64 CPUs, only 256 aio contexts are actually available
(with aio-max-nr = 65536) because the increment is 512 in that scenario.

Note: 65536 [max aio contexts] / (64*4*2) [increment per aio context]
is 128, but make it 256 (double) as counting against 'aio-max-nr * 2':

    ioctx_alloc()
    ...
        if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
        ...
            goto err_ctx;
    ...

This patch uses the original value of nr_events (from userspace) to
increment aio-nr and count against aio-max-nr, which resolves those.

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reported-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Paul Nguyen <nguyenp@us.ibm.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2017-09-07 12:28:28 -04:00
Linus Torvalds
57e88b43b8 Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
 "The main changes include various Hyper-V optimizations such as faster
  hypercalls and faster/better TLB flushes - and there's also some
  Intel-MID cleanups"

* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others()
  x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls
  x86/platform/intel-mid: Make several arrays static, to make code smaller
  MAINTAINERS: Add missed file for Hyper-V
  x86/hyper-v: Use hypercall for remote TLB flush
  hyper-v: Globalize vp_index
  x86/hyper-v: Implement rep hypercalls
  hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT
  x86/hyper-v: Introduce fast hypercall implementation
  x86/hyper-v: Make hv_do_hypercall() inline
  x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set
  x86/platform/intel-mid: Make 'bt_sfi_data' const
  x86/platform/intel-mid: Make IRQ allocation a bit more flexible
  x86/platform/intel-mid: Group timers callbacks together
2017-09-07 09:25:15 -07:00
Radim Krčmář
082d3900a4 KVM/ARM Changes for v4.14
Two minor cleanups and improvements, a fix for decoding external abort
 types from guests, and added support for migrating the active priority
 of interrupts when running a GICv2 guest on a GICv3 host.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJZrsRdAAoJEEtpOizt6ddyoDIH/i94Wc5Iw9mpzi99NXyDGxYb
 2kEGg8uRZFuAwK49+aNVEqL7pPymlV3qdl1fdsmWM9ch9D8WOiezx5JgSd3+uriF
 x2Cc6kxZQJ4/UT3laAZArVcGnQqDHdOrJIsOZ9f+UFRiFazlMVMgAJX7yQwHYiBQ
 QM/z5BbLSbh+NHAv15utz85JjR8lFL7AIzPemHR6Zh7cr3XmeW3nadaYMuyqjFsZ
 aYD4S+y/eRw6HkImV/DaaLv/tDrOEU2cQOqi19TvfYpx/XW4uLQA9oxS9Qd9ADZw
 T8Ha5jGEthy+3mzwEmGFwTVGpbvt3TYavZ743lp230PODYOkfO+ETbIlstuYNb4=
 =dIG3
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm

KVM/ARM Changes for v4.14

Two minor cleanups and improvements, a fix for decoding external abort
types from guests, and added support for migrating the active priority
of interrupts when running a GICv2 guest on a GICv3 host.
2017-09-07 18:22:04 +02:00
Radim Krčmář
6e0ff1b4db KVM: s390: Fixes and features for 4.14
- merge of topic branch tlb-flushing from the s390 tree to get the
   no-dat base features
 - merge of kvm/master to avoid conflicts with additional sthyi fixes
 - wire up the no-dat enhancements in KVM
 - multiple epoch facility (z14 feature)
 - Configuration z/Architecture Mode
 - more sthyi fixes
 - gdb server range checking fix
 - small code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJZqQt+AAoJEBF7vIC1phx82jcP/0PgDUFpyaq50R1LhoLoXfsv
 55x8hZFJzkBTapv6ckb9MP1J90htbmYXUDbgk/JS7ElenLvLw6r9QenHJnLYRoJ4
 s4ApooG5/wpbQWdxfqZaNHZEV1/Fgvv+R348k8cFfhrzJdOd0p7I+qkd5FOX2VTl
 NGk+cR03yhtPOYeUb90k40k0kGmY7acdnYe2txUoh7qC38vPVwMJA6S01mWbB3sm
 ApQN9hzGvXhT833dVTNMS33uCds/9nCHtMn0viqTfs9AWKIs62nK4L54c5dgZWpH
 u7+ji5OUEbV+8QQH5G1OmWUIo3sKO5MVS0P9jbh0bIIdUZvUKdchFP8bQcmde02E
 IgYgjbgmKc8Zec+Z0PHs5Gy+VbGXpqg0nvwyzK96fh/KM7aepxHUfDOdWWj8qbgI
 4sc2fe5JFkP5IldYgJlMPSm4GqLNDkbGFnLralkaqhRvjS1sNjgXxcCclLQQaI5H
 kMklzkNpMQ9yC7pYLjq7AKOzmMdHqq90rAIq0zQSYQjMoiC2RLYFyjFwsK3KlmeJ
 GRgYrIZOVkYONZhT4F81piGcuRsbCGSXZ5I5i/QxJ/EgNGO8EcGrrONAk1NW9Dsp
 q9qu/aKXOzGe57jjP+MonH4bcrHLCmcpkxS9sgiECu0PtutzbGdZPUXA9Pj9YMON
 7kQXrwPgEHhn62+l/JdO
 =N/6G
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux

KVM: s390: Fixes and features for 4.14

- merge of topic branch tlb-flushing from the s390 tree to get the
  no-dat base features
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
2017-09-07 16:46:46 +02:00
Bjorn Helgaas
fca4848bbc PCI: xgene: Clean up whitespace
Use tabs (not spaces) for indentation.  No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-09-07 08:52:43 -05:00
Bjorn Helgaas
582ffae852 PCI: xgene: Define XGENE_PCI_EXP_CAP and use generic PCI_EXP_RTCTL offset
Apparently the PCIe capability is at address 0x40 in config space of X-Gene
v1 Root Ports.  Add a definition of that and use the generic PCI_EXP_RTCTL
offset into the capability.  No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-09-07 08:52:43 -05:00
Fabio Estevam
c7aca96aa4 PCI: xgene: Fix platform_get_irq() error handling
When platform_get_irq() fails we should propagate the real error value
instead of always returning -EINVAL.

Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Duc Dang <dhdang@apm.com>
2017-09-07 08:52:42 -05:00
Larry Finger
6d62269283 rtlwifi: btcoexist: Fix antenna selection code
In commit 87d8a9f352 ("rtlwifi: btcoex: call bind to setup btcoex"),
the code turns on a call to exhalbtc_bind_bt_coex_withadapter(). This
routine contains a bug that causes incorrect antenna selection for those
HP laptops with only one antenna and an incorrectly programmed EFUSE.
These boxes are the ones that need the ant_sel module parameter.

Fixes: 87d8a9f352 ("rtlwifi: btcoex: call bind to setup btcoex")
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Ping-Ke Shih <pkshih@realtek.com>
Cc: Yan-Hsuan Chuang <yhchuang@realtek.com>
Cc: Birming Chiu <birming@realtek.com>
Cc: Shaofu <shaofu@realtek.com>
Cc: Steven Ting <steventing@realtek.com>
Cc: Stable <stable@vger.kernel.org> # 4.13+
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-09-07 15:55:56 +03:00
Larry Finger
a33fcba6ec rtlwifi: btcoexist: Fix breakage of ant_sel for rtl8723be
In commit bcd37f4a08 ("rtlwifi: btcoex: 23b 2ant: let bt transmit when
hw initialisation done"), there is an additional error when the module
parameter ant_sel is used to select the auxilary antenna. The error is
that the antenna selection is not checked when writing the antenna
selection register.

Fixes: bcd37f4a08 ("rtlwifi: btcoex: 23b 2ant: let bt transmit when hw initialisation done")
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Ping-Ke Shih <pkshih@realtek.com>
Cc: Yan-Hsuan Chuang <yhchuang@realtek.com>
Cc: Birming Chiu <birming@realtek.com>
Cc: Shaofu <shaofu@realtek.com>
Cc: Steven Ting <steventing@realtek.com>
Cc: Stable <stable@vger.kernel.org> # 4.12+
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-09-07 15:55:56 +03:00
Linus Torvalds
3b9f8ed25d Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:
 "Except for the ahci fix that fixes a boot issue, nothing major in this
  pull request. Some new platform controller support and device specific
  changes"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  libata: zpodd: make arrays cdb static, reduces object code size
  ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme
  dt-bindings: ata: add DT bindings for MediaTek SATA controller
  ata: mediatek: add support for MediaTek SATA controller
  pata_octeon_cf: use of_property_read_{bool|u32}()
  cs5536: add support for IDE controller variant
  ata: sata_gemini: Introduce explicit IDE pin control
  ata: sata_gemini: Retire custom pin control
  ata: ahci_platform: Add shutdown handler
  ata: sata_gemini: explicitly request exclusive reset control
  ata: Drop unnecessary static
  ata: Convert to using %pOF instead of full_name
2017-09-06 22:41:21 -07:00
Linus Torvalds
608c1d3c17 Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Several notable changes this cycle:

   - Thread mode was merged. This will be used for cgroup2 support for
     CPU and possibly other controllers. Unfortunately, CPU controller
     cgroup2 support didn't make this pull request but most contentions
     have been resolved and the support is likely to be merged before
     the next merge window.

   - cgroup.stat now shows the number of descendant cgroups.

   - cpuset now can enable the easier-to-configure v2 behavior on v1
     hierarchy"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: Allow v2 behavior in v1 cgroup
  cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
  cgroup: remove unneeded checks
  cgroup: misc changes
  cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
  cgroup: re-use the parent pointer in cgroup_destroy_locked()
  cgroup: add cgroup.stat interface with basic hierarchy stats
  cgroup: implement hierarchy limits
  cgroup: keep track of number of descent cgroups
  cgroup: add comment to cgroup_enable_threaded()
  cgroup: remove unnecessary empty check when enabling threaded mode
  cgroup: update debug controller to print out thread mode information
  cgroup: implement cgroup v2 thread support
  cgroup: implement CSS_TASK_ITER_THREADED
  cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
  cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
  cgroup: reorganize cgroup.procs / task write path
  cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
  cgroup: distinguish local and children populated states
  cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
  ...
2017-09-06 22:25:25 -07:00
Linus Torvalds
9954d4892a Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Nothing major. I introduced a flag collsion bug during v4.13 cycle
  which is fixed in this pull request. Fortunately, the flag is for
  debugging / verification and the bug isn't critical"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix flag collision
  workqueue: Use TASK_IDLE
  workqueue: fix path to documentation
  workqueue: doc change for ST behavior on NUMA systems
2017-09-06 21:59:31 -07:00
Linus Torvalds
a7cbfd05f4 Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "A lot of changes for percpu this time around. percpu inherited the
  same area allocator from the original pre-virtual-address-mapped
  implementation. This was from the time when percpu allocator wasn't
  used all that much and the implementation was focused on simplicity,
  with the unfortunate computational complexity of O(number of areas
  allocated from the chunk) per alloc / free.

  With the increase in percpu usage, we're hitting cases where the lack
  of scalability is hurting. The most prominent one right now is bpf
  perpcu map creation / destruction which may allocate and free a lot of
  entries consecutively and it's likely that the problem will become
  more prominent in the future.

  To address the issue, Dennis replaced the area allocator with hinted
  bitmap allocator which is more consistent. While the new allocator
  does perform a bit worse in some cases, it outperforms the old
  allocator way more than an order of magnitude in other more common
  scenarios while staying mostly flat in CPU overhead and completely
  flat in memory consumption"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (27 commits)
  percpu: update header to contain bitmap allocator explanation.
  percpu: update pcpu_find_block_fit to use an iterator
  percpu: use metadata blocks to update the chunk contig hint
  percpu: update free path to take advantage of contig hints
  percpu: update alloc path to only scan if contig hints are broken
  percpu: keep track of the best offset for contig hints
  percpu: skip chunks if the alloc does not fit in the contig hint
  percpu: add first_bit to keep track of the first free in the bitmap
  percpu: introduce bitmap metadata blocks
  percpu: replace area map allocator with bitmap
  percpu: generalize bitmap (un)populated iterators
  percpu: increase minimum percpu allocation size and align first regions
  percpu: introduce nr_empty_pop_pages to help empty page accounting
  percpu: change the number of pages marked in the first_chunk pop bitmap
  percpu: combine percpu address checks
  percpu: modify base_addr to be region specific
  percpu: setup_first_chunk rename schunk/dchunk to chunk
  percpu: end chunk area maps page aligned for the populated bitmap
  percpu: unify allocation of schunk and dchunk
  percpu: setup_first_chunk remove dyn_size and consolidate logic
  ...
2017-09-06 21:33:12 -07:00
Kleber Sacilotto de Souza
8e0deed924 tipc: remove unnecessary call to dev_net()
The net device is already stored in the 'net' variable, so no need to call
dev_net() again.

Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:25:52 -07:00
Xin Long
f773608026 netlink: access nlk groups safely in netlink bind and getname
Now there is no lock protecting nlk ngroups/groups' accessing in
netlink bind and getname. It's safe from nlk groups' setting in
netlink_release, but not from netlink_realloc_groups called by
netlink_setsockopt.

netlink_lock_table is needed in both netlink bind and getname when
accessing nlk groups.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:22:54 -07:00
Xin Long
be82485fbc netlink: fix an use-after-free issue for nlk groups
ChunYu found a netlink use-after-free issue by syzkaller:

[28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr ffff8807185e2378
[28448.969918] Call Trace:
[...]
[28449.117207]  __nla_put+0x37/0x40
[28449.132027]  nla_put+0xf5/0x130
[28449.146261]  sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
[28449.176608]  __netlink_diag_dump+0x25a/0x700 [netlink_diag]
[28449.202215]  netlink_diag_dump+0x176/0x240 [netlink_diag]
[28449.226834]  netlink_dump+0x488/0xbb0
[28449.298014]  __netlink_dump_start+0x4e8/0x760
[28449.317924]  netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
[28449.413414]  sock_diag_rcv_msg+0x207/0x390
[28449.432409]  netlink_rcv_skb+0x149/0x380
[28449.467647]  sock_diag_rcv+0x2d/0x40
[28449.484362]  netlink_unicast+0x562/0x7b0
[28449.564790]  netlink_sendmsg+0xaa8/0xe60
[28449.661510]  sock_sendmsg+0xcf/0x110
[28449.865631]  __sys_sendmsg+0xf3/0x240
[28450.000964]  SyS_sendmsg+0x32/0x50
[28450.016969]  do_syscall_64+0x25c/0x6c0
[28450.154439]  entry_SYSCALL64_slow_path+0x25/0x25

It was caused by no protection between nlk groups' free in netlink_release
and nlk groups' accessing in sk_diag_dump_groups. The similar issue also
exists in netlink_seq_show().

This patch is to defer nlk groups' free in deferred_put_nlk_sk.

Reported-by: ChunYu Wang <chunwang@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:22:53 -07:00
Gao Feng
39ad1297a2 sched: Use __qdisc_drop instead of kfree_skb in sch_prio and sch_qfq
The commit 520ac30f45 ("net_sched: drop packets after root qdisc lock
is released) made a big change of tc for performance. There are two points
left in sch_prio and sch_qfq which are not changed with that commit. Now
enhance them now with __qdisc_drop.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:20:07 -07:00
Baruch Siach
9a94b3a4bd dt-binding: phy: don't confuse with Ethernet phy properties
The generic PHY 'phys' property sometime appears in the same node with
the Ethernet PHY 'phy' or 'phy-handle' properties. Add a warning in
phy-bindings.txt to reduce confusion.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:13:54 -07:00
Linus Torvalds
d34fc1adf0 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - DAX updates

 - OCFS2

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
  mm,fork: introduce MADV_WIPEONFORK
  x86,mpx: make mpx depend on x86-64 to free up VMA flag
  mm: add /proc/pid/smaps_rollup
  mm: hugetlb: clear target sub-page last when clearing huge page
  mm: oom: let oom_reap_task and exit_mmap run concurrently
  swap: choose swap device according to numa node
  mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  z3fold: use per-cpu unbuddied lists
  mm, swap: don't use VMA based swap readahead if HDD is used as swap
  mm, swap: add sysfs interface for VMA based swap readahead
  mm, swap: VMA based swap readahead
  mm, swap: fix swap readahead marking
  mm, swap: add swap readahead hit statistics
  mm/vmalloc.c: don't reinvent the wheel but use existing llist API
  mm/vmstat.c: fix wrong comment
  selftests/memfd: add memfd_create hugetlbfs selftest
  mm/shmem: add hugetlbfs support to memfd_create()
  mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
  mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
  ...
2017-09-06 20:49:49 -07:00
Andy Lutomirski
1c9fe4409c x86/mm: Document how CR4.PCIDE restore works
While debugging a problem, I thought that using
cr4_set_bits_and_update_boot() to restore CR4.PCIDE would be
helpful.  It turns out to be counterproductive.

Add a comment documenting how this works.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 20:12:57 -07:00
Andy Lutomirski
72c0098d92 x86/mm: Reinitialize TLB state on hotplug and resume
When Linux brings a CPU down and back up, it switches to init_mm and then
loads swapper_pg_dir into CR3.  With PCID enabled, this has the side effect
of masking off the ASID bits in CR3.

This can result in some confusion in the TLB handling code.  If we
bring a CPU down and back up with any ASID other than 0, we end up
with the wrong ASID active on the CPU after resume.  This could
cause our internal state to become corrupt, although major
corruption is unlikely because init_mm doesn't have any user pages.
More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
in the next context switch.  The result of *that* is a failure to
resume from suspend with probability 1 - 1/6^(cpus-1).

Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Jiri Kosina <jikos@kernel.org>
Fixes: 10af6235e0 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 20:12:57 -07:00
Baohong Liu
170b3b1050 tracing: Apply trace_clock changes to instance max buffer
Currently trace_clock timestamps are applied to both regular and max
buffers only for global trace. For instance trace, trace_clock
timestamps are applied only to regular buffer. But, regular and max
buffers can be swapped, for example, following a snapshot. So, for
instance trace, bad timestamps can be seen following a snapshot.
Let's apply trace_clock timestamps to instance max buffer as well.

Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers")
Signed-off-by: Baohong Liu <baohong.liu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-06 20:52:20 -04:00
Rik van Riel
d2cd9ede6e mm,fork: introduce MADV_WIPEONFORK
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
in the child process after fork.  This differs from MADV_DONTFORK in one
important way.

If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes.  The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.

MADV_WIPEONFORK only works on private, anonymous VMAs.

The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.

Examples of this would be:
 - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
   check, which is too slow without a PID cache)
 - PKCS#11 API reinitialization check (mandated by specification)
 - glibc's upcoming PRNG (reseed after fork)
 - OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious.  However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.

A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.

It would be better to have the kernel take care of this automatically.

The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

[akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Rik van Riel
df3735c5b4 x86,mpx: make mpx depend on x86-64 to free up VMA flag
Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.

If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes.  The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.

Examples of this would be:
 - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
   check, which is too slow without a PID cache)
 - PKCS#11 API reinitialization check (mandated by specification)
 - glibc's upcoming PRNG (reseed after fork)
 - OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious.  However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.

A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.

It would be better to have the kernel take care of this automatically.

The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

This patch (of 2):

MPX only seems to be available on 64 bit CPUs, starting with Skylake and
Goldmont.  Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
order to free up a VMA flag.

Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Colm MacCártaigh <colm@allcosts.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Daniel Colascione
493b0e9d94 mm: add /proc/pid/smaps_rollup
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.

Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes.  This sampling process involves
opening /proc/pid/smaps and summing certain fields.  For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.

smaps_rollup improves the situation.  It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process.  In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all of
the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.

By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings.  For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.

One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status.  We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.

The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA.  In this way, summation happens automatically.  We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.

Benchmarks:

    using smaps:
    iterations:1000 pid:1163 pss:220023808
    0m29.46s real 0m08.28s user 0m20.98s system

    using smaps_rollup:
    iterations:1000 pid:1163 pss:220702720
    0m04.39s real 0m00.03s user 0m04.31s system

We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows.  Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and and
devices become more efficiency-conscious, PSS-collection inefficiency
has started to matter more.  IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.

Tim said:

: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
:    application via the settings app.  This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications.  We have found
:    an enormous number of bugs (in the Android platform, in Google's own
:    apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that.  The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend).  PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it.  We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent.  Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process.  The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release.  After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process.  Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward.  I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).

Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Huang Ying
c79b57e462 mm: hugetlb: clear target sub-page last when clearing huge page
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue.  For example, when
clearing huge page on x86_64 platform, the cache footprint is 2M.  But
on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
LLC (last level cache).  That is, in average, there are 2.5M LLC for
each core and 1.25M LLC for each thread.

If the cache pressure is heavy when clearing the huge page, and we clear
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing clearing the
end of the huge page.  And it is possible for the application to access
the begin of the huge page after clearing the huge page.

To help the above situation, in this patch, when we clear a huge page,
the order to clear sub-pages is changed.  In quite some situation, we
can get the address that the application will access after we clear the
huge page, for example, in a page fault handler.  Instead of clearing
the huge page from begin to end, we will clear the sub-pages farthest
from the the sub-page to access firstly, and clear the sub-page to
access last.  This will make the sub-page to access most cache-hot and
sub-pages around it more cache-hot too.  If we cannot know the address
the application will access, the begin of the huge page is assumed to be
the the address the application will access.

With this patch, the throughput increases ~28.3% in vm-scalability
anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads).  The test case creates 72 processes, each
process mmap a big anonymous memory area and writes to it from the begin
to the end.  For each process, other processes could be seen as other
workload which generates heavy cache pressure.  At the same time, the
cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per
cycle) increased from 0.56 to 0.74, and the time spent in user space is
reduced ~7.9%

Christopher Lameter suggests to clear bytes inside a sub-page from end
to begin too.  But tests show no visible performance difference in the
tests.  May because the size of page is small compared with the cache
size.

Thanks Andi Kleen to propose to use address to access to determine the
order of sub-pages to clear.

The hugetlbfs access address could be improved, will do that in another
patch.

[ying.huang@intel.com: improve readability of clear_huge_page()]
  Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Andrea Arcangeli
2129258024 mm: oom: let oom_reap_task and exit_mmap run concurrently
This is purely required because exit_aio() may block and exit_mmap() may
never start, if the oom_reap_task cannot start running on a mm with
mm_users == 0.

At the same time if the OOM reaper doesn't wait at all for the memory of
the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
generate a spurious OOM kill.

If it wasn't because of the exit_aio or similar blocking functions in
the last mmput, it would be enough to change the oom_reap_task() in the
case it finds mm_users == 0, to wait for a timeout or to wait for
__mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
problem here so the concurrency of exit_mmap and oom_reap_task is
apparently warranted.

It's a non standard runtime, exit_mmap() runs without mmap_sem, and
oom_reap_task runs with the mmap_sem for reading as usual (kind of
MADV_DONTNEED).

The race between the two is solved with a combination of
tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
(serialized by a dummy down_write/up_write cycle on the same lines of
the ksm_exit method).

If the oom_reap_task() may be running concurrently during exit_mmap,
exit_mmap will wait it to finish in down_write (before taking down mm
structures that would make the oom_reap_task fail with use after free).

If exit_mmap comes first, oom_reap_task() will skip the mm if
MMF_OOM_SKIP is already set and in turn all memory is already freed and
furthermore the mm data structures may already have been taken down by
free_pgtables.

[aarcange@redhat.com: incremental one liner]
  Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
[rientjes@google.com: remove unused mmput_async]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
[aarcange@redhat.com: microoptimization]
  Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
Fixes: 26db62f179 ("oom: keep mm of the killed task available")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: David Rientjes <rientjes@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Aaron Lu
a2468cc9bf swap: choose swap device according to numa node
If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in get_swap_pages() to get better performance.

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
priority, they are used round robin.  This patch changes the previous
single global swap_avail_list into a per-numa-node list, i.e.  for each
numa node, it sees its own priority based list of available swap
devices.  Swap device's priority can be promoted on its matching node's
swap_avail_list.

The current swap device's priority is set as: user can set a >=0 value,
or the system will pick one starting from -1 then downwards.  The
priority value in the swap_avail_list is the negated value of the swap
device's due to plist being sorted from low to high.  The new policy
doesn't change the semantics for priority >=0 cases, the previous
starting from -1 then downwards now becomes starting from -2 then
downwards and -1 is reserved as the promoted value.

Take 4-node EX machine as an example, suppose 4 swap devices are
available, each sit on a different node:
swapA on node 0
swapB on node 1
swapC on node 2
swapD on node 3

After they are all swapped on in the sequence of ABCD.

Current behaviour:
their priorities will be:
swapA: -1
swapB: -2
swapC: -3
swapD: -4
And their position in the global swap_avail_list will be:
swapA   -> swapB   -> swapC   -> swapD
prio:1     prio:2     prio:3     prio:4

New behaviour:
their priorities will be(note that -1 is skipped):
swapA: -2
swapB: -3
swapC: -4
swapD: -5
And their positions in the 4 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
swapA   -> swapB   -> swapC   -> swapD
prio:1     prio:3     prio:4     prio:5
swap_avali_lists[1]: /* node 1's available swap device list */
swapB   -> swapA   -> swapC   -> swapD
prio:1     prio:2     prio:4     prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
swapC   -> swapA   -> swapB   -> swapD
prio:1     prio:2     prio:3     prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
swapD   -> swapA   -> swapB   -> swapC
prio:1     prio:2     prio:3     prio:4

To see the effect of the patch, a test that starts N process, each mmap
a region of anonymous memory and then continually write to it at random
position to trigger both swap in and out is used.

On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices with each attached to a different node, the
result is:

runtime=30m/processes=32/total test size=128G/each process mmap region=4G
kernel         throughput
vanilla        13306
auto-binding   15169 +14%

runtime=30m/processes=64/total test size=128G/each process mmap region=2G
kernel         throughput
vanilla        11885
auto-binding   14879 +25%

[aaron.lu@intel.com: v2]
  Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
  Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
[akpm@linux-foundation.org: use kmalloc_array()]
Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Michal Hocko
da99ecf117 mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
TIF_MEMDIE is set only to the tasks whick were either directly selected
by the OOM killer or passed through mark_oom_victim from the allocator
path.  tsk_is_oom_victim is more generic and allows to identify all
tasks (threads) which share the mm with the oom victim.

Please note that the freezer still needs to check TIF_MEMDIE because we
cannot thaw tasks which do not participage in oom_victims counting
otherwise a !TIF_MEMDIE task could interfere after oom_disbale returns.

Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Michal Hocko
cd04ae1e2d mm, oom: do not rely on TIF_MEMDIE for memory reserves access
For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full access
to memory reserves.  There are few shortcomings of this implementation,
though.

First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.

Secondly this flag is per task_struct while the OOM killer operates on
mm_struct granularity so all processes sharing the given mm are killed.
Giving the full access to all these task_structs could lead to a quick
memory reserves depletion.  We have tried to reduce this risk by giving
TIF_MEMDIE only to the main thread and the currently allocating task but
that doesn't really solve this problem while it surely opens up a room
for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
allocator without access to memory reserves because a particular thread
was not the group leader.

Now that we have the oom reaper and that all oom victims are reapable
after 1b51e65eab ("oom, oom_reaper: allow to reap mm shared by the
kthreads") we can be more conservative and grant only partial access to
memory reserves because there are reasonable chances of the parallel
memory freeing.  We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory.  oom victims
will still contend with __GFP_HIGH users but those shouldn't be so
aggressive to starve oom victims completely.

Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
the half of the reserves.  This makes the access to reserves independent
on which task has passed through mark_oom_victim.  Also drop any usage
of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well which will make page_alloc.c completely
TIF_MEMDIE free finally.

CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach.

There is a demand to make the oom killer memcg aware which will imply
many tasks killed at once.  This change will allow such a usecase
without worrying about complete memory reserves depletion.

Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Vitaly Wool
d30561c56f z3fold: use per-cpu unbuddied lists
It's been noted that z3fold doesn't scale well when it's run in a large
number of threads on many cores, which can be easily reproduced with fio
'randrw' test with --numjobs=32.  E.g.  the result for 1 cluster (4 cores)
is:

Run status group 0 (all jobs):
   READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ...
  WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ...

While for 8 cores (2 clusters) the result is:

Run status group 0 (all jobs):
   READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ...
  WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ...

The bottleneck here is the pool lock which many threads become waiting
upon.  To reduce that spin lock contention, z3fold can operate only on
the lists local to the current CPU whenever possible.  Due to the nature
of z3fold unbuddied list handling (it only takes the first entry off the
list on a hot path), if the z3fold pool is big enough and balanced well
enough, limiting search to only local unbuddied list doesn't lead to a
significant compression ratio degrade (2.57x vs 2.65x in our
measurements).

This patch also introduces two worker threads: one for async in-page
object layout optimization and one for releasing freed pages.  This is
done to speed up z3fold_free() which is often on a hot path.

The fio results for 8-core case are now the following:

Run status group 0 (all jobs):
   READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ...
  WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ...

So we're in for almost 6x performance increase.

Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Huang Ying
81a0298bdf mm, swap: don't use VMA based swap readahead if HDD is used as swap
VMA based swap readahead will readahead the virtual pages that is
continuous in the virtual address space.  While the original swap
readahead will readahead the swap slots that is continuous in the swap
device.  Although VMA based swap readahead is more correct for the swap
slots to be readahead, it will trigger more small random readings, which
may cause the performance of HDD (hard disk) to degrade heavily, and may
finally exceed the benefit.

To avoid the issue, in this patch, if the HDD is used as swap, the VMA
based swap readahead will be disabled, and the original swap readahead
will be used instead.

Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Huang Ying
d9bfcfdc41 mm, swap: add sysfs interface for VMA based swap readahead
The sysfs interface to control the VMA based swap readahead is added as
follow,

/sys/kernel/mm/swap/vma_ra_enabled

Enable the VMA based swap readahead algorithm, or use the original
global swap readahead algorithm.

/sys/kernel/mm/swap/vma_ra_max_order

Set the max order of the readahead window size for the VMA based swap
readahead algorithm.

The corresponding ABI documentation is added too.

Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Huang Ying
ec560175c0 mm, swap: VMA based swap readahead
The swap readahead is an important mechanism to reduce the swap in
latency.  Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.

In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory.  And the different tasks in the system may have different access
patterns, which makes the global space locality estimation incorrect.

In this patch, when page fault occurs, the virtual pages near the fault
address will be readahead instead of the swap slots near the fault swap
slot in swap device.  This avoid to readahead the unrelated swap slots.
At the same time, the swap readahead is changed to work on per-VMA from
globally.  So that the different access patterns of the different VMAs
could be distinguished, and the different readahead policy could be
applied accordingly.  The original core readahead detection and scaling
algorithm is reused, because it is an effect algorithm to detect the
space locality.

The test and result is as follow,

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds.  The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.

At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds.  This will trigger random swap-in
in the background.

This is a combined workload with sequential and random memory accessing
at the same time.  The result (for sequential workload) is as follow,

			Base		Optimized
			----		---------
throughput		345413 KB/s	414029 KB/s (+19.9%)
latency.average		97.14 us	61.06 us (-37.1%)
latency.50th		2 us		1 us
latency.60th		2 us		1 us
latency.70th		98 us		2 us
latency.80th		160 us		2 us
latency.90th		260 us		217 us
latency.95th		346 us		369 us
latency.99th		1.34 ms		1.09 ms
ra_hit%			52.69%		99.98%

The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower.  The VMA-base
readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

			Base		Optimized
			----		---------
elapsed_time		393.49 s	329.88 s (-16.2%)
ra_hit%			86.21%		98.82%

The score of base and optimized kernel hasn't visible changes.  But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages.  And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.

Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Huang Ying
c4fa63092f mm, swap: fix swap readahead marking
In the original implementation, it is possible that the existing pages
in the swap cache (not newly readahead) could be marked as the readahead
pages.  This will cause the statistics of swap readahead be wrong and
influence the swap readahead algorithm too.

This is fixed via marking a page as the readahead page only if it is
newly allocated and read from the disk.

When testing with linpack, after the fixing the swap readahead hit rate
increased from ~66% to ~86%.

Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Huang Ying
cbc65df240 mm, swap: add swap readahead hit statistics
Patch series "mm, swap: VMA based swap readahead", v4.

The swap readahead is an important mechanism to reduce the swap in
latency.  Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.

In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory space.  And the different tasks in the system may have different
access patterns, which makes the global space locality estimation
incorrect.

In this patchset, when page fault occurs, the virtual pages near the
fault address will be readahead instead of the swap slots near the fault
swap slot in swap device.  This avoid to readahead the unrelated swap
slots.  At the same time, the swap readahead is changed to work on
per-VMA from globally.  So that the different access patterns of the
different VMAs could be distinguished, and the different readahead
policy could be applied accordingly.  The original core readahead
detection and scaling algorithm is reused, because it is an effect
algorithm to detect the space locality.

In addition to the swap readahead changes, some new sysfs interface is
added to show the efficiency of the readahead algorithm and some other
swap statistics.

This new implementation will incur more small random read, on SSD, the
improved correctness of estimation and readahead target should beat the
potential increased overhead, this is also illustrated in the test
results below.  But on HDD, the overhead may beat the benefit, so the
original implementation will be used by default.

The test and result is as follow,

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds.  The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.

At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds.  This will trigger random swap-in
in the background.

This is a combined workload with sequential and random memory accessing
at the same time.  The result (for sequential workload) is as follow,

			Base		Optimized
			----		---------
throughput		345413 KB/s	414029 KB/s (+19.9%)
latency.average		97.14 us	61.06 us (-37.1%)
latency.50th		2 us		1 us
latency.60th		2 us		1 us
latency.70th		98 us		2 us
latency.80th		160 us		2 us
latency.90th		260 us		217 us
latency.95th		346 us		369 us
latency.99th		1.34 ms		1.09 ms
ra_hit%			52.69%		99.98%

The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower.  The VMA-base
readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

			Base		Optimized
			----		---------
elapsed_time		393.49 s	329.88 s (-16.2%)
ra_hit%			86.21%		98.82%

The score of base and optimized kernel hasn't visible changes.  But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages.  And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.

This patch (of 5):

The statistics for total readahead pages and total readahead hits are
recorded and exported via the following sysfs interface.

/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total

With them, the efficiency of the swap readahead could be measured, so
that the swap readahead algorithm and parameters could be tuned
accordingly.

[akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00