Commit Graph

123 Commits

Author SHA1 Message Date
Oscar Salvador
65c7878413 kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable
This is a preparation for the next patch.

Currently, we only call release_mem_region_adjustable() in __remove_pages
if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
are being released by themselves with devm_release_mem_region.

Since we do not want to touch any zone/page stuff during the removing of
the memory (but during the offlining), we do not want to check for the
zone here.  So we need another way to tell release_mem_region_adjustable()
to not realease the resource in case it belongs to HMM/devm.

HMM/devm acquires/releases a resource through
devm_request_mem_region/devm_release_mem_region.

These resources have the flag IORESOURCE_MEM, while resources acquired by
hot-add memory path (register_memory_resource()) contain
IORESOURCE_SYSTEM_RAM.

So, we can check for this flag in release_mem_region_adjustable, and if
the resource does not contain such flag, we know that we are dealing with
a HMM/devm resource, so we can back off.

Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Borislav Petkov
f26621e60b resource/docs: Complete kernel-doc style function documentation
Add the missing kernel-doc style function parameters documentation.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: linux-tip-commits@vger.kernel.org
Cc: rdunlap@infradead.org
Fixes: b69c2e20f6 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-07 16:47:47 +01:00
Randy Dunlap
f75d651587 resource/docs: Fix new kernel-doc warnings
The first group of warnings is caused by a "/**" kernel-doc notation
marker but the function comments are not in kernel-doc format.
Also add another error return value here.

  ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'

Add the missing function parameter documentation for the other warnings:

  ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
  ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b69c2e20f6 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-05 07:05:04 +01:00
Borislav Petkov
b69c2e20f6 resource: Clean it up a bit
- Drop BUG_ON()s and do normal error handling instead, in
  find_next_iomem_res().

- Align function arguments on opening braces.

- Get rid of local var sibling_only in find_next_iomem_res().

- Shorten unnecessarily long first_level_children_only arg name.

Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
Link: <new submission>
2018-10-09 17:25:58 +02:00
Bjorn Helgaas
010a93bf97 resource: Fix find_next_iomem_res() iteration issue
Previously find_next_iomem_res() used "*res" as both an input parameter for
the range to search and the type of resource to search for, and an output
parameter for the resource we found, which makes the interface confusing.

The current callers use find_next_iomem_res() incorrectly because they
allocate a single struct resource and use it for repeated calls to
find_next_iomem_res().  When find_next_iomem_res() returns a resource, it
overwrites the start, end, flags, and desc members of the struct.  If we
call find_next_iomem_res() again, we must update or restore these fields.
The previous code restored res.start and res.end, but not res.flags or
res.desc.

Since the callers did not restore res.flags, if they searched for flags
IORESOURCE_MEM | IORESOURCE_BUSY and found a resource with flags
IORESOURCE_MEM | IORESOURCE_BUSY | IORESOURCE_SYSRAM, the next search would
incorrectly skip resources unless they were also marked as
IORESOURCE_SYSRAM.

Fix this by restructuring the interface so it takes explicit "start, end,
flags" parameters and uses "*res" only as an output parameter.

Based on a patch by Lianbo Jiang <lijiang@redhat.com>.

 [ bp: While at it:
   - make comments kernel-doc style.
   -

Originally-by: http://lore.kernel.org/lkml/20180921073211.20097-2-lijiang@redhat.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812916.1157.177580438135143788.stgit@bhelgaas-glaptop.roam.corp.google.com
2018-10-09 17:18:36 +02:00
Bjorn Helgaas
a98959fdbd resource: Include resource end in walk_*() interfaces
find_next_iomem_res() finds an iomem resource that covers part of a range
described by "start, end".  All callers expect that range to be inclusive,
i.e., both start and end are included, but find_next_iomem_res() doesn't
handle the end address correctly.

If it finds an iomem resource that contains exactly the end address, it
skips it, e.g., if "start, end" is [0x0-0x10000] and there happens to be an
iomem resource [mem 0x10000-0x10000] (the single byte at 0x10000), we skip
it:

  find_next_iomem_res(...)
  {
    start = 0x0;
    end = 0x10000;
    for (p = next_resource(...)) {
      # p->start = 0x10000;
      # p->end = 0x10000;
      # we *should* return this resource, but this condition is false:
      if ((p->end >= start) && (p->start < end))
        break;

Adjust find_next_iomem_res() so it allows a resource that includes the
single byte at the end of the range.  This is a corner case that we
probably don't see in practice.

Fixes: 58c1b5b079 ("[PATCH] memory hotadd fixes: find_next_system_ram catch range fix")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812254.1157.16736368485811773752.stgit@bhelgaas-glaptop.roam.corp.google.com
2018-10-09 17:18:34 +02:00
Linus Torvalds
7d3bf613e9 libnvdimm for 4.18
* DAX broke a fundamental assumption of truncate of file mapped pages.
   The truncate path assumed that it is safe to disconnect a pinned page
   from a file and let the filesystem reclaim the physical block. With DAX
   the page is equivalent to the filesystem block. Introduce
   dax_layout_busy_page() to enable filesystems to wait for pinned DAX
   pages to be released. Without this wait a filesystem could allocate
   blocks under active device-DMA to a new file.
 
 * DAX arranges for the block layer to be bypassed and uses
   dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
   However, the memcpy_mcsafe() facility is available through the pmem
   block driver. In order to safely handle media errors, via the DAX
   block-layer bypass, introduce copy_to_iter_mcsafe().
 
 * Fix cache management policy relative to the ACPI NFIT Platform
   Capabilities Structure to properly elide cache flushes when they are not
   necessary. The table indicates whether CPU caches are power-fail
   protected. Clarify that a deep flush is always performed on
   REQ_{FUA,PREFLUSH} requests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
 6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
 maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
 aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
 LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
 FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
 wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
 KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
 fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
 2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
 PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
 4GBvt9ri9i/Ww/A+ppWs
 =4bfw
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...
2018-06-08 17:21:52 -07:00
Dan Williams
d76401ade0 libnvdimm, e820: Register all pmem resources
There is currently a mismatch between the resources that will trigger
the e820_pmem driver to register/load and the resources that will
actually be surfaced as pmem ranges. register_e820_pmem() uses
walk_iomem_res_desc() which includes children and siblings. In contrast,
e820_pmem_probe() only considers top level resources. For example the
following resource tree results in the driver being loaded, but no
resources being registered:

    398000000000-39bfffffffff : PCI Bus 0000:ae
      39be00000000-39bf07ffffff : PCI Bus 0000:af
        39be00000000-39beffffffff : 0000:af:00.0
          39be10000000-39beffffffff : Persistent Memory (legacy)

Fix this up to allow definitions of "legacy" pmem ranges anywhere in
system-physical address space. Not that it is a recommended or safe to
define a pmem range in PCI space, but it is useful for debug /
experimentation, and the restriction on being a top-level resource was
arbitrary.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-06-02 17:05:43 -07:00
Christoph Hellwig
4e292a9667 resource: switch to proc_create_seq_data
And use the root resource directly from the proc private data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Takashi Iwai
60bb83b811 resource: fix integer overflow at reallocation
We've got a bug report indicating a kernel panic at booting on an x86-32
system, and it turned out to be the invalid PCI resource assigned after
reallocation.  __find_resource() first aligns the resource start address
and resets the end address with start+size-1 accordingly, then checks
whether it's contained.  Here the end address may overflow the integer,
although resource_contains() still returns true because the function
validates only start and end address.  So this ends up with returning an
invalid resource (start > end).

There was already an attempt to cover such a problem in the commit
47ea91b405 ("Resource: fix wrong resource window calculation"), but
this case is an overseen one.

This patch adds the validity check of the newly calculated resource for
avoiding the integer overflow problem.

Bugzilla: http://bugzilla.opensuse.org/show_bug.cgi?id=1086739
Link: http://lkml.kernel.org/r/s5hpo37d5l8.wl-tiwai@suse.de
Fixes: 23c570a674 ("resource: ability to resize an allocated resource")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Michael Henders <hendersm@shaw.ca>
Tested-by: Michael Henders <hendersm@shaw.ca>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:27 -07:00
Linus Torvalds
a2e5790d84 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - kasan updates

 - procfs

 - lib/bitmap updates

 - other lib/ updates

 - checkpatch tweaks

 - rapidio

 - ubsan

 - pipe fixes and cleanups

 - lots of other misc bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  Documentation/sysctl/user.txt: fix typo
  MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns
  MAINTAINERS: update various PALM patterns
  MAINTAINERS: update "ARM/OXNAS platform support" patterns
  MAINTAINERS: update Cortina/Gemini patterns
  MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern
  MAINTAINERS: remove ANDROID ION pattern
  mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors
  mm: docs: fix parameter names mismatch
  mm: docs: fixup punctuation
  pipe: read buffer limits atomically
  pipe: simplify round_pipe_size()
  pipe: reject F_SETPIPE_SZ with size over UINT_MAX
  pipe: fix off-by-one error when checking buffer limits
  pipe: actually allow root to exceed the pipe buffer limits
  pipe, sysctl: remove pipe_proc_fn()
  pipe, sysctl: drop 'min' parameter from pipe-max-size converter
  kasan: rework Kconfig settings
  crash_dump: is_kdump_kernel can be boolean
  kernel/mutex: mutex_is_locked can be boolean
  ...
2018-02-06 22:15:42 -08:00
Yaowei Bai
9825b451f9 kernel/resource: iomem_is_exclusive can be boolean
Make iomem_is_exclusive return bool due to this particular function only
using either one or zero as its return value.

No functional change.

Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06 18:32:47 -08:00
Bjorn Helgaas
f37e2334bc resource: Set type when reserving new regions
Set resource structs inserted by __reserve_region_with_split() to have the
correct type.  Setting the type doesn't fix any functional problem but
makes %pR on the resource work better.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-12-18 23:07:47 -06:00
Bjorn Helgaas
ffd2e8df8d resource: Set type of "reserve=" user-specified resources
When we reserve regions because the user specified a "reserve=" parameter,
set the resource type to either IORESOURCE_IO (for regions below 0x10000)
or IORESOURCE_MEM.  The test for 0x10000 is just a heuristic; obviously
there can be memory below 0x10000 as well.

Improve documentation of the "reserve=" parameter.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-12-18 23:07:46 -06:00
Tom Lendacky
0e4c12b45a x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages
In order for memory pages to be properly mapped when SEV is active, it's
necessary to use the PAGE_KERNEL protection attribute as the base
protection.  This ensures that memory mapping of, e.g. ACPI tables,
receives the proper mapping attributes.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: kvm@vger.kernel.org
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171020143059.3291-11-brijesh.singh@amd.com
2017-11-07 15:35:58 +01:00
Tom Lendacky
1d2e733b13 resource: Provide resource struct in resource walk callback
In preperation for a new function that will need additional resource
information during the resource walk, update the resource walk callback to
pass the resource structure.  Since the current callback start and end
arguments are pulled from the resource structure, the callback functions
can obtain them from the resource structure directly.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20171020143059.3291-10-brijesh.singh@amd.com
2017-11-07 15:35:57 +01:00
Tom Lendacky
4ac2aed837 resource: Consolidate resource walking code
The walk_iomem_res_desc(), walk_system_ram_res() and walk_system_ram_range()
functions each have much of the same code.

Create a new function that consolidates the common code from these
functions in one place to reduce the amount of duplicated code.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20171020143059.3291-9-brijesh.singh@amd.com
2017-11-07 15:35:57 +01:00
Linus Torvalds
51d7b12041 /proc/iomem: only expose physical resource addresses to privileged users
In commit c4004b02f8 ("x86: remove the kernel code/data/bss resources
from /proc/iomem") I was hoping to remove the phyiscal kernel address
data from /proc/iomem entirely, but that had to be reverted because some
system programs actually use it.

This limits all the detailed resource information to properly
credentialed users instead.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-14 12:56:09 -07:00
Toshi Kani
8095d0f225 resource: Export insert_resource and remove_resource
insert_resource() and remove_resouce() are called by producers
of resources, such as FW modules and bus drivers.  These modules
may be implemented as loadable modules.

Export insert_resource() and remove_resouce() so that they can
be called from such modules.

link: https://lkml.org/lkml/2016/3/8/872
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Toshi Kani
ff3cc952d3 resource: Add remove_resource interface
insert_resource() and insert_resource_conflict() are called
by resource producers to insert a new resource.  When there
is any conflict, they move conflicting resources down to the
children of the new resource.  There is no destructor of these
interfaces, however.

Add remove_resource(), which removes a resource previously
inserted by insert_resource() or insert_resource_conflict(),
and moves the children up to where they were before.

__release_resource() is changed to have @release_child, so
that this function can be used for remove_resource() as well.

Also add comments to clarify that these functions are intended
for producers of resources to avoid any confusion with
request/release_resource() for consumers.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Toshi Kani
4e0d8f7eff resource: Change __request_region to inherit from immediate parent
__request_region() sets 'flags' of a new resource from @parent
as it inherits the parent's attribute.  When a target resource
has a conflict, this function inserts the new resource entry
under the conflicted entry by updating @parent.  In this case,
the new resource entry needs to inherit attribute from the updated
parent.  This conflict is a typical case since __request_region()
is used to allocate a new resource from a specific resource range.

For instance, request_mem_region() calls __request_region() with
@parent set to &iomem_resource, which is the root entry of the
whole iomem range.  When this request results in inserting a new
entry "DEV-A" under "BUS-1", "DEV-A" needs to inherit from the
immediate parent "BUS-1" as it holds specific attribute for the
range.

root (&iomem_resource)
 :
 + "BUS-1"
    + "DEV-A"

Change __request_region() to set 'flags' and 'desc' of a new entry
from the immediate parent.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Ingo Molnar
bc94b99636 Linux 4.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW0yM6AAoJEHm+PkMAQRiGeUwIAJRTHFPJTFpJcJjeZEV4/EL1
 7Pl0WSHs/CWBkXIevAg2HgkECSQ9NI9FAUFvoGxCldDpFAnL1U2QV8+Ur2qhiXMG
 5v0jILJuiw57qT/NfhEudZolerlRoHILmB3JRTb+DUV4GHZuWpTkJfUSI9j5aTEl
 w83XUgtK4bKeIyFbHdWQk6xqfzfFBSuEITuSXreOMwkFfMmeScE0WXOPLBZWyhPa
 v0rARJLYgM+vmRAnJjnG8unH+SgnqiNcn2oOFpevKwmpVcOjcEmeuxh/HdeZf7HM
 /R8F86OwdmXsO+z8dQxfcucLg+I9YmKfFr8b6hopu1sRztss2+Uk6H1j2J7IFIg=
 =tvkh
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc6' into core/resources, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-04 12:12:08 +01:00
Simon Guinot
59ceeaaf35 kernel/resource.c: fix muxed resource handling in __request_region()
In __request_region, if a conflict with a BUSY and MUXED resource is
detected, then the caller goes to sleep and waits for the resource to be
released.  A pointer on the conflicting resource is kept.  At wake-up
this pointer is used as a parent to retry to request the region.

A first problem is that this pointer might well be invalid (if for
example the conflicting resource have already been freed).  Another
problem is that the next call to __request_region() fails to detect a
remaining conflict.  The previously conflicting resource is passed as a
parameter and __request_region() will look for a conflict among the
children of this resource and not at the resource itself.  It is likely
to succeed anyway, even if there is still a conflict.

Instead, the parent of the conflicting resource should be passed to
__request_region().

As a fix, this patch doesn't update the parent resource pointer in the
case we have to wait for a muxed region right after.

Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Tested-by: Vincent Donnefort <vdonnefort@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-20 08:57:52 -08:00
Toshi Kani
a8fc42530d resource: Kill walk_iomem_res()
walk_iomem_res_desc() replaced walk_iomem_res() and there is no
caller to walk_iomem_res() any more. Kill it. Also remove @name
from find_next_iomem_res() as it is no longer used.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-17-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:59 +01:00
Toshi Kani
3f33647c41 resource: Add walk_iomem_res_desc()
Add a new interface, walk_iomem_res_desc(), which walks through
the iomem table by identifying a target with @flags and @desc.
This interface provides the same functionality as
walk_iomem_res(), but does not use strcmp() to @name for better
efficiency.

walk_iomem_res() is deprecated and will be removed in a later
patch.

Requested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
[ Fixup comments. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-14-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:59 +01:00
Toshi Kani
1c29f25bf5 memremap: Change region_intersects() to take @flags and @desc
Change region_intersects() to identify a target with @flags and
@desc, instead of @name with strcmp().

Change the callers of region_intersects(), memremap() and
devm_memremap(), to set IORESOURCE_SYSTEM_RAM in @flags and
IORES_DESC_NONE in @desc when searching System RAM.

Also, export region_intersects() so that the ACPI EINJ error
injection driver can call this function in a later patch.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-13-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:58 +01:00
Toshi Kani
bd7e6cb30c resource: Change walk_system_ram() to use System RAM type
Now that all System RAM resource entries have been initialized
to IORESOURCE_SYSTEM_RAM type, change walk_system_ram_res() and
walk_system_ram_range() to call find_next_iomem_res() by setting
@res.flags to IORESOURCE_SYSTEM_RAM and @name to NULL. With this
change, they walk through the iomem table to find System RAM
ranges without the need to do strcmp() on the resource names.

No functional change is made to the interfaces.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
[ Boris: fixup comments. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-11-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:58 +01:00
Toshi Kani
43ee493bde resource: Add I/O resource descriptor
walk_iomem_res() and region_intersects() still need to use
strcmp() for searching a resource entry by @name in the iomem
table.

This patch introduces I/O resource descriptor 'desc' in struct
resource for the iomem search interfaces. Drivers can assign
their unique descriptor to a range when they support the search
interfaces.

Otherwise, 'desc' is set to IORES_DESC_NONE (0). This avoids
changing most of the drivers as they typically allocate resource
entries statically, or by calling alloc_resource(), kzalloc(),
or alloc_bootmem_low(), which set the field to zero by default.
A later patch will address some drivers that use kmalloc()
without zero'ing the field.

Also change release_mem_region_adjustable() to set 'desc' when
its resource entry gets separated. Other resource interfaces are
also changed to initialize 'desc' explicitly although
alloc_resource() sets it to 0.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:56 +01:00
Toshi Kani
a3650d53ba resource: Handle resource flags properly
I/O resource flags consist of I/O resource types and modifier
bits. Therefore, checking an I/O resource type in 'flags' must
be performed with a bitwise operation.

Fix find_next_iomem_res() and region_intersects() that simply
compare 'flags' against a given value.

Also change __request_region() to set 'res->flags' from
resource_type() and resource_ext_type() of the parent, so that
children nodes will inherit the extended I/O resource type.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:56 +01:00
Dan Williams
90a545e981 restrict /dev/mem to idle io memory ranges
This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
semantics by default.  If userspace really believes it is safe to access
the memory region it can also perform the extra step of disabling an
active driver.  This protects device address ranges with read side
effects and otherwise directs userspace to use the driver.

Persistent memory presents a large "mistake surface" to /dev/mem as now
accidental writes can corrupt a filesystem.

In general if a device driver is busily using a memory region it already
informs other parts of the kernel to not touch it via
request_mem_region().  /dev/mem should honor the same safety restriction
by default.  Debugging a device driver from userspace becomes more
difficult with this enabled.  Any application using /dev/mem or mmap of
sysfs pci resources will now need to perform the extra step of either:

1/ Disabling the driver, for example:

   echo <device id> > /dev/bus/<parent bus>/drivers/<driver name>/unbind

2/ Rebooting with "iomem=relaxed" on the command line

3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n

Traditional users of /dev/mem like dosemu are unaffected because the
first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction.
Legacy X configurations use /dev/mem to talk to graphics hardware, but
that functionality has since moved to kernel graphics drivers.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 06:30:49 -08:00
Dan Williams
124fe20d94 mm: enhance region_is_ram() to region_intersects()
region_is_ram() is used to prevent the establishment of aliased mappings
to physical "System RAM" with incompatible cache settings.  However, it
uses "-1" to indicate both "unknown" memory ranges (ranges not described
by platform firmware) and "mixed" ranges (where the parameters describe
a range that partially overlaps "System RAM").

Fix this up by explicitly tracking the "unknown" vs "mixed" resource
cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
This re-write also adds support for detecting when the requested region
completely eclipses all of a resource.  Note, the implementation treats
overlaps between "unknown" and the requested memory type as
REGION_INTERSECTS.

Finally, other memory types can be passed in by name, for now the only
usage "System RAM".

Suggested-by: Luis R. Rodriguez <mcgrof@suse.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-10 23:07:05 -04:00
Toshi Kani
8c38de992b mm: Fix bugs in region_is_ram()
region_is_ram() looks up the iomem_resource table to check if
a target range is in RAM.  However, it always returns with -1
due to invalid range checks. It always breaks the loop at the
first entry of the table.

Another issue is that it compares p->flags and flags, but it always
fails. flags is declared as int, which makes it as a negative value
with IORESOURCE_BUSY (0x80000000) set while p->flags is unsigned long.

Fix the range check and flags so that region_is_ram() works as
advertised.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1437088996-28511-4-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-07-22 17:20:34 +02:00
Jakub Sitnicki
96831c0a67 kernel/resource.c: remove deprecated __check_region() and friends
All users of __check_region(), check_region(), and check_mem_region() are
gone.  We got rid of the last user in v4.0-rc1.  Remove them.

bloat-o-meter on x86_64 shows:

add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
function                                     old     new   delta
__kstrtab___check_region                      15       -     -15
__ksymtab___check_region                      16       -     -16
__check_region                                71       -     -71

Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00
Jiang Liu
90e9782061 resources: Move struct resource_list_entry from ACPI into resource core
Currently ACPI, PCI and pnp all implement the same resource list
management with different data structure. We need to transfer from
one data structure into another when passing resources from one
subsystem into another subsystem. So move struct resource_list_entry
from ACPI into resource core and rename it as resource_entry,
then it could be reused by different subystems and avoid the data
structure conversion.

Introduce dedicated header file resource_ext.h instead of embedding
it into ioport.h to avoid header file inclusion order issues.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-05 15:09:25 +01:00
Mike Travis
67cf13ceed x86: optimize resource lookups for ioremap
We have a large university system in the UK that is experiencing very long
delays modprobing the driver for a specific I/O device.  The delay is from
8-10 minutes per device and there are 31 devices in the system.  This 4 to
5 hour delay in starting up those I/O devices is very much a burden on the
customer.

There are two causes for requiring a restart/reload of the drivers.  First
is periodic preventive maintenance (PM) and the second is if any of the
devices experience a fatal error.  Both of these trigger this excessively
long delay in bringing the system back up to full capability.

The problem was tracked down to a very slow IOREMAP operation and the
excessively long ioresource lookup to insure that the user is not
attempting to ioremap RAM.  These patches provide a speed up to that
function.

The modprobe time appears to be affected quite a bit by previous activity
on the ioresource list, which I suspect is due to cache preloading.  While
the overall improvement is impacted by other overhead of starting the
devices, this drastically improves the modprobe time.

Also our system is considerably smaller so the percentages gained will not
be the same.  Best case improvement with the modprobe on our 20 device
smallish system was from 'real 5m51.913s' to 'real 0m18.275s'.

This patch (of 2):

Since the ioremap operation is verifying that the specified address range
is NOT RAM, it will search the entire ioresource list if the condition is
true.  To make matters worse, it does this one 4k page at a time.  For a
128M BAR region this is 32 passes to determine the entire region does not
contain any RAM addresses.

This patch provides another resource lookup function, region_is_ram, that
searches for the entire region specified, verifying that it is completely
contained within the resource region.  If it is found, then it is checked
to be RAM or not, within a single pass.

The return result reflects if it was found or not (-1), and whether it is
RAM (1) or not (0).  This allows the caller to fallback to the previous
page by page search if it was not found.

[akpm@linux-foundation.org: fix spellos and typos in comment]
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Alex Thorlton <athorlton@sgi.com>
Reviewed-by: Cliff Wickman <cpw@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:22 +02:00
Linus Torvalds
80213c03c4 PCI changes for the v3.18 merge window:
Enumeration
     - Check Vendor ID only for Config Request Retry Status (Rajat Jain)
     - Enable Config Request Retry Status when supported (Rajat Jain)
     - Add generic domain handling (Catalin Marinas)
     - Generate uppercase hex for modalias interface class (Ricardo Ribalda Delgado)
 
   Resource management
     - Add missing MEM_64 mask in pci_assign_unassigned_bridge_resources() (Yinghai Lu)
     - Increase IBM ipr SAS Crocodile BARs to at least system page size (Douglas Lehr)
 
   PCI device hotplug
     - Prevent NULL dereference during pciehp probe (Andreas Noever)
     - Move _HPP & _HPX handling into core (Bjorn Helgaas)
     - Apply _HPP to PCIe devices as well as PCI (Bjorn Helgaas)
     - Apply _HPP/_HPX to display devices (Bjorn Helgaas)
     - Preserve SERR & PARITY settings when applying _HPP/_HPX (Bjorn Helgaas)
     - Preserve MPS and MRRS settings when applying _HPP/_HPX (Bjorn Helgaas)
     - Apply _HPP/_HPX to all devices, not just hot-added ones (Bjorn Helgaas)
     - Fix wait time in pciehp timeout message (Yinghai Lu)
     - Add more pciehp Slot Control debug output (Yinghai Lu)
     - Stop disabling pciehp notifications during init (Yinghai Lu)
 
   MSI
     - Remove arch_msi_check_device() (Alexander Gordeev)
     - Rename pci_msi_check_device() to pci_msi_supported() (Alexander Gordeev)
     - Move D0 check into pci_msi_check_device() (Alexander Gordeev)
     - Remove unused kobject from struct msi_desc (Yijing Wang)
     - Remove "pos" from the struct msi_desc msi_attrib (Yijing Wang)
     - Add "msi_bus" sysfs MSI/MSI-X control for endpoints (Yijing Wang)
     - Use __get_cached_msi_msg() instead of get_cached_msi_msg() (Yijing Wang)
     - Use __read_msi_msg() instead of read_msi_msg() (Yijing Wang)
     - Use __write_msi_msg() instead of write_msi_msg() (Yijing Wang)
 
   Power management
     - Drop unused runtime PM support code for PCIe ports (Rafael J.  Wysocki)
     - Allow PCI devices to be put into D3cold during system suspend (Rafael J. Wysocki)
 
   AER
     - Add additional AER error strings (Gong Chen)
     - Make <linux/aer.h> standalone includable (Thierry Reding)
 
   Virtualization
     - Add ACS quirk for Solarflare SFC9120 & SFC9140 (Alex Williamson)
     - Add ACS quirk for Intel 10G NICs (Alex Williamson)
     - Add ACS quirk for AMD A88X southbridge (Marti Raudsepp)
     - Remove unused pci_find_upstream_pcie_bridge(), pci_get_dma_source() (Alex Williamson)
     - Add device flag helpers (Ethan Zhao)
     - Assume all Mellanox devices have broken INTx masking (Gavin Shan)
 
   Generic host bridge driver
     - Fix ioport_map() for !CONFIG_GENERIC_IOMAP (Liviu Dudau)
     - Add pci_register_io_range() and pci_pio_to_address() (Liviu Dudau)
     - Define PCI_IOBASE as the base of virtual PCI IO space (Liviu Dudau)
     - Fix the conversion of IO ranges into IO resources (Liviu Dudau)
     - Add pci_get_new_domain_nr() and of_get_pci_domain_nr() (Liviu Dudau)
     - Add support for parsing PCI host bridge resources from DT (Liviu Dudau)
     - Add pci_remap_iospace() to map bus I/O resources (Liviu Dudau)
     - Add arm64 architectural support for PCI (Liviu Dudau)
 
   APM X-Gene
     - Add APM X-Gene PCIe driver (Tanmay Inamdar)
     - Add arm64 DT APM X-Gene PCIe device tree nodes (Tanmay Inamdar)
 
   Freescale i.MX6
     - Probe in module_init(), not fs_initcall() (Lucas Stach)
     - Delay enabling reference clock for SS until it stabilizes (Tim Harvey)
 
   Marvell MVEBU
     - Fix uninitialized variable in mvebu_get_tgt_attr() (Thomas Petazzoni)
 
   NVIDIA Tegra
     - Make sure the PCIe PLL is really reset (Eric Yuen)
     - Add error path tegra_msi_teardown_irq() cleanup (Jisheng Zhang)
     - Fix extended configuration space mapping (Peter Daifuku)
     - Implement resource hierarchy (Thierry Reding)
     - Clear CLKREQ# enable on port disable (Thierry Reding)
     - Add Tegra124 support (Thierry Reding)
 
   ST Microelectronics SPEAr13xx
     - Pass config resource through reg property (Pratyush Anand)
 
   Synopsys DesignWare
     - Use NULL instead of false (Fabio Estevam)
     - Parse bus-range property from devicetree (Lucas Stach)
     - Use pci_create_root_bus() instead of pci_scan_root_bus() (Lucas Stach)
     - Remove pci_assign_unassigned_resources() (Lucas Stach)
     - Check private_data validity in single place (Lucas Stach)
     - Setup and clear exactly one MSI at a time (Lucas Stach)
     - Remove open-coded bitmap operations (Lucas Stach)
     - Fix configuration base address when using 'reg' (Minghuan Lian)
     - Fix IO resource end address calculation (Minghuan Lian)
     - Rename get_msi_data() to get_msi_addr() (Minghuan Lian)
     - Add get_msi_data() to pcie_host_ops (Minghuan Lian)
     - Add support for v3.65 hardware (Murali Karicheri)
     - Fold struct pcie_port_info into struct pcie_port (Pratyush Anand)
 
   TI Keystone
     - Add TI Keystone PCIe driver (Murali Karicheri)
     - Limit MRSS for all downstream devices (Murali Karicheri)
     - Assume controller is already in RC mode (Murali Karicheri)
     - Set device ID based on SoC to support multiple ports (Murali Karicheri)
 
   Xilinx AXI
     - Add Xilinx AXI PCIe driver (Srikanth Thokala)
     - Fix xilinx_pcie_assign_msi() return value test (Dan Carpenter)
 
   Miscellaneous
     - Clean up whitespace (Quentin Lambert)
     - Remove assignments from "if" conditions (Quentin Lambert)
     - Move PCI_VENDOR_ID_VMWARE to pci_ids.h (Francesco Ruggeri)
     - x86: Mark DMI tables as initialization data (Mathias Krause)
     - x86: Move __init annotation to the correct place (Mathias Krause)
     - x86: Mark constants of pci_mmcfg_nvidia_mcp55() as __initconst (Mathias Krause)
     - x86: Constify pci_mmcfg_probes[] array (Mathias Krause)
     - x86: Mark PCI BIOS initialization code as such (Mathias Krause)
     - Parenthesize PCI_DEVID and PCI_VPD_LRDT_ID parameters (Megan Kamiya)
     - Remove unnecessary variable in pci_add_dynid() (Tobias Klauser)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUNWmJAAoJEFmIoMA60/r8GncP/3uHRoBrnaF6pv+S1l1p3Fs/
 l1kKH91/IuAAU7VJX8pkNybFqx02topWmiVVXAzqvD01PcRLGCLjPbWl5h+y5/Ja
 CHZH33AwHAmm0kt4BrOSOeHTLJhAigly2zV3P4F8jRIgyaeMoGZ6Ko4tkQUpm21k
 +ohrOd4cxYkmzzCjKwsZZhKnyRNpae8FmTk3VQBPuN8DbhvFPrqo5/+GeAdSZTdS
 HZHpfl2HL4095aY7uBVsZqNkjQyl6SnWwjkjLnuI8q3qA3BLgDZE/Jr8F/MNuW1V
 y01JIjerFWMDFyBIkpg7moYnODy6oP3KvczwYdKGmqsJja+0MQvYhLTwD+R/yTQS
 SewJA0mL3T3EJEfnFYkCiaIX27xIwk/FxHfaKPN91xgx/QM7xCVZNrU2/dXjhoX1
 GqLKxOEaFHhWWTyT5Dj27I0ZcElzFZ3tIwvrHfs8y22oAuAlsAypaUgvUwRfL4CO
 hOj4ITZa0t041sYWqxCoGAA9Fdp8HMzNKKS5F4mhADz4Ad9v6uPCNv/s/RoxVsbm
 jhZOtPYJ0/iCA+kNVX563S8Z3VpfPI+7bBjcj2WKdzW+IlICvOKT+kvwL2Tv/rE7
 w0hrNsbkgGsYbPldMx7LwCavsUtYFuNj0zoU6vkhP2jk6O2Tn5VXDmjrXH0v3iHI
 v03vlUtre0bQ26fzDyLQ
 =4Zv1
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "The interesting things here are:

   - Turn on Config Request Retry Status Software Visibility.  This
     caused hangs last time, but we included a fix this time.
   - Rework PCI device configuration to use _HPP/_HPX more aggressively
   - Allow PCI devices to be put into D3cold during system suspend
   - Add arm64 PCI support
   - Add APM X-Gene host bridge driver
   - Add TI Keystone host bridge driver
   - Add Xilinx AXI host bridge driver

  More detailed summary:

  Enumeration
    - Check Vendor ID only for Config Request Retry Status (Rajat Jain)
    - Enable Config Request Retry Status when supported (Rajat Jain)
    - Add generic domain handling (Catalin Marinas)
    - Generate uppercase hex for modalias interface class (Ricardo Ribalda Delgado)

  Resource management
    - Add missing MEM_64 mask in pci_assign_unassigned_bridge_resources() (Yinghai Lu)
    - Increase IBM ipr SAS Crocodile BARs to at least system page size (Douglas Lehr)

  PCI device hotplug
    - Prevent NULL dereference during pciehp probe (Andreas Noever)
    - Move _HPP & _HPX handling into core (Bjorn Helgaas)
    - Apply _HPP to PCIe devices as well as PCI (Bjorn Helgaas)
    - Apply _HPP/_HPX to display devices (Bjorn Helgaas)
    - Preserve SERR & PARITY settings when applying _HPP/_HPX (Bjorn Helgaas)
    - Preserve MPS and MRRS settings when applying _HPP/_HPX (Bjorn Helgaas)
    - Apply _HPP/_HPX to all devices, not just hot-added ones (Bjorn Helgaas)
    - Fix wait time in pciehp timeout message (Yinghai Lu)
    - Add more pciehp Slot Control debug output (Yinghai Lu)
    - Stop disabling pciehp notifications during init (Yinghai Lu)

  MSI
    - Remove arch_msi_check_device() (Alexander Gordeev)
    - Rename pci_msi_check_device() to pci_msi_supported() (Alexander Gordeev)
    - Move D0 check into pci_msi_check_device() (Alexander Gordeev)
    - Remove unused kobject from struct msi_desc (Yijing Wang)
    - Remove "pos" from the struct msi_desc msi_attrib (Yijing Wang)
    - Add "msi_bus" sysfs MSI/MSI-X control for endpoints (Yijing Wang)
    - Use __get_cached_msi_msg() instead of get_cached_msi_msg() (Yijing Wang)
    - Use __read_msi_msg() instead of read_msi_msg() (Yijing Wang)
    - Use __write_msi_msg() instead of write_msi_msg() (Yijing Wang)

  Power management
    - Drop unused runtime PM support code for PCIe ports (Rafael J.  Wysocki)
    - Allow PCI devices to be put into D3cold during system suspend (Rafael J. Wysocki)

  AER
    - Add additional AER error strings (Gong Chen)
    - Make <linux/aer.h> standalone includable (Thierry Reding)

  Virtualization
    - Add ACS quirk for Solarflare SFC9120 & SFC9140 (Alex Williamson)
    - Add ACS quirk for Intel 10G NICs (Alex Williamson)
    - Add ACS quirk for AMD A88X southbridge (Marti Raudsepp)
    - Remove unused pci_find_upstream_pcie_bridge(), pci_get_dma_source() (Alex Williamson)
    - Add device flag helpers (Ethan Zhao)
    - Assume all Mellanox devices have broken INTx masking (Gavin Shan)

  Generic host bridge driver
    - Fix ioport_map() for !CONFIG_GENERIC_IOMAP (Liviu Dudau)
    - Add pci_register_io_range() and pci_pio_to_address() (Liviu Dudau)
    - Define PCI_IOBASE as the base of virtual PCI IO space (Liviu Dudau)
    - Fix the conversion of IO ranges into IO resources (Liviu Dudau)
    - Add pci_get_new_domain_nr() and of_get_pci_domain_nr() (Liviu Dudau)
    - Add support for parsing PCI host bridge resources from DT (Liviu Dudau)
    - Add pci_remap_iospace() to map bus I/O resources (Liviu Dudau)
    - Add arm64 architectural support for PCI (Liviu Dudau)

  APM X-Gene
    - Add APM X-Gene PCIe driver (Tanmay Inamdar)
    - Add arm64 DT APM X-Gene PCIe device tree nodes (Tanmay Inamdar)

  Freescale i.MX6
    - Probe in module_init(), not fs_initcall() (Lucas Stach)
    - Delay enabling reference clock for SS until it stabilizes (Tim Harvey)

  Marvell MVEBU
    - Fix uninitialized variable in mvebu_get_tgt_attr() (Thomas Petazzoni)

  NVIDIA Tegra
    - Make sure the PCIe PLL is really reset (Eric Yuen)
    - Add error path tegra_msi_teardown_irq() cleanup (Jisheng Zhang)
    - Fix extended configuration space mapping (Peter Daifuku)
    - Implement resource hierarchy (Thierry Reding)
    - Clear CLKREQ# enable on port disable (Thierry Reding)
    - Add Tegra124 support (Thierry Reding)

  ST Microelectronics SPEAr13xx
    - Pass config resource through reg property (Pratyush Anand)

  Synopsys DesignWare
    - Use NULL instead of false (Fabio Estevam)
    - Parse bus-range property from devicetree (Lucas Stach)
    - Use pci_create_root_bus() instead of pci_scan_root_bus() (Lucas Stach)
    - Remove pci_assign_unassigned_resources() (Lucas Stach)
    - Check private_data validity in single place (Lucas Stach)
    - Setup and clear exactly one MSI at a time (Lucas Stach)
    - Remove open-coded bitmap operations (Lucas Stach)
    - Fix configuration base address when using 'reg' (Minghuan Lian)
    - Fix IO resource end address calculation (Minghuan Lian)
    - Rename get_msi_data() to get_msi_addr() (Minghuan Lian)
    - Add get_msi_data() to pcie_host_ops (Minghuan Lian)
    - Add support for v3.65 hardware (Murali Karicheri)
    - Fold struct pcie_port_info into struct pcie_port (Pratyush Anand)

  TI Keystone
    - Add TI Keystone PCIe driver (Murali Karicheri)
    - Limit MRSS for all downstream devices (Murali Karicheri)
    - Assume controller is already in RC mode (Murali Karicheri)
    - Set device ID based on SoC to support multiple ports (Murali Karicheri)

  Xilinx AXI
    - Add Xilinx AXI PCIe driver (Srikanth Thokala)
    - Fix xilinx_pcie_assign_msi() return value test (Dan Carpenter)

  Miscellaneous
    - Clean up whitespace (Quentin Lambert)
    - Remove assignments from "if" conditions (Quentin Lambert)
    - Move PCI_VENDOR_ID_VMWARE to pci_ids.h (Francesco Ruggeri)
    - x86: Mark DMI tables as initialization data (Mathias Krause)
    - x86: Move __init annotation to the correct place (Mathias Krause)
    - x86: Mark constants of pci_mmcfg_nvidia_mcp55() as __initconst (Mathias Krause)
    - x86: Constify pci_mmcfg_probes[] array (Mathias Krause)
    - x86: Mark PCI BIOS initialization code as such (Mathias Krause)
    - Parenthesize PCI_DEVID and PCI_VPD_LRDT_ID parameters (Megan Kamiya)
    - Remove unnecessary variable in pci_add_dynid() (Tobias Klauser)"

* tag 'pci-v3.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (109 commits)
  arm64: dts: Add APM X-Gene PCIe device tree nodes
  PCI: Add ACS quirk for AMD A88X southbridge devices
  PCI: xgene: Add APM X-Gene PCIe driver
  PCI: designware: Remove open-coded bitmap operations
  PCI/MSI: Remove unnecessary temporary variable
  PCI/MSI: Use __write_msi_msg() instead of write_msi_msg()
  MSI/powerpc: Use __read_msi_msg() instead of read_msi_msg()
  PCI/MSI: Use __get_cached_msi_msg() instead of get_cached_msi_msg()
  PCI/MSI: Add "msi_bus" sysfs MSI/MSI-X control for endpoints
  PCI/MSI: Remove "pos" from the struct msi_desc msi_attrib
  PCI/MSI: Remove unused kobject from struct msi_desc
  PCI/MSI: Rename pci_msi_check_device() to pci_msi_supported()
  PCI/MSI: Move D0 check into pci_msi_check_device()
  PCI/MSI: Remove arch_msi_check_device()
  irqchip: armada-370-xp: Remove arch_msi_check_device()
  PCI/MSI/PPC: Remove arch_msi_check_device()
  arm64: Add architectural support for PCI
  PCI: Add pci_remap_iospace() to map bus I/O resources
  of/pci: Add support for parsing PCI host bridge resources from DT
  of/pci: Add pci_get_new_domain_nr() and of_get_pci_domain_nr()
  ...

Conflicts:
	arch/arm64/boot/dts/apm-storm.dtsi
2014-10-09 15:03:49 -04:00
Thierry Reding
8d38821cbc resources: Add device-managed request/release_resource()
Provide device-managed implementations of the request_resource() and
release_resource() functions.  Upon failure to request a resource, the new
devm_request_resource() function will output an error message for
consistent error reporting.

Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
2014-09-04 14:41:43 -06:00
Vivek Goyal
800df627e2 resource: fix the case of null pointer access
Richard and Daniel reported that UML is broken due to changes to
resource traversal functions.  Problem is that iomem_resource.child can
be null and new code does not consider that possibility.  Old code used
a for loop and that loop will not even execute if p was null.

Revert back to for() loop logic and bail out if p is null.

I also moved sibling_only check out of resource_lock. There is no
reason to keep it inside the lock.

Following is backtrace of the UML crash.

RIP: 0033:[<0000000060039b9f>]
RSP: 0000000081459da0  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000219b3fff RCX: 000000006010d1d9
RDX: 0000000000000001 RSI: 00000000602dfb94 RDI: 0000000081459df8
RBP: 0000000081459de0 R08: 00000000601b59f4 R09: ffffffff0000ff00
R10: ffffffff0000ff00 R11: 0000000081459e88 R12: 0000000081459df8
R13: 00000000219b3fff R14: 00000000602dfb94 R15: 0000000000000000
Kernel panic - not syncing: Segfault with no mm
CPU: 0 PID: 1 Comm: swapper Not tainted 3.16.0-10454-g58d08e3 #13
Stack:
 00000000 000080d0 81459df0 219b3fff
 81459e70 6010d1d9 ffffffff 6033e010
 81459e50 6003a269 81459e30 00000000
Call Trace:
 [<6010d1d9>] ? kclist_add_private+0x0/0xe7
 [<6003a269>] walk_system_ram_range+0x61/0xb7
 [<6000e859>] ? proc_kcore_init+0x0/0xf1
 [<6010d574>] kcore_update_ram+0x4c/0x168
 [<6010d72e>] ? kclist_add+0x0/0x2e
 [<6000e943>] proc_kcore_init+0xea/0xf1
 [<6000e859>] ? proc_kcore_init+0x0/0xf1
 [<6000e859>] ? proc_kcore_init+0x0/0xf1
 [<600189f0>] do_one_initcall+0x13c/0x204
 [<6004ca46>] ? parse_args+0x1df/0x2e0
 [<6004c82d>] ? parameq+0x0/0x3a
 [<601b5990>] ? strcpy+0x0/0x18
 [<60001e1a>] kernel_init_freeable+0x240/0x31e
 [<6026f1c0>] kernel_init+0x12/0x148
 [<60019fad>] new_thread_handler+0x81/0xa3

Fixes 8c86e70ace ("resource: provide new functions to walk
through resources").

Reported-by: Daniel Walter <sahne@0x90.at>
Tested-by: Richard Weinberger <richard@nod.at>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Daniel Walter <sahne@0x90.at>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29 16:28:15 -07:00
Vivek Goyal
8c86e70ace resource: provide new functions to walk through resources
I have added two more functions to walk through resources.

Currently walk_system_ram_range() deals with pfn and /proc/iomem can
contain partial pages.  By dealing in pfn, callback function loses the
info that last page of a memory range is a partial page and not the full
page.  So I implemented walk_system_ram_res() which returns u64 values to
callback functions and now it properly return start and end address.

walk_system_ram_range() uses find_next_system_ram() to find the next ram
resource.  This in turn only travels through siblings of top level child
and does not travers through all the nodes of the resoruce tree.  I also
need another function where I can walk through all the resources, for
example figure out where "GART" aperture is.  Figure out where ACPI memory
is.

So I wrote another function walk_iomem_res() which walks through all
/proc/iomem resources and returns matches as asked by caller.  Caller can
specify "name" of resource, start and end and flags.

Got rid of find_next_system_ram_res() and instead implemented more generic
find_next_iomem_res() which can be used to traverse top level children
only based on an argument.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:32 -07:00
Bjorn Helgaas
e4c7296643 resources: Clarify sanity check message
The resource map sanity check message is a bit confusing.  Change it to be
more readable:

  -resource map sanity check conflict: 0xfed10000 0xfed15fff 0xfed10000 0xfed13fff pnp 00:01
  +resource sanity check: requesting [mem 0xfed10000-0xfed15fff], which spans more than pnp 00:01 [mem 0xfed10000-0xfed13fff]

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-05-23 10:47:21 -06:00
Daeseok Youn
28ab49ff7f kernel/resource.c: make reallocate_resource() static
sparse says:

kernel/resource.c:518:5: warning:
 symbol 'reallocate_resource' was not declared. Should it be static?

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:07 -07:00
Bjorn Helgaas
6404e88e83 resources: Set type in __request_region()
We don't set the type (I/O, memory, etc.) of resources added by
__request_region(), which leads to confusing messages like this:

    address space collision: [io  0x1000-0x107f] conflicts with ACPI CPU throttle [??? 0x00001010-0x00001015 flags 0x80000000]

Set the type of a new resource added by __request_region() (used by
request_region() and request_mem_region()) to the type of its parent.  This
makes the resource tree internally consistent and fixes messages like the
above, where the ACPI CPU throttle resource really is an I/O port region,
but request_region() didn't fill in the type, so %pR didn't know how to
print it.

Sample dmesg showing the issue at the link below.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=71611
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-03-19 15:00:16 -06:00
Bjorn Helgaas
5edb93b89f resource: Add resource_contains()
We have two identical copies of resource_contains() already, and more
places that could use it.  This moves it to ioport.h where it can be
shared.

resource_contains(struct resource *r1, struct resource *r2) returns true
iff r1 and r2 are the same type (most callers already checked this
separately) and the r1 address range completely contains r2.

In addition, the new resource_contains() checks that both r1 and r2 have
addresses assigned to them.  If a resource is IORESOURCE_UNSET, it doesn't
have a valid address and can't contain or be contained by another resource.
Some callers already check this or for res->start.

No functional change.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-02-26 14:42:09 -07:00
Kevin Hao
0786f7b225 kernel/resource.c: remove the unneeded assignment in function __find_resource
This line was introduced by fcb11918 ("resources: add arch hook for
preventing allocation in reserved areas").  But the struct tmp was already
assigned to *new in the above line, so this seems superfluous.  Just
remove it.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:06 -07:00
Chen Gong
c5a130325f ACPI/APEI: Add parameter check before error injection
When param1 is enabled in EINJ but not assigned with a valid
value, sometimes it will cause the error like below:

APEI: Can not request [mem 0x7aaa7000-0x7aaa7007] for APEI EINJ Trigger registers

It is because some firmware will access target address specified in
param1 to trigger the error when injecting memory error. This will
cause resource conflict with regular memory. So It must be removed
from trigger table resources, but incorrect param1/param2
combination will stop this action. Add extra check to avoid
this kind of error.

Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-06 15:20:51 -07:00
Yasuaki Ishimatsu
ebff7d8f27 mem hotunplug: fix kfree() of bootmem memory
When hot removing memory presented at boot time, following messages are shown:

  kernel BUG at mm/slub.c:3409!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc ipmi_devintf ipmi_msghandler sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core e1000e ptp pps_core tpm_infineon ioatdma dca sr_mod cdrom sd_mod crc_t10dif usb_storage megaraid_sas lpfc scsi_transport_fc scsi_tgt scsi_mod
  CPU 0
  Pid: 5091, comm: kworker/0:2 Tainted: G        W    3.9.0-rc6+ #15
  RIP: kfree+0x232/0x240
  Process kworker/0:2 (pid: 5091, threadinfo ffff88084678c000, task ffff88083928ca80)
  Call Trace:
    __release_region+0xd4/0xe0
    __remove_pages+0x52/0x110
    arch_remove_memory+0x89/0xd0
    remove_memory+0xc4/0x100
    acpi_memory_device_remove+0x6d/0xb1
    acpi_device_remove+0x89/0xab
    __device_release_driver+0x7c/0xf0
    device_release_driver+0x2f/0x50
    acpi_bus_device_detach+0x6c/0x70
    acpi_ns_walk_namespace+0x11a/0x250
    acpi_walk_namespace+0xee/0x137
    acpi_bus_trim+0x33/0x7a
    acpi_bus_hot_remove_device+0xc4/0x1a1
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x1f7/0x590
    worker_thread+0x11a/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
  RIP  [<ffffffff811c41d2>] kfree+0x232/0x240
   RSP <ffff88084678d968>

The reason why the messages are shown is to release a resource
structure, allocated by bootmem, by kfree().  So when we release a
resource structure, we should check whether it is allocated by bootmem
or not.

But even if we know a resource structure is allocated by bootmem, we
cannot release it since SLxB cannot treat it.  So for reusing a resource
structure, this patch remembers it by using bootmem_resource as follows:

When releasing a resource structure by free_resource(), free_resource()
checks whether the resource structure is allocated by bootmem or not.
If it is allocated by bootmem, free_resource() adds it to
bootmem_resource.  If it is not allocated by bootmem, free_resource()
release it by kfree().

And when getting a new resource structure by get_resource(),
get_resource() checks whether bootmem_resource has released resource
structures or not.  If there is a released resource structure,
get_resource() returns it.  If there is not a releaed resource
structure, get_resource() returns new resource structure allocated by
kzalloc().

[akpm@linux-foundation.org: s/get_resource/alloc_resource/]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:40 -07:00
Toshi Kani
825f787bb4 resource: add release_mem_region_adjustable()
Add release_mem_region_adjustable(), which releases a requested region
from a currently busy memory resource.  This interface adjusts the
matched memory resource accordingly even if the requested region does
not match exactly but still fits into.

This new interface is intended for memory hot-delete.  During bootup,
memory resources are inserted from the boot descriptor table, such as
EFI Memory Table and e820.  Each memory resource entry usually covers
the whole contigous memory range.  Memory hot-delete request, on the
other hand, may target to a particular range of memory resource, and its
size can be much smaller than the whole contiguous memory.  Since the
existing release interfaces like __release_region() require a requested
region to be exactly matched to a resource entry, they do not allow a
partial resource to be released.

This new interface is restrictive (i.e.  release under certain
conditions), which is consistent with other release interfaces,
__release_region() and __release_resource().  Additional release
conditions, such as an overlapping region to a resource entry, can be
supported after they are confirmed as valid cases.

There is no change to the existing interfaces since their restriction is
valid for I/O resources.

[akpm@linux-foundation.org: use GFP_ATOMIC under write_lock()]
[akpm@linux-foundation.org: switch back to GFP_KERNEL, less buggily]
[akpm@linux-foundation.org: remove unneeded and wrong kfree(), per Toshi]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:37 -07:00
Toshi Kani
ae8e3a915a resource: add __adjust_resource() for internal use
Add __adjust_resource(), which is called by adjust_resource() internally
after the resource_lock is held.  There is no interface change to
adjust_resource().  This change allows other functions to call
__adjust_resource() internally while the resource_lock is held.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:37 -07:00
T Makphaibulchoke
4965f5667f kernel/resource.c: fix stack overflow in __reserve_region_with_split()
Using a recursive call add a non-conflicting region in
__reserve_region_with_split() could result in a stack overflow in the case
that the recursive calls are too deep.  Convert the recursive calls to an
iterative loop to avoid the problem.

Tested on a machine containing 135 regions.  The kernel no longer panicked
with stack overflow.

Also tested with code arbitrarily adding regions with no conflict,
embedding two consecutive conflicts and embedding two non-consecutive
conflicts.

Signed-off-by: T Makphaibulchoke <tmac@hp.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@gmail.com>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:31 +09:00
Octavian Purdila
65fed8f6f2 resource: make sure requested range is included in the root range
When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning bellow.

This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff).  In
this case __request_resource would return the whole root range as
conflict range (i.e.  0-ffffffff).  Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.

This patch aborts looking for an usable range when the request does not
intersect with the root range.  When the request partially overlaps with
the root range, it ajust the request to fall in the root range and then
continues with the new request.

When the request is modified or aborted errors and a stack trace are
logged to allow catching the errors in the upper layers.

[    5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[    5.975150] Modules linked in:
[    5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[    5.985324] Call Trace:
[    5.987759]  [<c1039dfc>] ? console_unlock+0x17b/0x18d
[    5.992891]  [<c1039620>] warn_slowpath_common+0x48/0x5d
[    5.998194]  [<c1031758>] ? sub_preempt_count+0x63/0x89
[    6.003412]  [<c1039644>] warn_slowpath_null+0xf/0x13
[    6.008453]  [<c1031758>] sub_preempt_count+0x63/0x89
[    6.013499]  [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[    6.018453]  [<c10c6349>] add_partial+0x36/0x3b
[    6.022973]  [<c10c7c0a>] deactivate_slab+0x96/0xb4
[    6.027842]  [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[    6.034456]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.039842]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.045232]  [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[    6.050710]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.056100]  [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[    6.061320]  [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[    6.067230]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[    7.179057]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[    7.184970]  [<c17b4779>] reserve_region_with_split+0x30/0x42
[    7.190709]  [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[    7.196623]  [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[    7.202184]  [<c17cad8a>] pcibios_init+0x23/0x35
[    7.206789]  [<c17ca574>] pci_subsys_init+0x3f/0x44
[    7.211659]  [<c1002088>] do_one_initcall+0x72/0x122
[    7.216615]  [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[    7.221659]  [<c17a27ff>] kernel_init+0xa6/0x118
[    7.226265]  [<c17a2759>] ? start_kernel+0x334/0x334
[    7.231223]  [<c14d7482>] kernel_thread_helper+0x6/0x10

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00