Commit Graph

14182 Commits

Author SHA1 Message Date
Peter Zijlstra
e3f3541c19 perf: Extend the mmap control page with time (TSC) fields
Extend the mmap control page with fields so that userspace can compute
time deltas relative to the provided time fields.

Currently only implemented for x86 with constant and nonstop TSC.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-3u1jucza77j3wuvs0x2bic0f@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 11:01:13 +01:00
Peter Zijlstra
0c9d42ed4c perf, x86: Provide means for disabling userspace RDPMC
Allow the disabling of RDPMC via a pmu specific attribute:

  echo 0 > /sys/bus/event_source/devices/cpu/rdpmc

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-pqeog465zo5hsimtkfz73f27@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 11:01:11 +01:00
Peter Zijlstra
fe4a330885 perf, x86: Implement user-space RDPMC support, to allow fast, user-space access to self-monitoring counters
Implement a correct pmu::event_idx for the x86 counter index rules and
set CR4.PCE on CPU_STARTING.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-mwxab34dibqgzk5zywutfnha@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 11:01:10 +01:00
Peter Zijlstra
35edc2a509 perf, arch: Rework perf_event_index()
Put the logic to compute the event index into a per pmu method. This
is required because the x86 rules are weird and wonderful and don't
match the capabilities of the current scheme.

AFAIK only powerpc actually has a usable userspace read of the PMCs
but I'm not at all sure anybody actually used that.

ARM is restored to the default since it currently does not support
userspace access at all. And all software events are provided with a
method that reports their index as 0 (disabled).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-dfydxodki16lylkt3gl2j7cw@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 11:01:07 +01:00
Stephane Eranian
9c1497ea59 perf events: Add Intel x86 mapping for PERF_COUNT_HW_REF_CPU_CYCLES
Add event maps for Intel x86 processors (with architected PMU v2 or later).

On AMD, there is frequency scaling but no Turbo. There is no core
cycle event not subject to frequency scaling, therefore we do not
provide a mapping.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323559734-3488-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:26:39 +01:00
Stephane Eranian
cd09c0c40a perf events: Enable raw event support for Intel unhalted_reference_cycles event
This patch adds the encoding and definitions necessary for the
unhalted_reference_cycles event avaialble since Intel Core 2 processors.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323559734-3488-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:26:32 +01:00
Ingo Molnar
d87f69a16e Merge commit 'v3.2-rc6' into perf/core
Merge reason: Update with the latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-20 20:32:11 +01:00
Linus Torvalds
42ebfc61cf Merge branch 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/swiotlb: Use page alignment for early buffer allocation.
  xen: only limit memory map to maximum reservation for domain 0.
2011-12-15 10:52:40 -08:00
Ian Campbell
d3db728125 xen: only limit memory map to maximum reservation for domain 0.
d312ae878b "xen: use maximum reservation to limit amount of usable RAM"
clamped the total amount of RAM to the current maximum reservation. This is
correct for dom0 but is not correct for guest domains. In order to boot a guest
"pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for
future memory expansion the guest must derive max_pfn from the e820 provided by
the toolstack and not the current maximum reservation (which can reflect only
the current maximum, not the guest lifetime max). The existing algorithm
already behaves this correctly if we do not artificially limit the maximum
number of pages for the guest case.

For a guest booted with maxmem=512, memory=128 this results in:
 [    0.000000] BIOS-provided physical RAM map:
 [    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
 [    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
-[    0.000000]  Xen: 0000000000100000 - 0000000008100000 (usable)
-[    0.000000]  Xen: 0000000008100000 - 0000000020800000 (unusable)
+[    0.000000]  Xen: 0000000000100000 - 0000000020800000 (usable)
...
 [    0.000000] NX (Execute Disable) protection: active
 [    0.000000] DMI not present or invalid.
 [    0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
 [    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
-[    0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000
+[    0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000
 [    0.000000] initial memory mapped : 0 - 027ff000
 [    0.000000] Base memory trampoline at [c009f000] 9f000 size 4096
-[    0.000000] init_memory_mapping: 0000000000000000-0000000008100000
-[    0.000000]  0000000000 - 0008100000 page 4k
-[    0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000
+[    0.000000] init_memory_mapping: 0000000000000000-0000000020800000
+[    0.000000]  0000000000 - 0020800000 page 4k
+[    0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000
 [    0.000000] xen: setting RW the range 27e8000 - 27ff000
 [    0.000000] 0MB HIGHMEM available.
-[    0.000000] 129MB LOWMEM available.
-[    0.000000]   mapped low ram: 0 - 08100000
-[    0.000000]   low ram: 0 - 08100000
+[    0.000000] 520MB LOWMEM available.
+[    0.000000]   mapped low ram: 0 - 20800000
+[    0.000000]   low ram: 0 - 20800000

With this change "xl mem-set <domain> 512M" will successfully increase the
guest RAM (by reducing the balloon).

There is no change for dom0.

Reported-and-Tested-by:  George Shuklin <george.shuklin@gmail.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: stable@kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-12-15 11:24:02 -05:00
Keith Packard
e1ad783b12 Revert "x86, efi: Calling __pa() with an ioremap()ed address is invalid"
This hangs my MacBook Air at boot time; I get no console
messages at all. I reverted this on top of -rc5 and my machine
boots again.

This reverts commit e8c7106280.

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Keith Packard <keithp@keithp.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-12 18:25:56 +01:00
Matt Fleming
6d3e32e63f x86, efi: Make efi_call_phys_{prelog,epilog} CONFIG_RELOCATABLE-aware
efi_call_phys_prelog() sets up a 1:1 mapping of the physical address
range in swapper_pg_dir. Instead of replacing then restoring entries
in swapper_pg_dir we should be using initial_page_table which already
contains the 1:1 mapping.

It's safe to blindly switch back to swapper_pg_dir in the epilog
because the physical EFI routines are only called before
efi_enter_virtual_mode(), e.g. before any user processes have been
forked. Therefore, we don't need to track which pgd was in %cr3 when
we entered the prelog.

The previous code actually contained a bug because it assumed that the
kernel was loaded at a physical address within the first 8MB of ram,
usually at 0x100000. However, this isn't the case with a
CONFIG_RELOCATABLE=y kernel which could have been loaded anywhere in
the physical address space.

Also delete the ancient (and bogus) comments about the page table
being restored after the lock is released. There is no locking.

Cc: Matthew Garrett <mjg@redhat.com>
Cc: Darrent Hart <dvhart@linux.intel.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1323346250.3894.74.camel@mfleming-mobl1.ger.corp.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-09 16:39:11 -08:00
Linus Torvalds
a776878d6c Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, efi: Calling __pa() with an ioremap()ed address is invalid
  x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked
  x86/intel_mid: Kconfig select fix
  x86/intel_mid: Fix the Kconfig for MID selection
2011-12-09 14:45:12 -08:00
Youquan Song
b6999b1912 thp: add compound tail page _mapcount when mapped
With the 3.2-rc kernel, IOMMU 2M pages in KVM works.  But when I tried
to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page
failed to be used.

The root cause is that 1GB page allocation calls gup_huge_pud() while 2M
page calls gup_huge_pmd.  If compound pages are used and the page is a
tail page, gup_huge_pmd() increases _mapcount to record tail page are
mapped while gup_huge_pud does not do that.

So when the mapped page is relesed, it will result in kernel oops
because the page is not marked mapped.

This patch add tail process for compound page in 1GB huge page which
keeps the same process as 2M page.

Reproduce like:
1. Add grub boot option: hugepagesz=1G hugepages=8
2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
	-net none -device pci-assign,host=07:00.1

  kernel BUG at mm/swap.c:114!
  invalid opcode: 0000 [#1] SMP
  Call Trace:
    put_page+0x15/0x37
    kvm_release_pfn_clean+0x31/0x36
    kvm_iommu_put_pages+0x94/0xb1
    kvm_iommu_unmap_memslots+0x80/0xb6
    kvm_assign_device+0xba/0x117
    kvm_vm_ioctl_assigned_device+0x301/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
  RIP  put_compound_page+0xd4/0x168

Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-09 07:50:28 -08:00
Matt Fleming
e8c7106280 x86, efi: Calling __pa() with an ioremap()ed address is invalid
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.

On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:

  BUG: unable to handle kernel paging request at f7f22280
  IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
  [...]

  Call Trace:
   [<c104f8ca>] ? page_is_ram+0x1a/0x40
   [<c1025aff>] reserve_memtype+0xdf/0x2f0
   [<c1024dc9>] set_memory_uc+0x49/0xa0
   [<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
   [<c19216d4>] start_kernel+0x291/0x2f2
   [<c19211c7>] ? loglevel+0x1b/0x1b
   [<c19210bf>] i386_start_kernel+0xbf/0xc8

A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().

Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,

	https://bugzilla.redhat.com/show_bug.cgi?id=748516

Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-09 08:32:26 +01:00
Mark Langsdorf
2ded6e6a94 x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked
When HPET is operating in RTC mode, the TN_ENABLE bit on timer1
controls whether the HPET or the RTC delivers interrupts to irq8. When
the system goes into suspend, the RTC driver sends a signal to the
HPET driver so that the HPET releases control of irq8, allowing the
RTC to wake the system from suspend. The switchover is accomplished by
a write to the HPET configuration registers which currently only
occurs while servicing the HPET interrupt.

On some systems, I have seen the system suspend before an HPET
interrupt occurs, preventing the write to the HPET configuration
register and leaving the HPET in control of the irq8. As the HPET is
not active during suspend, it does not generate a wake signal and RTC
alarms do not work.

This patch forces the HPET driver to immediately transfer control of
the irq8 channel to the RTC instead of waiting until the next
interrupt event.

Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2011-12-08 21:47:22 +01:00
Gleb Natapov
b3d9468a8b perf, x86: Expose perf capability to other modules
KVM needs to know perf capability to decide which PMU it can expose to a
guest.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-8-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 20:41:08 +01:00
Peter Zijlstra
c1d6f42f1a perf, x86: Implement arch event mask as quirk
Implement the disabling of arch events as a quirk so that we can print
a message along with it. This creates some visibility into the problem
space and could allow us to work on adding more work-around like the
AAJ80 one.

Requested-by: Ingo Molnar <mingo@elte.hu>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-wcja2z48wklzu1b0nkz0a5y7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 20:41:06 +01:00
Gleb Natapov
ffb871bc91 x86, perf: Disable non available architectural events
Intel CPUs report non-available architectural events in cpuid leaf
0AH.EBX. Use it to disable events that are not available according
to CPU.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-7-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 20:41:05 +01:00
Peter Zijlstra
9cdbe1cbac jump_label, x86: Fix section mismatch
WARNING: arch/x86/kernel/built-in.o(.text+0x4c71): Section mismatch in
reference from the function arch_jump_label_transform_static() to the
function .init.text:text_poke_early()
The function arch_jump_label_transform_static() references
the function __init text_poke_early().
This is often because arch_jump_label_transform_static lacks a __init
annotation or the annotation of text_poke_early is wrong.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/n/tip-9lefe89mrvurrwpqw5h8xm8z@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 20:41:02 +01:00
Alan Cox
4e2b1c4f56 x86/intel_mid: Kconfig select fix
If we select a symbol it should have a type declared first
otherwise in some situations the config tools get upset. They
are currently perhaps a bit too resilient which is why this
wasn't noticed initially.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20111206132811.4041.32549.stgit@bob.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 14:40:50 +01:00
Alan Cox
dd13752537 x86/intel_mid: Fix the Kconfig for MID selection
We currently fail to build on CONFIG_X86_INTEL_MID=y and
CONFIG_X86_MRST unset.

We could build all the bits to make generic MID work if you
picked MID platform alone but that's really silly. Instead use
select and two variables.

This looks a bit daft right now but once we add a Medfield
selection it'll start to look a good deal more sensible.

Reported-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20111205231433.28811.51297.stgit@bob.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 11:28:36 +01:00
Peter Zijlstra
4defea8559 perf, x86: Prefer fixed-purpose counters when scheduling
This avoids a scheduling failure for cases like:

  cycles, cycles, instructions, instructions (on Core2)

Which would end up being programmed like:

  PMC0, PMC1, FP-instructions, fail

Because all events will have the same weight.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-8tnwb92asqj7xajqqoty4gel@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 08:33:58 +01:00
Robert Richter
bc1738f6ee perf, x86: Fix event scheduler for constraints with overlapping counters
The current x86 event scheduler fails to resolve scheduling problems
of certain combinations of events and constraints. This happens if the
counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight, e.g. constraints
of the AMD family 15h pmu:

                        counter mask    weight

 amd_f15_PMC30          0x09            2  <--- overlapping counters
 amd_f15_PMC20          0x07            3
 amd_f15_PMC53          0x38            3

The scheduler does not find then an existing solution. Here is an
example:

 event code     counter         failure         possible solution

 0x02E          PMC[3,0]        0               3
 0x043          PMC[2:0]        1               0
 0x045          PMC[2:0]        2               1
 0x046          PMC[2:0]        FAIL            2

The event scheduler may not select the correct counter in the first
cycle because it needs to know which subsequent events will be
scheduled. It may fail to schedule the events then.

To solve this, we now save the scheduler state of events with
overlapping counter counstraints.  If we fail to schedule the events
we rollback to those states and try to use another free counter.

Constraints with overlapping counters are marked with a new introduced
overlap flag. We set the overlap flag for such constraints to give the
scheduler a hint which events to select for counter rescheduling. The
EVENT_CONSTRAINT_OVERLAP() macro can be used for this.

Care must be taken as the rescheduling algorithm is O(n!) which will
increase scheduling cycles for an over-commited system dramatically.
The number of such EVENT_CONSTRAINT_OVERLAP() macros and its counter
masks must be kept at a minimum. Thus, the current stack is limited to
2 states to limit the number of loops the algorithm takes in the worst
case.

On systems with no overlapping-counter constraints, this
implementation does not increase the loop count compared to the
previous algorithm.

V2:
* Renamed redo -> overlap.
* Reimplementation using perf scheduling helper functions.

V3:
* Added WARN_ON_ONCE() if out of save states.
* Changed function interface of perf_sched_restore_state() to use bool
  as return value.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 08:33:56 +01:00
Robert Richter
1e2ad28f80 perf, x86: Implement event scheduler helper functions
This patch introduces x86 perf scheduler code helper functions. We
need this to later add more complex functionality to support
overlapping counter constraints (next patch).

The algorithm is modified so that the range of weight values is now
generated from the constraints. There shouldn't be other functional
changes.

With the helper functions the scheduler is controlled. There are
functions to initialize, traverse the event list, find unused counters
etc. The scheduler keeps its own state.

V3:
* Added macro for_each_set_bit_cont().
* Changed functions interfaces of perf_sched_find_counter() and
  perf_sched_next_event() to use bool as return value.
* Added some comments to make code better understandable.

V4:
* Fix broken event assignment if weight of the first event is not
  wmin (perf_sched_init()).

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 08:33:54 +01:00
Ingo Molnar
d6c1c49de5 Merge branch 'perf/urgent' into perf/core
Merge reason: Add these cherry-picked commits so that future changes
              on perf/core don't conflict.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 06:43:49 +01:00
Linus Torvalds
45e713efe2 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  intr_remapping: Fix section mismatch in ir_dev_scope_init()
  intel-iommu: Fix section mismatch in dmar_parse_rmrr_atsr_dev()
  x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions
  x86, AMD: Correct align_va_addr documentation
  x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms
  x86/mrst: Battery fixes
  x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode
  x86: Fix "Acer Aspire 1" reboot hang
  x86/mtrr: Resolve inconsistency with Intel processor manual
  x86: Document rdmsr_safe restrictions
  x86, microcode: Fix the failure path of microcode update driver init code
  Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup
  x86/mpparse: Account for bus types other than ISA and PCI
  x86, mrst: Change the pmic_gpio device type to IPC
  mrst: Added some platform data for the SFI translations
  x86,mrst: Power control commands update
  x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot
  x86, UV: Fix UV2 hub part number
  x86: Add user_mode_vm check in stack_overflow_check
2011-12-05 16:54:15 -08:00
Linus Torvalds
232ea34455 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix loss of notification with multi-event
  perf, x86: Force IBS LVT offset assignment for family 10h
  perf, x86: Disable PEBS on SandyBridge chips
  trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter
  perf session: Fix crash with invalid CPU list
  perf python: Fix undefined symbol problem
  perf/x86: Enable raw event access to Intel offcore events
  perf: Don't use -ENOSPC for out of PMU resources
  perf: Do not set task_ctx pointer in cpuctx if there are no events in the context
  perf/x86: Fix PEBS instruction unwind
  oprofile, x86: Fix crash when unloading module (nmi timer mode)
  oprofile: Fix crash when unloading module (hr timer mode)
2011-12-05 16:54:00 -08:00
Linus Torvalds
7125faceab Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched, x86: Avoid unnecessary overflow in sched_clock
  sched: Fix buglet in return_cfs_rq_runtime()
  sched: Avoid SMT siblings in select_idle_sibling() if possible
  sched: Set the command name of the idle tasks in SMP kernels
  sched, rt: Provide means of disabling cross-cpu bandwidth sharing
  sched: Document wait_for_completion_*() return values
  sched_fair: Fix a typo in the comment describing update_sd_lb_stats
  sched: Add a comment to effective_load() since it's a pain
2011-12-05 16:50:24 -08:00
Andreas Herrmann
f62ef5f3e9 x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions
I've received complaints that the numa_node attribute for family
15h model 00-0fh (e.g. Interlagos) northbridge functions shows
-1 instead of the proper node ID.

Correct this with attached quirks (similar to quirks for other
AMD CPU families used in multi-socket systems).

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 18:13:11 +01:00
Mathias Nyman
35d4769962 x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms
Intel MID x86 platforms have a memory mapped virtual RTC
instead.  No MID platform have the default ports (and
accessing them may do weird stuff).

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: feng.tang@intel.com
Cc: Feng Tang <feng.tang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 17:09:21 +01:00
Konrad Rzeszutek Wilk
2cd1c8d4dc x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode
Fix an outstanding issue that has been reported since 2.6.37.
Under a heavy loaded machine processing "fork()" calls could
crash with:

BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000 Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
 [<c01ac222>] ? __swap_duplicate+0xc2/0x160
 [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
 [<c01ac2e4>] ? swap_duplicate+0x14/0x40
 [<c01a0a6b>] ? copy_pte_range+0x45b/0x500
 [<c01a0ca5>] ? copy_page_range+0x195/0x200
 [<c01328c6>] ? dup_mmap+0x1c6/0x2c0
 [<c0132cf8>] ? dup_mm+0xa8/0x130
 [<c013376a>] ? copy_process+0x98a/0xb30
 [<c013395f>] ? do_fork+0x4f/0x280
 [<c01573b3>] ? getnstimeofday+0x43/0x100
 [<c010f770>] ? sys_clone+0x30/0x40
 [<c06c048d>] ? ptregs_clone+0x15/0x48
 [<c06bfb71>] ? syscall_call+0x7/0xb

The problem is that in copy_page_range() we turn lazy mode on,
and then in swap_entry_free() we call swap_count_continued()
which ends up in:

         map = kmap_atomic(page, KM_USER0) + offset;

and then later we touch *map.

Since we are running in batched mode (lazy) we don't actually
set up the PTE mappings and the kmap_atomic is not done
synchronously and ends up trying to dereference a page that has
not been set.

Looking at kmap_atomic_prot_pfn(), it uses
'arch_flush_lazy_mmu_mode' and doing the same in
kmap_atomic_prot() and __kunmap_atomic() makes the problem go
away.

Interestingly, commit b8bcfe997e ("x86/paravirt: remove lazy
mode in interrupts") removed part of this to fix an interrupt
issue - but it went to far and did not consider this scenario.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 17:06:34 +01:00
Ingo Molnar
f1b23714cb Merge branch 'ucode' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent 2011-12-05 16:38:51 +01:00
Peter Chubb
1ef0389096 x86: Fix "Acer Aspire 1" reboot hang
Looks like on some Acer Aspire 1s with older bioses, reboot via bios
fails.  It works on my machine, (with BIOS version 0.3310) but
not on some others (BIOS version 0.3309).

There's a log of problems at:

  https://bbs.archlinux.org/viewtopic.php?id=124136

This patch adds a different callback to the reboot quirk table,
to allow rebooting via keybaord controller.

Reported-by: Uroš Vampl <mobile.leecher@gmail.com>
Tested-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 15:06:17 +01:00
Ajaykumar Hotchandani
8dbf4a3003 x86/mtrr: Resolve inconsistency with Intel processor manual
Following is from Notes of section 11.5.3 of Intel processor
manual available at:

  http://www.intel.com/Assets/PDF/manual/325384.pdf

For the Pentium 4 and Intel Xeon processors, after the sequence of
steps given above has been executed, the cache lines containing the
code between the end of the WBINVD instruction and before the
MTRRS have actually been disabled may be retained in the cache
hierarchy. Here, to remove code from the cache completely, a
second WBINVD instruction must be executed after the MTRRs have
been disabled.

This patch provides resolution for that.

Ideally, I will like to make changes only for Pentium 4 and Xeon
processors. But, I am not finding easier way to do it.
And, extra wbinvd() instruction does not hurt much for other
processors.

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Link: http://lkml.kernel.org/r/4EBD1CC5.3030008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 15:06:15 +01:00
Masami Hiramatsu
9dde9dc0a8 x86/tools: Add decoded instruction dump mode
Add instruction dump mode to insn_sanity tool for
checking decoder really decoded instructions.

This mode is enabled when passing double -v (-vv) to
insn_sanity. It is useful for who wants to check whether
the decoder can decode some instructions correctly.
e.g.
 $ echo 0f 73 10 11 | ./insn_sanity -y -vv -i -
 Instruction = {
        .prefixes = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .rex_prefix = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .vex_prefix = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .opcode = {
                .value = 29455, bytes[] = {f, 73, 0, 0},
                .got = 1, .nbytes = 2},
        .modrm = {
                .value = 16, bytes[] = {10, 0, 0, 0},
                .got = 1, .nbytes = 1},
        .sib = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .displacement = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .immediate1 = {
                .value = 17, bytes[] = {11, 0, 0, 0},
                .got = 1, .nbytes = 1},
        .immediate2 = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 0, .nbytes = 0},
        .attr = 44800, .opnd_bytes = 4, .addr_bytes = 8,
        .length = 4, .x86_64 = 1, .kaddr = 0x7fff0f7d9430}
 Success: decoded and checked 1 given instructions with 0 errors (seed:0x0)

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120603.15475.91192.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 14:53:23 +01:00
Masami Hiramatsu
a9c373d033 x86: Update instruction decoder to support new AVX formats
Since new Intel software developers manual introduces
new format for AVX instruction set (including AVX2),
it is important to update x86-opcode-map.txt to fit
those changes.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120557.15475.13236.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 14:53:21 +01:00
Masami Hiramatsu
e70825fc51 x86/tools: Fix insn_sanity message outputs
Fix x86 instruction decoder test to dump all error messages
to stderr and others to stdout.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120550.15475.70149.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 14:53:19 +01:00
Masami Hiramatsu
bfbe9015de x86/tools: Fix instruction decoder message output
Fix instruction decoder test (insn_sanity), so that
it doesn't show both info and error messages twice on
same instruction. (In that case, show only error message)

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120545.15475.7928.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 14:53:17 +01:00
Masami Hiramatsu
130b78b2bf x86: Fix instruction decoder to handle grouped AVX instructions
For reducing memory usage of attribute table, x86 instruction
decoder puts "Group" attribute only on "no-last-prefix"
attribute table (same as vex_p == 0 case).

Thus, the decoder should look no-last-prefix table first, and
then only if it is not a group, move on to "with-last-prefix"
table (vex_p != 0).

However, current implementation, inat_get_avx_attribute()
looks with-last-prefix directly. So, when decoding
a grouped AVX instruction, the decoder fails to find correct
group because there is no "Group" attribute on the table.
This ends up with the mis-decoding of instructions, as Ingo
reported in http://thread.gmane.org/gmane.linux.kernel/1214103

This patch fixes it to check no-last-prefix table first
even if that is an AVX instruction, and get an attribute from
"with last-prefix" table only if that is not a group.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120539.15475.91428.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 14:53:15 +01:00
Masami Hiramatsu
1056c3e916 x86/tools: Fix Makefile to build all test tools
Fix arch/x86/tools/Makefile to compile both test tools
correctly. This bug leads build error.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120533.15475.62047.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 14:53:13 +01:00
Borislav Petkov
ce37defc0f x86: Document rdmsr_safe restrictions
Recently, I got bitten by using rdmsr_safe too early in the boot
process. Document its shortcomings for future reference.

Link: http://lkml.kernel.org/r/4ED5B70F.606@lwfinger.net
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2011-12-05 14:28:37 +01:00
Srivatsa S. Bhat
bd39906397 x86, microcode: Fix the failure path of microcode update driver init code
The microcode update driver's initialization code does not handle
failures correctly. This patch fixes this issue.

Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20111107123530.12164.31227.stgit@srivatsabhat.in.ibm.com
Link: http://lkml.kernel.org/r/4ED8E2270200007800065120@nat28.tlf.novell.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2011-12-05 14:21:01 +01:00
Prarit Bhargava
644ddf588f Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup
TAINT_FIRMWARE_WORKAROUND should be set when an MTRR fixup
is done.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1318958650-12447-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 13:48:50 +01:00
Bjorn Helgaas
9e6866686b x86/mpparse: Account for bus types other than ISA and PCI
In commit f8924e770e ("x86: unify mp_bus_info"), the 32-bit
and 64-bit versions of MP_bus_info were rearranged to match each
other better.  Unfortunately it introduced a regression: prior
to that change we used to always set the mp_bus_not_pci bit,
then clear it if we found a PCI bus.  After it, we set
mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave
it alone otherwise.

In the cases of ISA and PCI, there's not much difference.  But
ISA is not the only non-PCI bus, so it's better to always set
mp_bus_not_pci and clear it only for PCI.

Without this change, Dan's Dell PowerEdge 4200 panics on boot
with a log indicating interrupt routing trouble unless the
"noapic" option is supplied.  With this change, the machine
boots reliably without "noapic".

Fixes http://bugs.debian.org/586494

Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org	# 2.6.26+
Cc: Dan McGrath <troubledaemon@gmail.com>
Cc: Alexey Starikovskiy <aystarik@gmail.com>
[jrnieder@gmail.com: clarified commit message]
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 13:46:27 +01:00
Feng Tang
efa2212685 x86, mrst: Change the pmic_gpio device type to IPC
In latest firmware's SFI tables, pmic_gpio has been set to
IPC type of device, so we need handle it too.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 12:42:15 +01:00
Jekyll Lai
28744b3e9c mrst: Added some platform data for the SFI translations
Add SFI glue for the following devices:

tca6416: a gpio expander compatible with max7315
mpu3050: gyro sensor

Both of these actual drivers are already upstream

Signed-off-by: Jekyll Lai <jekyll_lai@wistron.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 12:42:13 +01:00
Jacob Pan
48bc556210 x86,mrst: Power control commands update
On the Intel MID devices SCU commands are issued to manage power
off and the like. We need to issue different ones for
non-Lincroft based devices.

Signed-off-by: Alek Du <alek.du@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 12:42:11 +01:00
Rafael J. Wysocki
6be30bb7d7 x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot
Dell OptiPlex 990 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/201111160019.51303.rjw@sisk.pl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 12:20:43 +01:00
Jack Steiner
b495e039b4 x86, UV: Fix UV2 hub part number
There was a mixup when the SGI UV2 hub chip was sent to be
fabricated, and it ended up with the wrong part number in the
HRP_NODE_ID mmr. Future versions of the chip will (may) have the
correct part number. Change the UV infrastructure to recognize
both part numbers as valid IDs of a UV2 hub chip.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/20111129210058.GA20452@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05 11:49:52 +01:00
Mitsuo Hayasaka
69682b625a x86: Add user_mode_vm check in stack_overflow_check
The kernel stack overflow is checked in stack_overflow_check(),
which may wrongly detect the overflow if the stack pointer in
user space points to the kernel stack intentionally or
accidentally. So, the actual overflow is never detected after
this misdetection because WARN_ONCE() is used on the detection
of it.

This patch adds user-mode-vm checking before it to avoid this
problem and bails out early if the user stack is used.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060821.11076.55315.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
2011-12-05 11:28:25 +01:00