Commit Graph

245870 Commits

Author SHA1 Message Date
Rafael J. Wysocki
6538df8019 PM: Introduce generic prepare and complete callbacks for subsystems
Introduce generic .prepare() and .complete() power management
callbacks, currently missing, that can be used by subsystems and
power domains and export them.  Provide NULL definitions of all
the generic system sleep callbacks for CONFIG_PM_SLEEP unset.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:26:21 +02:00
Rafael J. Wysocki
91e7c75ba9 PM: Allow drivers to allocate memory from .prepare() callbacks safely
If device drivers allocate substantial amounts of memory (above 1 MB)
in their hibernate .freeze() callbacks (or in their legacy suspend
callbcks during hibernation), the subsequent creation of hibernate
image may fail due to the lack of memory.  This is the case, because
the drivers' .freeze() callbacks are executed after the hibernate
memory preallocation has been carried out and the preallocated amount
of memory may be too small to cover the new driver allocations.
Unfortunately, the drivers' .prepare() callbacks also are executed
after the hibernate memory preallocation has completed, so they are
not suitable for allocating additional memory either.  Thus the only
way a driver can safely allocate memory during hibernation is to use
a hibernate/suspend notifier.  However, the notifiers are called
before the freezing of user space and the drivers wanting to use them
for allocating additional memory may not know how much memory needs
to be allocated at that point.

To let device drivers overcome this difficulty rework the hibernation
sequence so that the memory preallocation is carried out after the
drivers' .prepare() callbacks have been executed, so that the
.prepare() callbacks can be used for allocating additional memory
to be used by the drivers' .freeze() callbacks.  Update documentation
to match the new behavior of the code.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:26:00 +02:00
Rafael J. Wysocki
c650da23d5 PM: Remove CONFIG_PM_VERBOSE
Now that we have CONFIG_DYNAMIC_DEBUG there is no need for yet
another flag causing dev_dbg() and pr_debug() statements in the
core PM code to produce output.  Moreover, CONFIG_PM_VERBOSE
causes so much output to be generated that it's not really useful
and almost no one sets it.

References: https://bugzilla.kernel.org/show_bug.cgi?id=23182
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:25:10 +02:00
Rafael J. Wysocki
290c748725 Merge branch 'power-domains' into for-linus
* power-domains:
  PM: Fix build issue in clock_ops.c for CONFIG_PM_RUNTIME unset
  PM: Revert "driver core: platform_bus: allow runtime override of dev_pm_ops"
  OMAP1 / PM: Use generic clock manipulation routines for runtime PM
  PM / Runtime: Generic clock manipulation rountines for runtime PM (v6)
  PM / Runtime: Add subsystem data field to struct dev_pm_info
  OMAP2+ / PM: move runtime PM implementation to use device power domains
  PM / Platform: Use generic runtime PM callbacks directly
  shmobile: Use power domains for platform runtime PM
  PM: Export platform bus type's default PM callbacks
  PM: Make power domain callbacks take precedence over subsystem ones
2011-05-17 23:23:46 +02:00
Rafael J. Wysocki
2d2a9163bd Merge branch 'syscore' into for-linus
* syscore:
  PM: Remove sysdev suspend, resume and shutdown operations
  PM / PowerPC: Use struct syscore_ops instead of sysdevs for PM
  PM / UNICORE32: Use struct syscore_ops instead of sysdevs for PM
  PM / AVR32: Use struct syscore_ops instead of sysdevs for PM
  PM / Blackfin: Use struct syscore_ops instead of sysdevs for PM
  ARM / Samsung: Use struct syscore_ops for "core" power management
  ARM / PXA: Use struct syscore_ops for "core" power management
  ARM / SA1100: Use struct syscore_ops for "core" power management
  ARM / Integrator: Use struct syscore_ops for core PM
  ARM / OMAP: Use struct syscore_ops for "core" power management
  ARM: Use struct syscore_ops instead of sysdevs for PM in common code
2011-05-17 23:23:40 +02:00
Rafael J. Wysocki
1c1be3a949 Revert "PM / Hibernate: Reduce autotuned default image size"
This reverts commit bea3864fb6
(PM / Hibernate: Reduce autotuned default image size), because users
are now able to resolve the issue this commit was supposed to address
in a different way (i.e. by using the new /sys/power/reserved_size
interface).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:19 +02:00
Rafael J. Wysocki
ddeb648708 PM / Hibernate: Add sysfs knob to control size of memory for drivers
Martin reports that on his system hibernation occasionally fails due
to the lack of memory, because the radeon driver apparently allocates
too much of it during the device freeze stage.  It turns out that the
amount of memory allocated by radeon during hibernation (and
presumably during system suspend too) depends on the utilization of
the GPU (e.g. hibernating while there are two KDE 4 sessions with
compositing enabled causes radeon to allocate more memory than for
one KDE 4 session).

In principle it should be possible to use image_size to make the
memory preallocation mechanism free enough memory for the radeon
driver, but in practice it is not easy to guess the right value
because of the way the preallocation code uses image_size.  For this
reason, it seems reasonable to allow users to control the amount of
memory reserved for driver allocations made after the hibernate
preallocation, which currently is constant and amounts to 1 MB.

Introduce a new sysfs file, /sys/power/reserved_size, whose value
will be used as the amount of memory to reserve for the
post-preallocation reservations made by device drivers, in bytes.
For backwards compatibility, set its default (and initial) value to
the currently used number (1 MB).

References: https://bugzilla.kernel.org/show_bug.cgi?id=34102
Reported-and-tested-by: Martin Steigerwald <Martin@Lichtvoll.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:19 +02:00
Eric Dumazet
13e3813656 PM / Wakeup: Remove useless synchronize_rcu() call
wakeup_source_add() adds an item into wakeup_sources list.

There is no need to call synchronize_rcu() at this point.

Its only needed in wakeup_source_remove()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:19 +02:00
Kay Sievers
13d53f8775 kmod: always provide usermodehelper_disable()
We need to prevent kernel-forked processes during system poweroff.
Such processes try to access the filesystem whose disks we are
trying to shutdown at the same time. This causes delays and exceptions
in the storage drivers.

A follow-up patch will add these calls and need usermodehelper_disable()
also on systems without suspend support.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:18 +02:00
Amerigo Wang
c3b0795c98 PM / ACPI: Remove acpi_sleep=s4_nonvs
acpi_sleep=s4_nonvs is superseded by acpi_sleep=nonvs, so remove it.

Signed-off-by: WANG Cong <amwang@redhat.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:18 +02:00
Rafael J. Wysocki
e762318baa PM / Wakeup: Fix build warning related to the "wakeup" sysfs file
The "wakeup" device sysfs file is only created if CONFIG_PM_SLEEP
is set, so put it under CONFIG_PM_SLEEP and make a build warning
related to it go away.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-05-17 23:19:18 +02:00
Rafael J. Wysocki
a144c6a6c9 PM: Print a warning if firmware is requested when tasks are frozen
Some drivers erroneously use request_firmware() from their ->resume()
(or ->thaw(), or ->restore()) callbacks, which is not going to work
unless the firmware has been built in.  This causes system resume to
stall until the firmware-loading timeout expires, which makes users
think that the resume has failed and reboot their machines
unnecessarily.  For this reason, make _request_firmware() print a
warning and return immediately with error code if it has been called
when tasks are frozen and it's impossible to start any new usermode
helpers.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Reviewed-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
2011-05-17 23:19:17 +02:00
Rafael J. Wysocki
e1866b33b1 PM / Runtime: Rework runtime PM handling during driver removal
The driver core tries to prevent race conditions between runtime PM
and driver removal from happening by incrementing the runtime PM
usage counter of the device and executing pm_runtime_barrier() before
running the bus notifier and the ->remove() callbacks provided by the
device's subsystem or driver.  This guarantees that, if a future
runtime suspend of the device has been scheduled or a runtime resume
or idle request has been queued up right before the driver removal,
it will be canceled or waited for to complete and no other
asynchronous runtime suspend or idle requests for the device will be
put into the PM workqueue until the ->remove() callback returns.
However, it doesn't prevent resume requests from being queued up
after pm_runtime_barrier() has been called and it doesn't prevent
pm_runtime_resume() from executing the device subsystem's runtime
resume callback.  Morever, it prevents the device's subsystem or
driver from putting the device into the suspended state by calling
pm_runtime_suspend() from its ->remove() routine.  This turns out to
be a major inconvenience for some subsystems and drivers that want to
leave the devices they handle in the suspended state.

To really prevent runtime PM callbacks from racing with the bus
notifier callback in __device_release_driver(), which is necessary,
because the notifier is used by some subsystems to carry out
operations affecting the runtime PM functionality, use
pm_runtime_get_sync() instead of the combination of
pm_runtime_get_noresume() and pm_runtime_barrier().  This will resume
the device if it's in the suspended state and will prevent it from
being suspended again until pm_runtime_put_*() is called.

To allow subsystems and drivers to put devices into the suspended
state by calling pm_runtime_suspend() from their ->remove() routines,
execute pm_runtime_put_sync() after running the bus notifier in
__device_release_driver().  This will require subsystems and drivers
to make their ->remove() callbacks avoid races with runtime PM
directly, but it will allow of more flexibility in the handling of
devices during the removal of their drivers.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:17 +02:00
Mike Frysinger
ee940d8dcc Freezer: Use SMP barriers
The freezer processes are dealing with multiple threads running
simultaneously, and on a UP system, the memory reads/writes do
not need barriers to keep things in sync.  These are only needed
on SMP systems, so use SMP barriers instead.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:17 +02:00
MyungJoo Ham
3c43193608 PM / Suspend: Do not ignore error codes returned by suspend_enter()
The current implementation of suspend-to-RAM returns 0 if there is an
error from suspend_enter(), because suspend_devices_and_enter() ignores
the return value from suspend_enter().  This patch addresses this issue
and properly keep the error return from suspend_enter() and let
suspend_devices_and_enter relay the error return.

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-17 23:19:16 +02:00
Jeff Layton
11379b5e33 cifs: fix cifsConvertToUCS() for the mapchars case
As Metze pointed out, commit 84cdf74e broke mapchars option:

    Commit "cifs: fix unaligned accesses in cifsConvertToUCS"
    (84cdf74e80) does multiple steps
    in just one commit (moving the function and changing it without
    testing).

    put_unaligned_le16(temp, &target[j]); is never called for any
    codepoint the goes via the 'default' switch statement. As a result
    we put just zero (or maybe uninitialized) bytes into the target
    buffer.

His proposed patch looks correct, but doesn't apply to the current head
of the tree. This patch should also fix it.

Cc: <stable@kernel.org> # .38.x: 581ade4: cifs: clean up various nits in unicode routines (try #2)
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 20:54:04 +00:00
Fenghua Yu
2494b030ba x86, cpufeature: Fix cpuid leaf 7 feature detection
CPUID leaf 7, subleaf 0 returns the maximum subleaf in EAX, not the
number of subleaves.  Since so far only subleaf 0 is defined (and only
the EBX bitfield) we do not need to qualify the test.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305660806-17519-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org> 2.6.36..39
2011-05-17 13:36:29 -07:00
Jeff Layton
221d1d7972 cifs: add fallback in is_path_accessible for old servers
The is_path_accessible check uses a QPathInfo call, which isn't
supported by ancient win9x era servers. Fall back to an older
SMBQueryInfo call if it fails with the magic error codes.

Cc: stable@kernel.org
Reported-and-Tested-by: Sandro Bonazzola <sandro.bonazzola@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 18:51:14 +00:00
Stephane Eranian
94692349c4 perf: Fix multi-event parsing bug
This patch fixes an issue with event parsing.
The following commit appears to have broken the
ability to specify a comma separated list of events:

   commit ceb53fbf6d
   Author: Ingo Molnar <mingo@elte.hu>
   Date:   Wed Apr 27 04:06:33 2011 +0200

       perf stat: Fail more clearly when an invalid modifier is specified

This patch fixes this while preserving the desired effect:

$ perf stat -e instructions:u,instructions:k ls /dev/null /dev/null

 Performance counter stats for 'ls /dev/null':

            365956 instructions:u           #    0.00  insns per cycle
            731806 instructions:k           #    0.00  insns per cycle

        0.001108862  seconds time elapsed

$ perf stat -e task-clock-msecs true
invalid event modifier: '-msecs'
Run 'perf list' for a list of valid events and modifiers

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: acme@redhat.com
Cc: peterz@infradead.org
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20110517133619.GA6999@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-17 20:45:36 +02:00
Linus Torvalds
a085963a27 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tick: Clear broadcast active bit when switching to oneshot
  rtc: mc13xxx: Don't call rtc_device_register while holding lock
  rtc: rp5c01: Initialize drvdata before registering device
  rtc: pcap: Initialize drvdata before registering device
  rtc: msm6242: Initialize drvdata before registering device
  rtc: max8998: Initialize drvdata before registering device
  rtc: max8925: Initialize drvdata before registering device
  rtc: m41t80: Initialize clientdata before registering device
  rtc: ds1286: Initialize drvdata before registering device
  rtc: ep93xx: Initialize drvdata before registering device
  rtc: davinci: Initialize drvdata before registering device
  rtc: mxc: Initialize drvdata before registering device
  clocksource: Install completely before selecting
2011-05-17 08:02:04 -07:00
Borislav Petkov
14fb57dccb x86, AMD: Fix ARAT feature setting again
Trying to enable the local APIC timer on early K8 revisions
uncovers a number of other issues with it, in conjunction with
the C1E enter path on AMD. Fixing those causes much more churn
and troubles than the benefit of using that timer brings so
don't enable it on K8 at all, falling back to the original
functionality the kernel had wrt to that.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Cc: Boris Ostrovsky <Boris.Ostrovsky@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Nick Bowler <nbowler@elliptictech.com>
Cc: Joerg-Volker-Peetz <jvpeetz@web.de>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1305636919-31165-3-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-17 15:28:34 +02:00
Borislav Petkov
328935e634 Revert "x86, AMD: Fix APIC timer erratum 400 affecting K8 Rev.A-E processors"
This reverts commit e20a2d205c, as it crashes
certain boxes with specific AMD CPU models.

Moving the lower endpoint of the Erratum 400 check to accomodate
earlier K8 revisions (A-E) opens a can of worms which is simply
not worth to fix properly by tweaking the errata checking
framework:

* missing IntPenging MSR on revisions < CG cause #GP:

http://marc.info/?l=linux-kernel&m=130541471818831

* makes earlier revisions use the LAPIC timer instead of the C1E
idle routine which switches to HPET, thus not waking up in
deeper C-states:

http://lkml.org/lkml/2011/4/24/20

Therefore, leave the original boundary starting with K8-revF.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-17 15:28:33 +02:00
Ingo Molnar
86b9523ab1 Merge branch 'gart/rename' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux-2.6-iommu into core/iommu 2011-05-17 14:39:00 +02:00
Jens Axboe
9937a5e2f3 scsi: remove performance regression due to async queue run
Commit c21e6beb removed our queue request_fn re-enter
protection, and defaulted to always running the queues from
kblockd to be safe. This was a known potential slow down,
but should be safe.

Unfortunately this is causing big performance regressions for
some, so we need to improve this logic. Looking into the details
of the re-enter, the real issue is on requeue of requests.

Requeue of requests upon seeing a BUSY condition from the device
ends up re-running the queue, causing traces like this:

scsi_request_fn()
        scsi_dispatch_cmd()
                scsi_queue_insert()
                        __scsi_queue_insert()
                                scsi_run_queue()
					scsi_request_fn()
						...

potentially causing the issue we want to avoid. So special
case the requeue re-run of the queue, but improve it to offload
the entire run of local queue and starved queue from a single
workqueue callback. This is a lot better than potentially
kicking off a workqueue run for each device seen.

This also fixes the issue of the local device going into recursion,
since the above mentioned commit never moved that queue run out
of line.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-17 11:04:44 +02:00
Linus Torvalds
c1d10d18c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  net: Change netdev_fix_features messages loglevel
  vmxnet3: Fix inconsistent LRO state after initialization
  sfc: Fix oops in register dump after mapping change
  IPVS: fix netns if reading ip_vs_* procfs entries
  bridge: fix forwarding of IPv6
2011-05-16 18:38:08 -07:00
Linus Torvalds
477de0de44 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
  Revert "mmc: fix a race between card-detect rescan and clock-gate work instances"
2011-05-16 18:36:47 -07:00
Randy Dunlap
b5e6ab589d mm: fix kernel-doc warning in page_alloc.c
Fix new kernel-doc warning in mm/page_alloc.c:

  Warning(mm/page_alloc.c:2370): No description found for parameter 'nid'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-16 18:34:30 -07:00
Yinghai Lu
93d2175d3d PCI: Clear bridge resource flags if requested size is 0
During pci remove/rescan testing found:

  pci 0000:c0:03.0: PCI bridge to [bus c4-c9]
  pci 0000:c0:03.0:   bridge window [io  0x1000-0x0fff]
  pci 0000:c0:03.0:   bridge window [mem 0xf0000000-0xf00fffff]
  pci 0000:c0:03.0:   bridge window [mem 0xfc180000000-0xfc197ffffff 64bit pref]
  pci 0000:c0:03.0: device not available (can't reserve [io  0x1000-0x0fff])
  pci 0000:c0:03.0: Error enabling bridge (-22), continuing
  pci 0000:c0:03.0: enabling bus mastering
  pci 0000:c0:03.0: setting latency timer to 64
  pcieport 0000:c0:03.0: device not available (can't reserve [io  0x1000-0x0fff])
  pcieport: probe of 0000:c0:03.0 failed with error -22

This bug was caused by commit c8adf9a3e8 ("PCI: pre-allocate
additional resources to devices only after successful allocation of
essential resources.")

After that commit, pci_hotplug_io_size is changed to additional_io_size
from minium size.  So it will not go through resource_size(res) != 0
path, and will not be reset.

The root cause is: pci_bridge_check_ranges will set RESOURCE_IO flag for
pci bridge, and later if children do not need IO resource.  those bridge
resources will not need to be allocated.  but flags is still there.
that will confuse the the pci_enable_bridges later.

related code:

   static void assign_requested_resources_sorted(struct resource_list *head,
                                    struct resource_list_x *fail_head)
   {
           struct resource *res;
           struct resource_list *list;
           int idx;

           for (list = head->next; list; list = list->next) {
                   res = list->res;
                   idx = res - &list->dev->resource[0];
                   if (resource_size(res) && pci_assign_resource(list->dev, idx)) {
   ...
                           reset_resource(res);
                   }
           }
   }

At last, We have to clear the flags in pbus_size_mem/io when requested
size == 0 and !add_head.  becasue this case it will not go through
adjust_resources_sorted().

Just make size1 = size0 when !add_head. it will make flags get cleared.

At the same time when requested size == 0, add_size != 0, will still
have in head and add_list.  because we do not clear the flags for it.

After this, we will get right result:

  pci 0000:c0:03.0: PCI bridge to [bus c4-c9]
  pci 0000:c0:03.0:   bridge window [io  disabled]
  pci 0000:c0:03.0:   bridge window [mem 0xf0000000-0xf00fffff]
  pci 0000:c0:03.0:   bridge window [mem 0xfc180000000-0xfc197ffffff 64bit pref]
  pci 0000:c0:03.0: enabling bus mastering
  pci 0000:c0:03.0: setting latency timer to 64
  pcieport 0000:c0:03.0: setting latency timer to 64
  pcieport 0000:c0:03.0: irq 160 for MSI/MSI-X
  pcieport 0000:c0:03.0: Signaling PME through PCIe PME interrupt
  pci 0000:c4:00.0: Signaling PME through PCIe PME interrupt
  pcie_pme 0000:c0:03.0:pcie01: service driver pcie_pme loaded
  aer 0000:c0:03.0:pcie02: service driver aer loaded
  pciehp 0000:c0:03.0:pcie04: Hotplug Controller:

v3: more simple fix. also fix one typo in pbus_size_mem

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-16 18:33:35 -07:00
Thomas Gleixner
07f4beb0b5 tick: Clear broadcast active bit when switching to oneshot
The first cpu which switches from periodic to oneshot mode switches
also the broadcast device into oneshot mode. The broadcast device
serves as a backup for per cpu timers which stop in deeper
C-states. To avoid starvation of the cpus which might be in idle and
depend on broadcast mode it marks the other cpus as broadcast active
and sets the brodcast expiry value of those cpus to the next tick.

The oneshot mode broadcast bit for the other cpus is sticky and gets
only cleared when those cpus exit idle. If a cpu was not idle while
the bit got set in consequence the bit prevents that the broadcast
device is armed on behalf of that cpu when it enters idle for the
first time after it switched to oneshot mode.

In most cases that goes unnoticed as one of the other cpus has usually
a timer pending which keeps the broadcast device armed with a short
timeout. Now if the only cpu which has a short timer active has the
bit set then the broadcast device will not be armed on behalf of that
cpu and will fire way after the expected timer expiry. In the case of
Christians bug report it took ~145 seconds which is about half of the
wrap around time of HPET (the limit for that device) due to the fact
that all other cpus had no timers armed which expired before the 145
seconds timeframe.

The solution is simply to clear the broadcast active bit
unconditionally when a cpu switches to oneshot mode after the first
cpu switched the broadcast device over. It's not idle at that point
otherwise it would not be executing that code.

[ I fundamentally hate that broadcast crap. Why the heck thought some
  folks that when going into deep idle it's a brilliant concept to
  switch off the last device which brings the cpu back from that
  state? ]

Thanks to Christian for providing all the valuable debug information!

Reported-and-tested-by: Christian Hoffmann <email@christianhoffmann.info>
Cc: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1105161105170.3078%40ionos%3E
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-16 23:35:41 +02:00
David Rientjes
dc382fd5bc x86, mm: Allow ZONE_DMA to be configurable
ZONE_DMA is unnecessary for a large number of machines that do not
require less than 32-bit DMA addressing, e.g. ISA legacy DMA or PCI
cards with a restricted DMA address mask.

This patch allows users to disable ZONE_DMA for x86 if they know they
will not be using such devices with their kernel.

This prevents the VM from unnecessarily reserving a ratio of memory
(defaulting to 1/256th of system capacity) with lowmem_reserve_ratio
for such allocations when it will never be used.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1105161353560.4353@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-05-16 14:03:28 -07:00
Ondrej Zary
865be7a810 x86, cpu: Fix detection of Celeron Covington stepping A1 and B0
Steppings A1 and B0 of Celeron Covington are currently misdetected as
Pentium II (Dixon). Fix it by removing the stepping check.

[ hpa: this fixes this specific bug... the CPUID documentation
  specifies that the L2 cache size can disambiguate additional CPUs;
  this patch does not fix that. ]

Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Link: http://lkml.kernel.org/r/201105162138.15416.linux@rainbow-software.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-05-16 13:24:21 -07:00
Michał Mirosław
6f404e441d net: Change netdev_fix_features messages loglevel
Those reduced to DEBUG can possibly be triggered by unprivileged processes
and are nothing exceptional. Illegal checksum combinations can only be
caused by driver bug, so promote those messages to WARN.

Since GSO without SG will now only cause DEBUG message from
netdev_fix_features(), remove the workaround from register_netdevice().

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-16 15:14:21 -04:00
Thomas Jarosch
ebde6f8acb vmxnet3: Fix inconsistent LRO state after initialization
During initialization of vmxnet3, the state of LRO
gets out of sync with netdev->features.

This leads to very poor TCP performance in a IP forwarding
setup and is hitting many VMware users.

Simplified call sequence:
1. vmxnet3_declare_features() initializes "adapter->lro" to true.

2. The kernel automatically disables LRO if IP forwarding is enabled,
so vmxnet3_set_flags() gets called. This also updates netdev->features.

3. Now vmxnet3_setup_driver_shared() is called. "adapter->lro" is still
set to true and LRO gets enabled again, even though
netdev->features shows it's disabled.

Fix it by updating "adapter->lro", too.

The private vmxnet3 adapter flags are scheduled for removal
in net-next, see commit a0d2730c95
"net: vmxnet3: convert to hw_features".

Patch applies to 2.6.37 / 2.6.38 and 2.6.39-rc6.

Please CC: comments.

Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-16 15:05:23 -04:00
Ben Hutchings
867955f568 sfc: Fix oops in register dump after mapping change
Commit 747df2258b ('sfc: Always map MCDI
shared memory as uncacheable') introduced a separate mapping for the
MCDI shared memory (MC_TREG_SMEM).  This means we can no longer easily
include it in the register dump.  Since it is not particularly useful
in debugging, substitute a recognisable dummy value.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-16 15:05:23 -04:00
Martin Schwidefsky
f296388682 ftrace/s390: mcount offset calculation
Do the mcount offset adjustment in the recordmcount.pl/recordmcount.[ch]
at compile time and not in ftrace_call_adjust at run time.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 15:05:06 -04:00
Martin Schwidefsky
521ccb5c4a ftrace/x86: mcount offset calculation
Do the mcount offset adjustment in the recordmcount.pl/recordmcount.[ch]
at compile time and not in ftrace_call_adjust at run time.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:55:57 -04:00
Martin Schwidefsky
07d8b595f3 ftrace/recordmcount: mcount address adjustment
Introduce mcount_adjust{,_32,_64} to the C implementation of
recordmcount analog to $mcount_adjust in the perl script.
The adjustment is added to the address of the relocations
against the mcount symbol. If this adjustment is done by
recordmcount at compile time the ftrace_call_adjust function
can be turned into a nop.

Cc: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:53:22 -04:00
Steven Rostedt
41b402a201 ftrace/recordmcount: Add helper function get_sym_str_and_relp()
The code to get the symbol, string, and relp pointers in the two functions
sift_rel_mcount() and nop_mcount() are identical and also non-trivial.
Moving this duplicate code into a single helper function makes the code
easier to read and more maintainable.

Cc: John Reiser <jreiser@bitwagon.com>
Link: http://lkml.kernel.org/r/20110421023739.723658553@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:48:55 -04:00
Steven Rostedt
37762cb997 ftrace/recordmcount: Remove duplicate code to find mcount symbol
The code in sift_rel_mcount() and nop_mcount() to get the mcount symbol
number is identical. Replace the two locations with a call to a function
that does the work.

Cc: John Reiser <jreiser@bitwagon.com>
Link: http://lkml.kernel.org/r/20110421023739.488093407@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:48:02 -04:00
Steven Rostedt
2895cd2ab8 ftrace/x86: Do not trace .discard.text section
The section called .discard.text has tracing attached to it and is
currently ignored by ftrace. But it does include a call to the mcount
stub. Adding a notrace to the code keeps gcc from adding the useless
mcount caller to it.

Link: http://lkml.kernel.org/r/20110421023739.243651696@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:47:13 -04:00
Steven Rostedt
f0201738b6 ftrace: Avoid recording mcount on .init sections directly
The init and exit sections should not be traced and adding a call to
mcount to them is a waste of text and instruction cache. Have the
macro section attributes include notrace to ignore these functions
for tracing from the build.

Link: http://lkml.kernel.org/r/20110421023738.953028219@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:46:30 -04:00
Steven Rostedt
85356f8022 kbuild/recordmcount: Add RECORDMCOUNT_WARN to warn about mcount callers
When mcount is called in a section that ftrace will not modify it into
a nop, we want to warn about this. But not warn about this always. Now
if the user builds the kernel with the option RECORDMCOUNT_WARN=1 then
the build will warn about mcount callers that are ignored and will just
waste execution time.

Acked-by: Michal Marek <mmarek@suse.cz>
Cc: linux-kbuild@vger.kernel.org
Link: http://lkml.kernel.org/r/20110421023738.714956282@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:45:03 -04:00
Steven Rostedt
dfad3d598c ftrace/recordmcount: Add warning logic to warn on mcount not recorded
There's some sections that should not have mcount recorded and should not have
modifications to the that code. But currently they waste some time by calling
mcount anyway (which simply returns). As the real answer should be to
either whitelist the section or have gcc ignore it fully.

This change adds a option to recordmcount to warn when it finds a section
that is ignored by ftrace but still contains mcount callers. This is not on
by default as developers may not know if the section should be completely
ignored or added to the whitelist.

Cc: John Reiser <jreiser@bitwagon.com>
Link: http://lkml.kernel.org/r/20110421023738.476989377@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:44:20 -04:00
Steven Rostedt
ffd618fa39 ftrace/recordmcount: Make ignored mcount calls into nops at compile time
There are sections that are ignored by ftrace for the function tracing because
the text is in a section that can be removed without notice. The mcount calls
in these sections are ignored and ftrace never sees them. The downside of this
is that the functions in these sections still call mcount. Although the mcount
function is defined in assembly simply as a return, this added overhead is
unnecessary.

The solution is to convert these callers into nops at compile time.
A better solution is to add 'notrace' to the section markers, but as new sections
come up all the time, it would be nice that they are delt with when they
are created.

Later patches will deal with finding these sections and doing the proper solution.

Thanks to H. Peter Anvin for giving me the right nops to use for x86.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Reiser <jreiser@bitwagon.com>
Link: http://lkml.kernel.org/r/20110421023738.237101176@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:43:32 -04:00
Steven Rostedt
8abd5724a7 ftrace/recordmcount: Modify only executable sections
PROGBITS is not enough to determine if the section should be modified
or not. Only process sections that are marked as executable.

Cc: John Reiser <jreiser@bitwagon.com>
Link: http://lkml.kernel.org/r/20110421023737.991485123@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:42:56 -04:00
Steven Rostedt
9f087e7612 ftrace: Add .kprobe.text section to whitelist for recordmcount.c
The .kprobe.text section is safe to modify mcount to nop and tracing.
Add it to the whitelist in recordmcount.c and recordmcount.pl.

Cc: John Reiser <jreiser@bitwagon.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20110421023737.743350547@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:42:15 -04:00
Steven Rostedt
e90b0c8bf2 ftrace/trivial: Clean up record mcount to use Linux switch style
The Linux style for switch statements is:

	switch (var) {
	case x:
		[...]
		break;
	}

Not:
	switch (var) {
	case x: {
		[...]
	} break;

Cc: John Reiser <jreiser@bitwagon.com>
Link: http://lkml.kernel.org/r/20110421023737.523968644@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:41:07 -04:00
Steven Rostedt
dd5477ff3b ftrace/trivial: Clean up recordmcount.c to use Linux style comparisons
The Linux ftrace subsystem style for comparing is:

  var == 1
  var > 0

and not:

  1 == var
  0 < var

It is considered that Linux developers are smart enough not to do the

  if (var = 1)

mistake.

Cc: John Reiser <jreiser@bitwagon.com>
Link: http://lkml.kernel.org/r/20110421023737.290712238@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-16 14:38:51 -04:00
Borislav Petkov
eecaaba5b2 Documentation, ABI: Update L3 cache index disable text
Change contact person to AMD kernel mailing list, update text and
external references, drop "Users:" tag.

Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1305553188-21061-4-git-send-email-bp@amd64.org
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-05-16 11:24:30 -07:00
Frank Arnold
42be450565 x86, AMD, cacheinfo: Fix L3 cache index disable checks
We provide two slots to disable cache indices, and have a check to
prevent both slots to be used for the same index.

If the user disables the same index on different subcaches, both slots
will hold the same index, e.g.

  $ echo 2047 > /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_0
  $ cat /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_0
  2047
  $ echo 1050623 > /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_1
  $ cat /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_1
  2047

due to the fact that the check was looking only at index bits [11:0]
and was ignoring writes to bits outside that range. The more correct
fix is to simply check whether the index is within the bounds of
[0..l3->indices].

While at it, cleanup comments and drop now-unused local macros.

Signed-off-by: Frank Arnold <frank.arnold@amd.com>
Link: http://lkml.kernel.org/r/1305553188-21061-3-git-send-email-bp@amd64.org
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-05-16 11:24:27 -07:00