Commit Graph

14054 Commits

Author SHA1 Message Date
Nigel Cunningham
5a60d6235c PM: Optional beeping during resume from suspend to RAM
Add a feature allowing the user to make the system beep during a resume from
suspend to RAM, on x86_64 and i386.

This is useful for the users with broken resume from RAM, so that they can
verify if the control reaches the kernel after a wake-up event.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
Nick Piggin
83c54070ee mm: fault feedback #2
This patch completes Linus's wish that the fault return codes be made into
bit flags, which I agree makes everything nicer.  This requires
all handle_mm_fault callers to be modified (possibly the modifications
should go further and do things like fault accounting in handle_mm_fault --
however that would be for another patch).
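
As a rough illustration of what "return codes as bit flags" means for a
caller, here is a small Python model.  The flag names and values, and the
account_fault() helper, are assumptions made for this sketch and are not
taken from the patch itself.

```python
from enum import IntFlag


class VmFault(IntFlag):
    """Illustrative fault result bits; names and values are assumptions."""
    OOM = 0x01      # out of memory while handling the fault
    SIGBUS = 0x02   # the access cannot be satisfied
    MAJOR = 0x04    # the fault required I/O (a "major" fault)
    WRITE = 0x08    # a write fault was handled


# Errors are tested with a mask instead of comparing against a single value.
VM_FAULT_ERROR = VmFault.OOM | VmFault.SIGBUS


def account_fault(result, counters):
    """Hypothetical caller-side handling of a bit-flag fault result."""
    if result & VM_FAULT_ERROR:
        return "oom" if result & VmFault.OOM else "sigbus"
    # Non-error bits may be combined freely, e.g. MAJOR | WRITE.
    key = "maj_flt" if result & VmFault.MAJOR else "min_flt"
    counters[key] = counters.get(key, 0) + 1
    return "ok"
```

With flags, one return value can say "major fault" and "write handled" at
the same time, which a plain enumeration cannot express.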

[akpm@linux-foundation.org: fix alpha build]
[akpm@linux-foundation.org: fix s390 build]
[akpm@linux-foundation.org: fix sparc build]
[akpm@linux-foundation.org: fix sparc64 build]
[akpm@linux-foundation.org: fix ia64 build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Still apparently needs some ARM and PPC loving - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
Andy Fleming
7132ab7f6e Fix RGMII-ID handling in gianfar
The TSEC/eTSEC can detect the interface to the PHY automatically,
but it isn't able to detect whether the RGMII connection needs internal
delay.  So we need to detect that change in the device tree, propagate
it to the platform data, and then check it if we're in RGMII.  This fixes
a bug on the 8641D HPCN board where the Vitesse PHY doesn't use the delay
for RGMII.
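
The decision described above can be modelled in a few lines of Python.
This is only a sketch: resolve_phy_interface() and the mode strings are
hypothetical names chosen for illustration, not the driver's actual code.

```python
def resolve_phy_interface(detected, phy_connection_type=None):
    """Pick the PHY interface mode.

    detected: mode the controller auto-detected, e.g. "rgmii" or "gmii".
    phy_connection_type: optional string taken from the device tree node.
    """
    # The controller cannot tell RGMII from RGMII-ID, so only that one
    # case is refined using the device tree property.
    if detected == "rgmii" and phy_connection_type == "rgmii-id":
        return "rgmii-id"
    return detected
```

Any other auto-detected mode is passed through unchanged, so boards
without the property keep their current behaviour.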

Signed-off-by: Andy Fleming <afleming@freescale.com>
2007-07-18 18:29:37 -04:00
Andy Fleming
cc65185d40 Add phy-connection-type to gianfar nodes
The TSEC/eTSEC automatically detect their PHY interface type, unless
the type is RGMII-ID (RGMII with internal delay).  In that situation,
it just detects RGMII.  In order to fix this, we need to pass in rgmii-id
if that is the connection type.

Signed-off-by: Andy Fleming <afleming@freescale.com>
2007-07-18 18:29:37 -04:00
Linus Torvalds
5bae7ac9fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6:
  [AVR32] Initialize phy_mask for both macb devices
  [AVR32] Fix atomic_add_unless() and atomic_sub_unless()
  [AVR32] Correct misspelled CONFIG_BLK_DEV_INITRD variable.
  [AVR32] Fix build error in parse_tag_rdimg()
  [AVR32] Don't wire up macb0 unless SW6 is in default position
  [AVR32] Wire up SSC platform device 0 as TX on ATSTK1000 board
  [AVR32] Add Atmel SSC driver platform device to AT32AP architecture
  [AVR32] Remove optimization of unaligned word loads
  [AVR32] Make STK1000 mux settings configurable
  [AVR32] CPU frequency scaling for AT32AP
  [AVR32] Split SM device into PM, RTC, WDT and EIC
  [AVR32] faster avr32 unaligned access
2007-07-18 12:57:52 -07:00
Linus Torvalds
97405fe26b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup:
  [PATCH] x86: do not recompile boot for each build
  [x86 setup] Save/restore DS around invocations of INT 10h
  [x86 setup] VGA: Clear the Protect bit before setting the vertical height
  [x86 setup] Fix assembly constraints
  [x86 setup] build/tools.c: fix comment
  [x86 setup] MAINTAINERS: document x86 setup code git tree
2007-07-18 12:13:02 -07:00
Peter Zijlstra
a10d9a71ba i386: fixup TRACE_IRQ breakage
The TRACE_IRQS_ON function in iret_exc: calls a C function without
ensuring that the segments are set properly. Move the trace function and
the enabling of interrupt into the C stub.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18 12:09:01 -07:00
Roland McGrath
29eb51101c Handle bogus %cs selector in single-step instruction decoding
The code for LDT segment selectors was not robust in the face of a bogus
selector set in %cs via ptrace before the single-step was done.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18 12:09:01 -07:00
Haavard Skinnemoen
587ca7619a [AVR32] Initialize phy_mask for both macb devices
The STK1000 uses pullups on the MDIO lines to the PHY, but they are
too weak. This causes the PHY layer to detect PHYs on all possible MII
addresses. Mask out all but the correct address to prevent this from
happening.
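
A minimal Python sketch of such a mask, assuming the convention that a
set bit means "do not probe this address".  The helper names are
hypothetical.

```python
MII_ADDRESSES = 32  # an MDIO bus addresses PHYs 0..31


def phy_mask_for(addr):
    """Return a 32-bit mask with every address except `addr` masked out.

    A set bit means "do not probe this address" (assumed convention).
    """
    if not 0 <= addr < MII_ADDRESSES:
        raise ValueError("PHY address out of range")
    return ~(1 << addr) & 0xFFFFFFFF


def probed_addresses(mask):
    """List the addresses that would still be probed under `mask`."""
    return [a for a in range(MII_ADDRESSES) if not mask & (1 << a)]
```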

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:47:04 +02:00
Robert P. J. Day
f3e26984f1 [AVR32] Correct misspelled CONFIG_BLK_DEV_INITRD variable.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:47:04 +02:00
Haavard Skinnemoen
aa15f63790 [AVR32] Fix build error in parse_tag_rdimg()
This code is inside an #ifdef with a misspelled config symbol, so it
hasn't been used for a long time. Fix it before fixing the config
symbol to keep bisection working.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:47:04 +02:00
Kristoffer Nyborg Gregertsen
d4003ba0a1 [AVR32] Don't wire up macb0 unless SW6 is in default position
If the user wants to sacrifice macb0 for more GPIOs, let him.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:45:52 +02:00
Hans-Christian Egtvedt
95a42267cd [AVR32] Wire up SSC platform device 0 as TX on ATSTK1000 board
Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:45:52 +02:00
Hans-Christian Egtvedt
9cf6cf58d0 [AVR32] Add Atmel SSC driver platform device to AT32AP architecture
This patch adds register definitions, clocks and IRQs to the platform devices.

Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:45:52 +02:00
David Brownell
a8e93ed8cb [AVR32] Make STK1000 mux settings configurable
This adds some STK1002-specific config options covering the jumper settings,
so the kernel can automatically be configured to include the relevant devices.

One of them replaces the previous internal SW2_DEFAULT setting; SPI config
is affected by two of the jumpers; and a fourth one switches between LCD and
the second Ethernet connector.  (There's more to be done.)

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:45:51 +02:00
Hans-Christian Egtvedt
9e58e1855c [AVR32] CPU frequency scaling for AT32AP
This patch enables CPU frequency scaling for AT32AP devices. This will
enable the CPU to scale between the speed of the high speed bus and
the master clock and thus save some power.

The patch also adds a parent to cpu_clk and a cpu_clk_set_rate to
enable changing the CPU clock divider in a sane way.

The driver does not check if the given rate is 0, thus resulting in a
div by 0.  I think this check should go into the clk_set_rate
framework, and not here.

Tested on AT32AP7000/ATSTK1000.

Hardware documentation can be found in the AT32AP7000 datasheet.

Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
2007-07-18 20:45:51 +02:00
Haavard Skinnemoen
7a5b805907 [AVR32] Split SM device into PM, RTC, WDT and EIC
Split the SM platform device into separate platform devices for PM,
RTC, WDT and EIC. This is more correct according to the documentation
and allows us to simplify the code a little.

Also turn the EIC driver into a real platform driver.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
2007-07-18 20:45:51 +02:00
Sam Ravnborg
3fbc54165d [PATCH] x86: do not recompile boot for each build
Keep the arch/i386/boot directory from being rebuilt every time.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2007-07-18 11:36:17 -07:00
H. Peter Anvin
8c027ae2dc [x86 setup] Save/restore DS around invocations of INT 10h
There exists at least one card, Trident TVGA8900CL (BIOS dated 1992/9/8)
which clobbers DS when "scrolling in an SVGA text mode of more than
800x600 pixels."  Although we are extremely unlikely to run into that
situation, it is cheap insurance to save and restore DS, and it only adds
a grand total of 50 bytes to the total output.

Pointed out by Etienne Lorrain.

Cc: Etienne Lorrain <etienne_lorrain@yahoo.fr>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2007-07-18 11:36:17 -07:00
H. Peter Anvin
7ad37df02c [x86 setup] VGA: Clear the Protect bit before setting the vertical height
If the user has asked for the vertical height registers to be recomputed
by setting bit 15 in the video mode number, we do so without clearing the
Protect bit in the Vertical Retrace Register before setting the Overflow
register.  As a result, if the VGA BIOS had set the Protect bit, the
write to the Overflow register will be dropped, and bits [9:8] of the
vertical height will be left unchanged.

This is a bug imported from the assembly version of this code.  It was
pointed out by Etienne Lorrain.

Cc: Etienne Lorrain <etienne_lorrain@yahoo.fr>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2007-07-18 11:36:17 -07:00
H. Peter Anvin
5593eaa854 [x86 setup] Fix assembly constraints
Fix incorrect assembly constraints.  In particular, fix memory
constraints used inside push..pop, which can cause invalid operation
since gcc may generate %esp-relative references.

Additionally:

outl() should have "dN" not "dn".

query_mca() shouldn't list 16/32-bit registers in an 8-bit only
context.

has_eflag(): the "mask" is only used well after both the stack pointer
and the output registers have been touched; this requires the output
registers to be earlyclobbers (=&) and the input to exclude memory (so
"ri", not "g").

Thanks to Etienne Lorrain and Chuck Ebbert for prompting this review.

Cc: Etienne Lorrain <etienne_lorrain@yahoo.fr>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2007-07-18 11:36:17 -07:00
H. Peter Anvin
9aa3909c0e [x86 setup] build/tools.c: fix comment
Correct a comment in arch/i386/boot/build/tools.c; we now build the
kernel from only two components instead of three, since the boot
sector has been integrated in the setup code.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2007-07-18 11:36:17 -07:00
Linus Torvalds
d756d10e24 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: extent macros cleanup
  Fix compilation with EXT_DEBUG, also fix leXX_to_cpu conversions.
  ext4: remove extra IS_RDONLY() check
  ext4: Use is_power_of_2()
  Use zero_user_page() in ext4 where possible
  ext4: Remove 65000 subdirectory limit
  ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields 
  ext4: Add nanosecond timestamps
  jbd2: Move jbd2-debug file to debugfs
  jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUG
  ext4: Set the journal JBD2_FEATURE_INCOMPAT_64BIT on large devices
  ext4: Make extents code sanely handle on-disk corruption
  ext4: copy i_flags to inode flags on write
  ext4: Enable extents by default
  Change on-disk format to support 2^15 uninitialized extents
  write support for preallocated blocks
  fallocate support in ext4
  sys_fallocate() implementation on i386, x86_64 and powerpc
2007-07-18 10:32:00 -07:00
Linus Torvalds
31bdc5dc76 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Set vio->desc_buf to NULL after freeing.
  [SPARC]: Mark sparc and sparc64 as not having virt_to_bus
  [SPARC64]: Fix reset handling in VNET driver.
  [SPARC64]: Handle reset events in vio_link_state_change().
  [SPARC64]: Handle LDC resets properly in domain-services driver.
  [SPARC64]: Massively simplify VIO device layer and support hot add/remove.
  [SPARC64]: Simplify VNET probing.
  [SPARC64]: Simplify VDC device probing.
  [SPARC64]: Add basic infrastructure for MD add/remove notification.
2007-07-18 10:23:37 -07:00
Linus Torvalds
5cc97bf2d8 Merge branch 'xen-upstream' of ssh://master.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
* 'xen-upstream' of ssh://master.kernel.org/pub/scm/linux/kernel/git/jeremy/xen: (44 commits)
  xen: disable all non-virtual drivers
  xen: use iret directly when possible
  xen: suppress abs symbol warnings for unused reloc pointers
  xen: Attempt to patch inline versions of common operations
  xen: Place vcpu_info structure into per-cpu memory
  xen: handle external requests for shutdown, reboot and sysrq
  xen: machine operations
  xen: add virtual network device driver
  xen: add virtual block device driver.
  xen: add the Xenbus sysfs and virtual device hotplug driver
  xen: Add grant table support
  xen: use the hvc console infrastructure for Xen console
  xen: hack to prevent bad segment register reload
  xen: lazy-mmu operations
  xen: Add support for preemption
  xen: SMP guest support
  xen: Implement sched_clock
  xen: Account for stolen time
  xen: ignore RW mapping of RO pages in pagetable_init
  xen: Complete pagetable pinning
  ...
2007-07-18 10:18:39 -07:00
Tony Breeds
826ea8f22c Revert "[POWERPC] Do firmware feature fixups after features are initialised"
This reverts commit 5a26f6bbb7.

The original patch causes boot failures when built with ppc64_defconfig.  The
quickest fix is to revert it while alternatives are investigated.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18 10:17:39 -07:00
Tony Breeds
4f3731da16 Fix compile failure in arch/powerpc/kernel/pci-common.c
This fixes the fallout from the recent powerpc merge (commit
489de30259):

   CC      arch/powerpc/kernel/pci-common.o
  arch/powerpc/kernel/pci-common.c:160: error: conflicting types for 'pcibios_add_platform_entries'
  include/linux/pci.h:889: error: previous declaration of 'pcibios_add_platform_entries' was here

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Tested-by: Bret Towe <magnade@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18 10:17:39 -07:00
Jeremy Fitzhardinge
dfdcdd42fd xen: disable all non-virtual drivers
A domU Xen environment has no non-virtual drivers, so make sure
they're all disabled at once.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
2007-07-18 08:47:46 -07:00
Jeremy Fitzhardinge
9ec2b804e0 xen: use iret directly when possible
Most of the time we can simply use the iret instruction to exit the
kernel, rather than having to use the iret hypercall - the only
exception is if we're returning into vm86 mode, or from delivering an
NMI (which we don't support yet).

When running native, iret has the behaviour of testing for a pending
interrupt atomically with re-enabling interrupts.  Unfortunately
there's no way to do this with Xen, so there's a window in which we
could get a recursive exception after enabling events but before
actually returning to userspace.

This causes a problem: if the nested interrupt causes one of the
task's TIF_WORK_MASK flags to be set, they will not be checked again
before returning to userspace.  This means that pending work may be
left pending indefinitely, until the process enters and leaves the
kernel again.  The net effect is that a pending signal or reschedule
event could be delayed for an unbounded amount of time.

To deal with this, the xen event upcall handler checks to see if the
EIP is within the critical section of the iret code, after events
are (potentially) enabled up to the iret itself.  If it's within this
range, it calls the iret critical section fixup, which adjusts the
stack to deal with any unrestored registers, and then shifts the
stack frame up to replace the previous invocation.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
2007-07-18 08:47:46 -07:00
Jeremy Fitzhardinge
600b2fc242 xen: suppress abs symbol warnings for unused reloc pointers
arch/i386/xen/xen-asm.S defines some small pieces of code which are
used to implement a few paravirt_ops.  They're designed so they can be
used either in-place, or be inline patched into their callsites if
there's enough space.

Some of those operations need to make calls out (specifically, if you
re-enable events [interrupts], and there's a pending event at that
time).  These calls need the call instruction to be relocated if the
code is patched inline.  In this case xen_foo_reloc is a
section-relative symbol which points to xen_foo's required relocation.

Other operations have no need of a relocation, and so their
corresponding xen_bar_reloc is absolute 0.  These are the cases which
are triggering the warning.

This patch adds those symbols to the list of safe abs symbols.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Adrian Bunk <bunk@stusta.de>
2007-07-18 08:47:45 -07:00
Jeremy Fitzhardinge
6487673b8a xen: Attempt to patch inline versions of common operations
This patch adds the mechanism to allow us to patch inline versions of
common operations.

The implementations of the direct-access versions save_fl, restore_fl,
irq_enable and irq_disable are now in assembler, and the same code is
used for both out of line and inline uses.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Keir Fraser <keir@xensource.com>
2007-07-18 08:47:45 -07:00
Jeremy Fitzhardinge
60223a326f xen: Place vcpu_info structure into per-cpu memory
An experimental patch for Xen allows guests to place their vcpu_info
structs anywhere.  We try to use this to place the vcpu_info into the
PDA, which allows direct access.

If this works, then switch to using direct access operations for
irq_enable, disable, save_fl and restore_fl.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Keir Fraser <keir@xensource.com>
2007-07-18 08:47:45 -07:00
Jeremy Fitzhardinge
3e2b8fbeec xen: handle external requests for shutdown, reboot and sysrq
The guest domain can be asked to shutdown or reboot itself, or have a
sysrq key injected, via xenbus.  This patch adds a watcher for those
events, and does the appropriate action.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
2007-07-18 08:47:45 -07:00
Jeremy Fitzhardinge
fefa629abe xen: machine operations
Make the appropriate hypercalls to halt and reboot the virtual machine.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18 08:47:45 -07:00
Jeremy Fitzhardinge
b536b4b962 xen: use the hvc console infrastructure for Xen console
Implement a Xen back-end for hvc console.

* * *
Add early printk support via hvc console, enable using
"earlyprintk=xen" on the kernel command line.

From: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Olof Johansson <olof@lixom.net>
2007-07-18 08:47:44 -07:00
Jeremy Fitzhardinge
8b84ad942b xen: hack to prevent bad segment register reload
The hypervisor saves and restores the segment registers as part of the
state it saves while context switching.  If, during a context switch,
the next process doesn't use the TLS segments, it invalidates the GDT
entry, causing the segment register reload to fault.  This fault
effectively doubles the cost of a context switch.

This patch is a band-aid workaround which clears the usermode %gs
after it has been saved for the previous process, but before it gets
reloaded for the next, and it avoids having the hypervisor attempt to
erroneously reload it.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18 08:47:44 -07:00
Jeremy Fitzhardinge
d66bf8fcf3 xen: lazy-mmu operations
This patch uses the lazy-mmu hooks to batch mmu operations where
possible.  This is primarily useful for batching operations applied to
active pagetables, which happens during mprotect, munmap, mremap and
the like (mmap does not do bulk pagetable operations, so it isn't
helped).

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18 08:47:44 -07:00
Jeremy Fitzhardinge
f120f13ea0 xen: Add support for preemption
Add Xen support for preemption.  This is mostly a cleanup of existing
preempt_enable/disable calls, or just comments to explain the current
usage.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18 08:47:44 -07:00
Jeremy Fitzhardinge
f87e4cac4f xen: SMP guest support
This is a fairly straightforward Xen implementation of smp_ops.

Xen has its own IPI mechanisms, and has no dependency on any
APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
operation is a single apic_read for the apic version number).

One subtle point which needs to be addressed is unpinning pagetables
when another cpu may have a lazy tlb reference to the pagetable. Xen
will not allow an in-use pagetable to be unpinned, so we must find any
other cpus with a reference to the pagetable and get them to shoot
down their references.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andi Kleen <ak@suse.de>
2007-07-18 08:47:44 -07:00
Jeremy Fitzhardinge
ab55028886 xen: Implement sched_clock
Implement xen_sched_clock, which returns the number of ns the current
vcpu has been actually in an unstolen state (ie, running or blocked,
vs runnable-but-not-running, or offline) since boot.
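
The calculation amounts to summing the time spent in the two "unstolen"
states.  The Python sketch below assumes a per-vcpu table of cumulative
nanoseconds indexed by state; the state numbering and the function name
are illustrative assumptions.

```python
# Assumed state indices for the per-vcpu time table (illustrative only).
RUNNING, RUNNABLE, BLOCKED, OFFLINE = range(4)


def unstolen_ns(state_time_ns):
    """Nanoseconds the vcpu spent running or blocked since boot.

    state_time_ns: sequence of cumulative ns per state, indexed as above.
    Runnable-but-not-running and offline time is deliberately excluded.
    """
    return state_time_ns[RUNNING] + state_time_ns[BLOCKED]
```

For example, with 700 ns running, 200 ns runnable, 100 ns blocked and
50 ns offline, the clock would read 800 ns.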

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: john stultz <johnstul@us.ibm.com>
2007-07-18 08:47:43 -07:00
Jeremy Fitzhardinge
f91a8b447b xen: Account for stolen time
This patch accounts for the time stolen from our VCPUs.  Stolen time is
time where a vcpu is runnable and could be running, but all available
physical CPUs are being used for something else.

This accounting gets run on each timer interrupt, just as a way to get
it run relatively often, and when interesting things are going on.
Stolen time is not really used by much in the kernel; it is reported
in /proc/stat, and that's about it.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
2007-07-18 08:47:43 -07:00
Jeremy Fitzhardinge
9a4029fd34 xen: ignore RW mapping of RO pages in pagetable_init
When setting up the initial pagetable, which includes mappings of all
low physical memory, ignore a mapping which tries to set the RW bit on
an RO pte.  An RO pte indicates a page which is part of the current
pagetable, and so it cannot be allowed to become RW.

Once xen_pagetable_setup_done is called, set_pte reverts to its normal
behaviour.
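
The rule can be sketched as a pure function on pte values.  The bit
positions below follow the usual x86 layout; the function name and the
way the pte is passed around are assumptions for illustration, not the
patch's code.

```python
PAGE_PRESENT = 0x001   # x86 pte "present" bit
PAGE_RW = 0x002        # x86 pte "writable" bit
PAGE_NX = 1 << 63      # no-execute bit (PAE/64-bit pte layout)


def set_pte_init(old_pte, new_pte):
    """Return the value to store while the initial pagetable is built."""
    if (old_pte & PAGE_PRESENT) and not (old_pte & PAGE_RW):
        # Existing read-only mapping: drop the RW request but keep every
        # other change (for example NX) from the new value.
        return new_pte & ~PAGE_RW
    return new_pte
```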

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: ebiederm@xmission.com (Eric W. Biederman)
2007-07-18 08:47:43 -07:00
Jeremy Fitzhardinge
f4f97b3ea9 xen: Complete pagetable pinning
Xen requires all active pagetables to be marked read-only.  When the
base of the pagetable is loaded into %cr3, the hypervisor validates
the entire pagetable and only allows the load to proceed if it all
checks out.

This is pretty slow, so to mitigate this cost Xen has a notion of
pinned pagetables.  Pinned pagetables are pagetables which are
considered to be active even if no processor's cr3 is pointing to it.
This means that it must remain read-only and all updates are validated
by the hypervisor.  This makes context switches much cheaper, because
the hypervisor doesn't need to revalidate the pagetable each time.

This also adds a new paravirt hook which is called during setup once
the zones and memory allocator have been initialized.  When the
init_mm pagetable is first built, the struct page array does not yet
exist, and so there's nowhere to put the init_mm pagetable's PG_pinned
flags.  Once the zones are initialized and the struct page array
exists, we can set the PG_pinned flags for those pages.

This patch also adds the Xen support for pte pages allocated out of
highmem (highpte) by implementing xen_kmap_atomic_pte.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Zach Amsden <zach@vmware.com>
2007-07-18 08:47:43 -07:00
Jeremy Fitzhardinge
e738fca8d7 xen: configuration
Put config options for Xen after the core pieces are in place.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18 08:47:43 -07:00
Jeremy Fitzhardinge
15c84731d6 xen: time implementation
Xen maintains a base clock which measures nanoseconds since system
boot.  This is provided to guests via a shared page which contains a
base time in ns, a tsc timestamp at that point and tsc frequency
parameters.  Guests can compute the current time by reading the tsc
and using it to extrapolate the current time from the basetime.  The
hypervisor makes sure that the frequency parameters are updated
regularly, particularly if the tsc changes rate or stops.
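
The extrapolation can be sketched as follows.  The parameter names and
the fixed-point encoding of the frequency (a binary shift plus a 32-bit
fractional multiplier) are assumptions made for this example.

```python
def guest_now_ns(tsc_now, base_ns, base_tsc, tsc_to_ns_mul, tsc_shift):
    """Extrapolate the current time in ns from a TSC reading.

    base_ns, base_tsc: base time and the TSC value sampled at that point.
    tsc_to_ns_mul:     32-bit fractional multiplier (value / 2**32).
    tsc_shift:         power-of-two pre-scale applied to the TSC delta.
    """
    delta = tsc_now - base_tsc
    # Pre-scale the delta so the multiplier can stay within 32 bits.
    if tsc_shift < 0:
        delta >>= -tsc_shift
    else:
        delta <<= tsc_shift
    return base_ns + ((delta * tsc_to_ns_mul) >> 32)
```

For a 2 GHz TSC, a multiplier of 2**31 (that is, 0.5) with a shift of 0
turns 1000 cycles into 500 ns.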

This is implemented as a clocksource, so the interface to the rest of
the kernel is a simple clocksource which simply returns the current
time directly in nanoseconds.

Xen also provides a simple timer mechanism, which allows a timeout to
be set in the future.  When that time arrives, a timer event is sent
to the guest.  There are two timer interfaces:
 - An old one which also delivers a stream of (unused) ticks at 100Hz,
   and on the same event, the actual timer events.  The 100Hz ticks
   cause a lot of spurious wakeups, but are basically harmless.
 - The new timer interface doesn't have the 100Hz ticks, and can also
   fail if the specified time is in the past.

This code presents the Xen timer as a clockevent driver, and uses the
new interface by preference.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
2007-07-18 08:47:43 -07:00
Jeremy Fitzhardinge
e46cdb66c8 xen: event channels
Xen implements interrupts in terms of event channels.  Each guest
domain gets 1024 event channels which can be used for a variety of
purposes, such as Xen timer events, inter-domain events,
inter-processor events (IPI) or for real hardware IRQs.

Within the kernel, we map the event channels to IRQs, and implement
the whole interrupt handling using a Xen irq_chip.

Rather than setting NR_IRQ to 1024 under PARAVIRT in order to
accommodate Xen, we create a dynamic mapping between event channels and
IRQs.  Ideally, Linux will eventually move towards dynamically
allocating per-irq structures, and we can use a 1:1 mapping between
event channels and irqs.
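
A toy Python model of such a dynamic mapping is shown below.  The class
and method names are hypothetical; the point is only that a small pool
of IRQ numbers is handed out on demand rather than reserving one IRQ per
possible event channel.

```python
class EventChannelIrqMap:
    """Toy model of a dynamic event-channel-to-IRQ mapping."""

    def __init__(self, first_irq, nr_irqs):
        # Pool of IRQ numbers available for dynamic allocation.
        self._free = list(range(first_irq, first_irq + nr_irqs))
        self._evtchn_to_irq = {}
        self._irq_to_evtchn = {}

    def bind(self, evtchn):
        """Return the IRQ bound to `evtchn`, allocating one on first use."""
        if evtchn in self._evtchn_to_irq:
            return self._evtchn_to_irq[evtchn]
        if not self._free:
            raise RuntimeError("no free IRQs")
        irq = self._free.pop(0)
        self._evtchn_to_irq[evtchn] = irq
        self._irq_to_evtchn[irq] = evtchn
        return irq

    def unbind(self, irq):
        """Release `irq` so it can be reused for another event channel."""
        evtchn = self._irq_to_evtchn.pop(irq)
        del self._evtchn_to_irq[evtchn]
        self._free.append(irq)

    def irq_for(self, evtchn):
        """Look up the IRQ for `evtchn`, or None if it is not bound."""
        return self._evtchn_to_irq.get(evtchn)
```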

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric W. Biederman <ebiederm@xmission.com>
2007-07-18 08:47:42 -07:00
Jeremy Fitzhardinge
3b827c1b3a xen: virtual mmu
Xen pagetable handling, including the machinery to implement direct
pagetables.

Xen presents the real CPU's pagetables directly to guests, with no
added shadowing or other layer of abstraction.  Naturally this means
the hypervisor must maintain close control over what the guest can put
into the pagetable.

When the guest modifies the pte/pmd/pgd, it must convert its
domain-specific notion of a "physical" pfn into a global machine frame
number (mfn) before inserting the entry into the pagetable.  Xen will
check to make sure the domain is allowed to create a mapping of the
given mfn.
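
The pfn-to-mfn step can be illustrated with a small Python model.  The
p2m lookup table and the function names are assumptions for this sketch;
a 4 KiB page size is assumed.

```python
PAGE_SHIFT = 12  # 4 KiB pages (assumed)


def make_pte(pfn, flags, p2m):
    """Build a pte from a guest pfn by translating it to a machine frame.

    p2m: mapping from the guest's pseudo-physical frame numbers to mfns.
    """
    mfn = p2m[pfn]
    return (mfn << PAGE_SHIFT) | flags


def pte_to_pfn(pte, m2p):
    """Reverse direction: recover the guest pfn from a pte via an m2p map."""
    return m2p[pte >> PAGE_SHIFT]
```

The guest never writes its own pfn into the pagetable; the value that
reaches the hardware always carries the machine frame number.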

Xen also requires that all mappings the guest has of its own active
pagetable are read-only.  This is relatively easy to implement in
Linux because all pagetables share the same pte pages for kernel
mappings, so updating the pte in one pagetable will implicitly update
the mapping in all pagetables.

Normally a pagetable becomes active when you point to it with cr3 (or
the Xen equivalent), but when you do so, Xen must check the whole
pagetable for correctness, which is clearly a performance problem.

Xen solves this with pinning which keeps a pagetable effectively
active even if it's currently unused, which means that all the normal
update rules are enforced.  This means that it need not revalidate the
pagetable when loading cr3.

This patch has a first-cut implementation of pinning, but it is more
fully implemented in a later patch.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2007-07-18 08:47:42 -07:00
Jeremy Fitzhardinge
5ead97c84f xen: Core Xen implementation
This patch is a rollup of all the core pieces of the Xen
implementation, including:
 - booting and setup
 - pagetable setup
 - privileged instructions
 - segmentation
 - interrupt flags
 - upcalls
 - multicall batching

BOOTING AND SETUP

The vmlinux image is decorated with ELF notes which tell the Xen
domain builder what the kernel's requirements are; the domain builder
then constructs the address space accordingly and starts the kernel.

Xen has its own entrypoint for the kernel (contained in an ELF note).
The ELF notes are set up by xen-head.S, which is included into head.S.
In principle it could be linked separately, but it seems to provoke
lots of binutils bugs.

Because the domain builder starts the kernel in a fairly sane state
(32-bit protected mode, paging enabled, flat segments set up), there's
not a lot of setup needed before starting the kernel proper.  The main
steps are:
  1. Install the Xen paravirt_ops, which is simply a matter of a
     structure assignment.
  2. Set init_mm to use the Xen-supplied pagetables (analogous to the
     head.S generated pagetables in a native boot).
  3. Reserve address space for Xen, since it takes a chunk at the top
     of the address space for its own use.
  4. Call start_kernel()
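
Step 1 above can be pictured as swapping a table of function pointers
by plain structure assignment.  The following is a minimal sketch
only: struct ops_sketch, its fields, and sketch_xen_start() are
hypothetical names for illustration and are not the kernel's actual
paravirt_ops layout.

```c
/* Hypothetical, much-reduced stand-in for an ops table. */
struct ops_sketch {
	const char *name;
	int (*kernel_ring)(void);	/* ring the kernel runs in */
};

static int native_kernel_ring(void) { return 0; }
static int xen_kernel_ring(void)    { return 1; }

/* Hypothetical Xen flavour of the table. */
static const struct ops_sketch xen_ops_sketch = {
	"Xen", xen_kernel_ring
};

/* The table the rest of the code calls through; native by default. */
static struct ops_sketch active_ops = {
	"bare hardware", native_kernel_ring
};

/* Installing the Xen ops is a single structure assignment. */
static void sketch_xen_start(void)
{
	active_ops = xen_ops_sketch;
}
```

Callers keep using active_ops.kernel_ring() and never need to know
which backend was installed.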

PAGETABLE SETUP

Once we hit the main kernel boot sequence, it will end up calling back
via paravirt_ops to set up various pieces of Xen specific state.  One
of the critical things which requires a bit of extra care is the
construction of the initial init_mm pagetable.  Because Xen places
tight constraints on pagetables (an active pagetable must always be
valid, and must always be mapped read-only to the guest domain), we
need to be careful when constructing the new pagetable to keep these
constraints in mind.  It turns out that the easiest way to do this is
to use the initial Xen-provided pagetable as a template, and then just
insert new mappings for memory where a mapping doesn't already exist.

This means that during pagetable setup, it uses a special version of
xen_set_pte which ignores any attempt to remap a read-only page as
read-write (since Xen will map its own initial pagetable as RO), but
lets other changes to the ptes happen, so that things like NX are set
properly.
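
The init-time behaviour described above might be sketched roughly as
follows.  This is an illustration under stated assumptions: pte_t is
modelled as a bare 64-bit value, the PTE_* bit positions follow the
usual x86 PAE layout, and mask_rw() and set_pte_init_sketch() are
hypothetical names, not the functions added by this patch.

```c
#include <stdint.h>

typedef uint64_t pte_t;			/* simplified: a raw pte value */

#define PTE_PRESENT	(1ULL << 0)
#define PTE_RW		(1ULL << 1)
#define PTE_NX		(1ULL << 63)

/*
 * If a mapping already exists and is read-only, drop the RW bit from
 * the new value; every other bit (NX, for example) passes through.
 */
static pte_t mask_rw(pte_t old, pte_t new)
{
	if ((old & PTE_PRESENT) && !(old & PTE_RW))
		new &= ~PTE_RW;
	return new;
}

/* Hypothetical init-time set_pte built on the filter above. */
static void set_pte_init_sketch(pte_t *ptep, pte_t new)
{
	*ptep = mask_rw(*ptep, new);
}
```

A slot that is empty or already writable is updated unchanged; only
the read-only to read-write transition is suppressed.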

PRIVILEGED INSTRUCTIONS AND SEGMENTATION

When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
This means that it is more privileged than user-mode in ring 3, but it
still can't run privileged instructions directly.  Instructions that
are not performance-critical are handled by taking a privilege
exception, trapping into the hypervisor, and emulating the
instruction, but more performance-critical instructions have their own
specific paravirt_ops.  In many cases we can avoid having to do any
hypercalls for these instructions, or the Xen implementation is quite
different from the normal native version.

The privileged instructions fall into the broad classes of:
  Segmentation: setting up the GDT and the GDT entries, LDT,
     TLS and so on.  Xen doesn't allow the GDT to be directly
     modified; all GDT updates are done via hypercalls where the new
     entries can be validated.  This is important because Xen uses
     segment limits to prevent the guest kernel from damaging the
     hypervisor itself.
  Traps and exceptions: Xen uses a special format for trap entrypoints,
     so when the kernel wants to set an IDT entry, it needs to be
     converted to the form Xen expects.  Xen sets int 0x80 up specially
     so that the trap goes straight from userspace into the guest kernel
     without going via the hypervisor.  sysenter isn't supported.
  Kernel stack: The esp0 entry is extracted from the tss and provided to
     Xen.
  TLB operations: the various TLB calls are mapped into corresponding
     Xen hypercalls.
  Control registers: all the control registers are privileged.  The most
     important is cr3, which points to the base of the current pagetable,
     and we handle it specially.

Another instruction we treat specially is CPUID, even though it's not
privileged.  We want to control what CPU features are visible to the
rest of the kernel, and so CPUID ends up going into a paravirt_op.
The Xen implementation uses this mainly to disable the ACPI and APIC
subsystems.

INTERRUPT FLAGS

Xen maintains its own separate flag for masking events, which is
contained within the per-cpu vcpu_info structure.  Because the guest
kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
ignored (and must be, because even if a guest domain disables
interrupts for itself, it can't disable them overall).

(A note on terminology: "events" and interrupts are effectively
synonymous.  However, rather than using an "enable flag", Xen uses a
"mask flag", which blocks event delivery when it is non-zero.)

There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
are implemented to manage the Xen event mask state.  The only thing
worth noting is that when events are unmasked, we need to explicitly
see if there's a pending event and call into the hypervisor to make
sure it gets delivered.
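
The mask-flag logic above can be sketched as below.  This is a
simplified single-CPU model under assumptions: struct vcpu_sketch
stands in for the per-cpu vcpu_info structure, SKETCH_FLAG_IF mimics
the EFLAGS IF bit, and force_event_callback() is a hypothetical stub
for the call into the hypervisor.  Real code would also need
compiler barriers and preemption handling, which are omitted here.

```c
#include <stdint.h>

#define SKETCH_FLAG_IF	0x200UL		/* same bit position as EFLAGS.IF */

struct vcpu_sketch {
	uint8_t pending;		/* an event is waiting */
	uint8_t mask;			/* non-zero blocks delivery */
};

static struct vcpu_sketch vcpu;
static int forced_callbacks;		/* counts simulated hypervisor calls */

/* Hypothetical stand-in for asking the hypervisor to deliver now. */
static void force_event_callback(void)
{
	forced_callbacks++;
	vcpu.pending = 0;
}

/* Report "interrupts enabled" in the form the kernel expects. */
static unsigned long sketch_save_fl(void)
{
	return vcpu.mask ? 0 : SKETCH_FLAG_IF;
}

static void sketch_irq_disable(void)
{
	vcpu.mask = 1;
}

static void sketch_irq_enable(void)
{
	vcpu.mask = 0;
	/* Unmasking does not deliver by itself; check explicitly. */
	if (vcpu.pending)
		force_event_callback();
}

static void sketch_restore_fl(unsigned long flags)
{
	if (flags & SKETCH_FLAG_IF)
		sketch_irq_enable();
	else
		sketch_irq_disable();
}
```

Note the inverted sense: the kernel thinks in terms of an enable bit,
so save_fl and restore_fl translate to and from the mask flag.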

UPCALLS

Xen needs a couple of upcall (or callback) functions to be implemented
by each guest.  One is the event upcall, which is how events
(interrupts, effectively) are delivered to the guests.  The other is
the failsafe callback, which is used to report errors in either
reloading a segment register, or caused by iret.  These are
implemented in i386/kernel/entry.S so they can jump into the normal
iret_exc path when necessary.

MULTICALL BATCHING

Xen provides a multicall mechanism, which allows multiple hypercalls
to be issued at once in order to mitigate the cost of trapping into
the hypervisor.  This is particularly useful for context switches,
since the 4-5 hypercalls they would normally need (reload cr3, update
TLS, maybe update LDT) can be reduced to one.  This patch implements a
generic batching mechanism for hypercalls, which gets used in many
places in the Xen code.
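
The batching idea can be sketched as a small queue that is flushed
with a single trap.  This is an illustration only: struct batch,
batch_add(), batch_flush(), and fake_multicall() are hypothetical
names, and fake_multicall() merely counts calls instead of entering a
hypervisor.

```c
#include <stddef.h>

#define BATCH_MAX 8

struct call_entry {
	unsigned long op;		/* which hypercall */
	unsigned long arg;		/* single argument, for brevity */
};

struct batch {
	struct call_entry calls[BATCH_MAX];
	size_t n;			/* entries currently queued */
};

static unsigned long traps;		/* simulated hypervisor entries */
static unsigned long executed;		/* simulated hypercalls performed */

/* Hypothetical stub: one trap carries out n queued calls. */
static void fake_multicall(const struct call_entry *calls, size_t n)
{
	(void)calls;
	traps++;
	executed += n;
}

/* Issue everything queued so far as one multicall. */
static void batch_flush(struct batch *b)
{
	if (b->n == 0)
		return;
	fake_multicall(b->calls, b->n);
	b->n = 0;
}

/* Queue a call, flushing first if the batch is already full. */
static void batch_add(struct batch *b, unsigned long op, unsigned long arg)
{
	if (b->n == BATCH_MAX)
		batch_flush(b);
	b->calls[b->n].op = op;
	b->calls[b->n].arg = arg;
	b->n++;
}
```

A context switch would queue its cr3, TLS, and LDT updates with
batch_add() and pay for one trap at batch_flush().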

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ian Pratt <ian.pratt@xensource.com>
Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Cc: Adrian Bunk <bunk@stusta.de>
2007-07-18 08:47:42 -07:00
Jeremy Fitzhardinge
24037a8b69 Add nosegneg capability to the vsyscall page notes
Add the "nosegneg" fake capability to the vsyscall page notes. This is
used by the runtime linker to select a glibc version which then
disables negative-offset accesses to the thread-local segment via
%gs. These accesses require emulation in Xen (because segments are
truncated to protect the hypervisor address space) and avoiding them
provides a measurable performance boost.

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Zachary Amsden <zach@vmware.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
2007-07-18 08:47:42 -07:00
Jeremy Fitzhardinge
688340ea34 Add a sched_clock paravirt_op
The tsc-based get_scheduled_cycles interface is not a good match for
Xen's runstate accounting, which reports everything in nanoseconds.

This patch replaces this interface with a sched_clock interface, which
matches both Xen and VMI's requirements.

In order to do this, we:
   1. replace get_scheduled_cycles with sched_clock
   2. hoist cycles_2_ns into a common header
   3. update vmi accordingly

One thing to note: because sched_clock is implemented as a weak
function in kernel/sched.c, we must define a real function in order to
override this weak binding.  This means the usual paravirt_ops
technique of using an inline function won't work in this case.
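
For readers unfamiliar with the helper hoisted in step 2, a
cycles-to-nanoseconds conversion is typically done with fixed-point
scaling so that no division is needed per call.  The sketch below
assumes a 10-bit scale shift and a CPU frequency given in kHz; the
names and constants are illustrative and may not match the kernel's
header exactly.

```c
#include <stdint.h>

#define SKETCH_SCALE_SHIFT 10		/* assumed fixed-point precision */

static uint64_t sketch_cyc2ns_scale;	/* ns per cycle, scaled by 2^10 */

/* Precompute the scale once from the CPU frequency in kHz. */
static void sketch_set_cyc2ns_scale(uint64_t cpu_khz)
{
	sketch_cyc2ns_scale = (1000000ULL << SKETCH_SCALE_SHIFT) / cpu_khz;
}

/*
 * Convert a cycle count to nanoseconds with a multiply and a shift.
 * The product can overflow 64 bits for very large cycle counts; this
 * sketch does not guard against that.
 */
static uint64_t sketch_cycles_2_ns(uint64_t cycles)
{
	return (cycles * sketch_cyc2ns_scale) >> SKETCH_SCALE_SHIFT;
}
```

With the scale set for a 2 GHz CPU (cpu_khz of 2000000), 2000 cycles
convert to 1000 ns.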

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Dan Hecht <dhecht@vmware.com>
Cc: john stultz <johnstul@us.ibm.com>
2007-07-18 08:47:42 -07:00