linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-24 23:34:40 +07:00

Author	SHA1	Message	Date
Anton Blanchard	ef1313deaf	powerpc: Add VMX optimised xor for RAID5 Add a VMX optimised xor, used primarily for RAID5. On a POWER7 blade this is a decent win: 32regs : 17932.800 MB/sec altivec : 19724.800 MB/sec The bigger gain is when the same test is run in SMT4 mode, as it would if there was a lot of work going on: 8regs : 8377.600 MB/sec altivec : 15801.600 MB/sec I tested this against an array created without the patch, and also verified it worked as expected on a little endian kernel. [ Fix !CONFIG_ALTIVEC build -- BenH ] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-30 16:02:28 +11:00
Tom Musta	dbc2fbd7c2	powerpc: Fix Unaligned LE Floating Point Loads and Stores This patch addresses unaligned single precision floating point loads and stores in the single-step code. The old implementation improperly treated an 8 byte structure as an array of two 4 byte words, which is a classic little endian bug. Signed-off-by: Tom Musta <tmusta@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-30 16:01:36 +11:00
Tom Musta	6506b4718b	powerpc: Fix Unaligned Loads and Stores This patch modifies the unaligned access routines of the sstep.c module so that it properly reverses the bytes of storage operands in the little endian kernel kernel. This is implemented by breaking an unaligned little endian access into a combination of single byte accesses plus an overal byte reversal operation. Signed-off-by: Tom Musta <tmusta@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-30 16:01:30 +11:00
Anton Blanchard	de577a3568	powerpc: Use generic memcpy code in little endian We need to fix some endian issues in our memcpy code. For now just enable the generic memcpy routine for little endian builds. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-11 16:48:40 +11:00
Anton Blanchard	7a332b0c9a	powerpc: Use generic checksum code in little endian We need to fix some endian issues in our checksum code. For now just enable the generic checksum routines for little endian builds. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-11 16:48:39 +11:00
Anton Blanchard	32ee1e188e	powerpc: Fix endian issues in VMX copy loops Fix the permute loops for little endian. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-11 16:48:25 +11:00
Paul E. McKenney	8f21bd0090	powerpc: Restore registers on error exit from csum_partial_copy_generic() The csum_partial_copy_generic() function saves the PowerPC non-volatile r14, r15, and r16 registers for the main checksum-and-copy loop. Unfortunately, it fails to restore them upon error exit from this loop, which results in silent corruption of these registers in the presumably rare event of an access exception within that loop. This commit therefore restores these register on error exit from the loop. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-03 17:22:42 +10:00
Paul E. McKenney	d9813c3681	powerpc: Fix parameter clobber in csum_partial_copy_generic() The csum_partial_copy_generic() uses register r7 to adjust the remaining bytes to process. Unfortunately, r7 also holds a parameter, namely the address of the flag to set in case of access exceptions while reading the source buffer. Lacking a quantum implementation of PowerPC, this commit instead uses register r9 to do the adjusting, leaving r7's pointer uncorrupted. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-10-03 17:22:36 +10:00
Benjamin Herrenschmidt	cbc9565ee8	powerpc: Remove ksp_limit on ppc64 We've been keeping that field in thread_struct for a while, it contains the "limit" of the current stack pointer and is meant to be used for detecting stack overflows. It has a few problems however: - First, it was never actually used on 64-bit. Set and updated but not actually exploited - When switching stack to/from irq and softirq stacks, it's update is racy unless we hard disable interrupts, which is costly. This is fine on 32-bit as we don't soft-disable there but not on 64-bit. Thus rather than fixing 2 in order to implement 1 in some hypothetical future, let's remove the code completely from 64-bit. In order to avoid a clutter of ifdef's, we remove the updates from C code completely during interrupt stack switching, and instead maintain it from the asm helper that is used to do the stack switching in the first place. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-09-25 14:15:51 +10:00
Tom Musta	17e8de7e18	powerpc: Unaligned stores and stmw are broken in emulation code The stmw instruction was incorrectly decoded as an update form instruction and thus the RA register was being clobbered. Also, the utility routine to write memory to unaligned addresses breaks the operation into smaller aligned accesses but was incorrectly incrementing the address by only one; it needs to increment the address by the size of the smaller aligned chunk. Signed-off-by: Tom Musta <tmusta@us.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-08-27 14:36:08 +10:00
Anton Blanchard	7ffcf8ec26	powerpc: Fix little endian lppaca, slb_shadow and dtl_entry The lppaca, slb_shadow and dtl_entry hypervisor structures are big endian, so we have to byte swap them in little endian builds. LE KVM hosts will also need to be fixed but for now add an #error to remind us. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-08-14 15:33:35 +10:00
Michael Neuling	70a54a4fae	powerpc: Fix single step emulation of 32bit overflowed branches Check truncate_if_32bit() on final write to nip. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-06-20 16:55:13 +10:00
Michael Neuling	280a5ba22c	powerpc/pseries: Improve stream generation comments in copypage/user No code changes, just documenting what's happening a little better. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-06-01 08:29:26 +10:00
Suzuki K. Poulose	5e249d4528	uprobes/powerpc: Add dependency on single step emulation Uprobes uses emulate_step in sstep.c, but we haven't explicitly specified the dependency. On pseries HAVE_HW_BREAKPOINT protects us, but 44x has no such luxury. Consolidate other users that depend on sstep and create a new config option. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com> Cc: linuxppc-dev@ozlabs.org Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-01-29 11:35:06 +11:00
Anton Blanchard	1fbe9cf259	powerpc: Build kernel with -mcmodel=medium Finally remove the two level TOC and build with -mcmodel=medium. Unfortunately we can't build modules with -mcmodel=medium due to the tricks the kernel module loader plays with percpu data: # -mcmodel=medium breaks modules because it uses 32bit offsets from # the TOC pointer to create pointers where possible. Pointers into the # percpu data area are created by this method. # # The kernel module loader relocates the percpu data section from the # original location (starting with 0xd...) to somewhere in the base # kernel percpu data space (starting with 0xc...). We need a full # 64bit relocation for this to work, hence -mcmodel=large. On older kernels we fall back to the two level TOC (-mminimal-toc) Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-01-10 17:00:31 +11:00
Nishanth Aravamudan	c8adfeccee	powerpc: Fix VMX fix for memcpy case In `2fae7cdb60` ("powerpc: Fix VMX in interrupt check in POWER7 copy loops"), Anton inadvertently introduced a regression for memcpy on POWER7 machines. copyuser and memcpy diverge slightly in their use of cr1 (copyuser doesn't use it, but memcpy does) and you end up clobbering that register with your fix. That results in (taken from an FC18 kernel): [ 18.824604] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052f40 [ 18.824618] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1] [ 18.824623] SMP NR_CPUS=1024 NUMA pSeries [ 18.824633] Modules linked in: tg3(+) be2net(+) cxgb4(+) ipr(+) sunrpc xts lrw gf128mul dm_crypt dm_round_robin dm_multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua squashfs cramfs [ 18.824705] NIP: c000000000052f40 LR: c00000000020b874 CTR: 0000000000000512 [ 18.824709] REGS: c000001f1fef7790 TRAP: 0f20 Not tainted (3.6.0-0.rc6.git0.2.fc18.ppc64) [ 18.824713] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 4802802e XER: 20000010 [ 18.824726] SOFTE: 0 [ 18.824728] CFAR: 0000000000000f20 [ 18.824731] TASK = c000000fa7128400[0] 'swapper/24' THREAD: c000000fa7480000 CPU: 24 GPR00: 00000000ffffffc0 c000001f1fef7a10 c00000000164edc0 c000000f9b9a8120 GPR04: c000000f9b9a8124 0000000000001438 0000000000000060 03ffffff064657ee GPR08: 0000000080000000 0000000000000010 0000000000000020 0000000000000030 GPR12: 0000000028028022 c00000000ff25400 0000000000000001 0000000000000000 GPR16: 0000000000000000 7fffffffffffffff c0000000016b2180 c00000000156a500 GPR20: c000000f968c7a90 c0000000131c31d8 c000001f1fef4000 c000000001561d00 GPR24: 000000000000000a 0000000000000000 0000000000000001 0000000000000012 GPR28: c000000fa5c04f80 00000000000008bc c0000000015c0a28 000000000000022e [ 18.824792] NIP [c000000000052f40] .memcpy_power7+0x5a0/0x7c4 [ 18.824797] LR [c00000000020b874] .pcpu_free_area+0x174/0x2d0 [ 18.824800] Call Trace: [ 18.824803] [c000001f1fef7a10] [c000000000052c14] .memcpy_power7+0x274/0x7c4 (unreliable) [ 18.824809] [c000001f1fef7b10] [c00000000020b874] .pcpu_free_area+0x174/0x2d0 [ 18.824813] [c000001f1fef7bb0] [c00000000020ba88] .free_percpu+0xb8/0x1b0 [ 18.824819] [c000001f1fef7c50] [c00000000043d144] .throtl_pd_exit+0x94/0xd0 [ 18.824824] [c000001f1fef7cf0] [c00000000043acf8] .blkg_free+0x88/0xe0 [ 18.824829] [c000001f1fef7d90] [c00000000018c048] .rcu_process_callbacks+0x2e8/0x8a0 [ 18.824835] [c000001f1fef7e90] [c0000000000a8ce8] .__do_softirq+0x158/0x4d0 [ 18.824840] [c000001f1fef7f90] [c000000000025ecc] .call_do_softirq+0x14/0x24 [ 18.824845] [c000000fa7483650] [c000000000010e80] .do_softirq+0x160/0x1a0 [ 18.824850] [c000000fa74836f0] [c0000000000a94a4] .irq_exit+0xf4/0x120 [ 18.824854] [c000000fa7483780] [c000000000020c44] .timer_interrupt+0x154/0x4d0 [ 18.824859] [c000000fa7483830] [c000000000003be0] decrementer_common+0x160/0x180 [ 18.824866] --- Exception: 901 at .plpar_hcall_norets+0x84/0xd4 [ 18.824866] LR = .check_and_cede_processor+0x48/0x80 [ 18.824871] [c000000fa7483b20] [c00000000007f018] .check_and_cede_processor+0x18/0x80 (unreliable) [ 18.824877] [c000000fa7483b90] [c00000000007f104] .dedicated_cede_loop+0x84/0x150 [ 18.824883] [c000000fa7483c50] [c0000000006bc030] .cpuidle_enter+0x30/0x50 [ 18.824887] [c000000fa7483cc0] [c0000000006bc9f4] .cpuidle_idle_call+0x104/0x720 [ 18.824892] [c000000fa7483d80] [c000000000070af8] .pSeries_idle+0x18/0x40 [ 18.824897] [c000000fa7483df0] [c000000000019084] .cpu_idle+0x1a4/0x380 [ 18.824902] [c000000fa7483ec0] [c0000000008a4c18] .start_secondary+0x520/0x528 [ 18.824907] [c000000fa7483f90] [c0000000000093f0] .start_secondary_prolog+0x10/0x14 [ 18.824911] Instruction dump: [ 18.824914] 38840008 90030000 90e30004 38630008 7ca62850 7cc300d0 78c7e102 7cf01120 [ 18.824923] 78c60660 39200010 39400020 39600030 <7e00200c> 7c0020ce 38840010 409f001c [ 18.824935] ---[ end trace 0bb95124affaaa45 ]--- [ 18.825046] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052d08 I believe the right fix is to make memcpy match usercopy and not use cr1. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@kernel.org> [v3.6]	2012-10-04 18:02:43 +10:00
Tiejun Chen	8e9f693715	powerpc/kprobe: Don't emulate store when kprobe stwu r1 We don't do the real store operation for kprobing 'stwu Rx,(y)R1' since this may corrupt the exception frame, now we will do this operation safely in exception return code after migrate current exception frame below the kprobed function stack. So we only update gpr[1] here and trigger a thread flag to mask this. Note we should make sure if we trigger kernel stack over flow. Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-09-18 15:32:45 +10:00
Benjamin Herrenschmidt	636802ef96	powerpc: Don't use __put_user() in patch_instruction patch_instruction() can be called very early on ppc32, when the kernel isn't yet running at it's linked address. That can cause the ! is_kernel_addr() test in __put_user() to trip and call might_sleep() which is very bad at that point during boot. Use a lower level function instead for now, at least until we get to rework ppc32 boot process to do the code patching later, like ppc64 does. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-09-05 16:05:23 +10:00
Anton Blanchard	2fae7cdb60	powerpc: Fix VMX in interrupt check in POWER7 copy loops The enhanced prefetch hint patches corrupt the condition register that was used to check if we are in interrupt. Fix this by using cr1. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-08-24 20:26:09 +10:00
Anton Blanchard	dad477ccd6	powerpc: POWER7 copy_to_user/copy_from_user patch applied twice "powerpc: Use enhanced touch instructions in POWER7 copy_to_user/copy_from_user" was applied twice. Remove one. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-08-24 20:26:09 +10:00
Stephen Rothwell	1d5a436d2c	powerpc: Put the gpr save/restore functions in their own section This allows the linker to know that calls to them do not need to switch TOC and stop errors like the following when linking large configurations: powerpc64-linux-ld: drivers/built-in.o: In function `.gpiochip_is_requested': (.text+0x4): sibling call optimization to `_savegpr0_29' does not allow automatic multiple TOCs; recompile with -mminimal-toc or -fno-optimize-sibling-calls, or make `_savegpr0_29' extern Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-11 14:19:59 +10:00
Michael Neuling	e55174e911	powerpc: Fixes for instructions not using correct register naming These macros are using integers where they could be using logical names since they take registers. We are going to enforce this soon, so fix these up now. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-10 19:18:16 +10:00
Michael Neuling	86e32fdce7	powerpc: Change mtcrf to use real register names mtocrf define is just a wrapper around the real instructions so we can just use real register names here (ie. lower case). Also remove braces in macro so this is possible. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-10 19:18:11 +10:00
Michael Neuling	44ce6a5ee7	powerpc: Merge STK_REG/PARAM/FRAMESIZE Merge the defines of STACKFRAMESIZE, STK_REG, STK_PARAM from different places. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-10 19:18:03 +10:00
Michael Neuling	c75df6f96c	powerpc: Fix usage of register macros getting ready for %r0 change Anything that uses a constructed instruction (ie. from ppc-opcode.h), need to use the new R0 macro, as %r0 is not going to work. Also convert usages of macros where we are just determining an offset (usually for a load/store), like: std r14,STK_REG(r14)(r1) Can't use STK_REG(r14) as %r14 doesn't work in the STK_REG macro since it's just calculating an offset. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-10 19:17:55 +10:00
Anton Blanchard	cf8fb5533f	powerpc: Optimise the 64bit optimised __clear_user I blame Mikey for this. He elevated my slightly dubious testcase: to benchmark status. And naturally we need to be number 1 at creating zeros. So lets improve __clear_user some more. As Paul suggests we can use dcbz for large lengths. This patch gets the destination cacheline aligned then uses dcbz on whole cachelines. Before: 10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s After: 10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s 39 GB/s, a new record. Signed-off-by: Anton Blanchard <anton@samba.org> Tested-by: Olof Johansson <olof@lixom.net> Acked-by: Olof Johansson <olof@lixom.net> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:48 +10:00
Anton Blanchard	b3f271e86e	powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch Implement a POWER7 optimised memcpy using VMX and enhanced prefetch instructions. This is a copy of the POWER7 optimised copy_to_user/copy_from_user loop. Detailed implementation and performance details can be found in commit `a66086b819` (powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX). I noticed memcpy issues when profiling a RAID6 workload: .memcpy .async_memcpy .async_copy_data .__raid_run_ops .handle_stripe .raid5d .md_thread I created a simplified testcase by building a RAID6 array with 4 1GB ramdisks (booting with brd.rd_size=1048576): # mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3] I then timed how long it took to write to the entire array: # dd if=/dev/zero of=/dev/md0 bs=1M Before: 892 MB/s After: 999 MB/s A 12% improvement. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:46 +10:00
Anton Blanchard	bce4b4bd91	powerpc: Use enhanced touch instructions in POWER7 copy_to_user/copy_from_user Version 2.06 of the POWER ISA introduced enhanced touch instructions, allowing us to specify a number of attributes including the length of a stream. This patch adds a software stream for both loads and stores in the POWER7 copy_tofrom_user loop. Since the setup is quite complicated and we have to use an eieio to ensure correct ordering of the "GO" command we only do this for copies above 4kB. To quantify any performance improvements we need a working set bigger than the caches so we operate on a 1GB file: # dd if=/dev/zero of=/tmp/foo bs=1M count=1024 And we compare how fast we can read the file: # dd if=/tmp/foo of=/dev/null bs=1M before: 7.7 GB/s after: 9.6 GB/s A 25% improvement. The worst case for this patch will be a completely L1 cache contained copy of just over 4kB. We can test this with the copy_to_user testcase we used to tune copy_tofrom_user originally: http://ozlabs.org/~anton/junkcode/copy_to_user.c # time ./copy_to_user2 -l 4224 -i 10000000 before: 6.807 s after: 6.946 s A 2% slowdown, which seems reasonable considering our data is unlikely to be completely L1 contained. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:45 +10:00
Anton Blanchard	fde69282b7	powerpc: POWER7 optimised copy_page using VMX and enhanced prefetch Implement a POWER7 optimised copy_page using VMX and enhanced prefetch instructions. We use enhanced prefetch hints to prefetch both the load and store side. We copy a cacheline at a time and fall back to regular loads and stores if we are unable to use VMX (eg we are in an interrupt). The following microbenchmark was used to assess the impact of the patch: http://ozlabs.org/~anton/junkcode/page_fault_file.c We test MAP_PRIVATE page faults across a 1GB file, 100 times: # time ./page_fault_file -p -l 1G -i 100 Before: 22.25s After: 18.89s 17% faster Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:44 +10:00
Anton Blanchard	6f7839e542	powerpc: Rename copyuser_power7_vmx.c to vmx-helper.c Subsequent patches will add more VMX library functions and it makes sense to keep all the c-code helper functions in the one file. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:43 +10:00
Anton Blanchard	a9514dc69d	powerpc: Use enhanced touch instructions in POWER7 copy_to_user/copy_from_user Version 2.06 of the POWER ISA introduced enhanced touch instructions, allowing us to specify a number of attributes including the length of a stream. This patch adds a software stream for both loads and stores in the POWER7 copy_tofrom_user loop. Since the setup is quite complicated and we have to use an eieio to ensure correct ordering of the "GO" command we only do this for copies above 4kB. To quantify any performance improvements we need a working set bigger than the caches so we operate on a 1GB file: # dd if=/dev/zero of=/tmp/foo bs=1M count=1024 And we compare how fast we can read the file: # dd if=/tmp/foo of=/dev/null bs=1M before: 7.7 GB/s after: 9.6 GB/s A 25% improvement. The worst case for this patch will be a completely L1 cache contained copy of just over 4kB. We can test this with the copy_to_user testcase we used to tune copy_tofrom_user originally: http://ozlabs.org/~anton/junkcode/copy_to_user.c # time ./copy_to_user2 -l 4224 -i 10000000 before: 6.807 s after: 6.946 s A 2% slowdown, which seems reasonable considering our data is unlikely to be completely L1 contained. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:42 +10:00
Anton Blanchard	17968fbbd1	powerpc: 64bit optimised __clear_user I noticed __clear_user high up in a profile of one of my RAID stress tests. The testcase was doing a dd from /dev/zero which ends up calling __clear_user. __clear_user is basically a loop with a single 4 byte store which is horribly slow. We can do much better by aligning the desination and doing 32 bytes of 8 byte stores in a loop. The following testcase was used to verify the patch: http://ozlabs.org/~anton/junkcode/stress_clear_user.c To show the improvement in performance I ran a dd from /dev/zero to /dev/null on a POWER7 box: Before: # dd if=/dev/zero of=/dev/null bs=1M count=10000 10485760000 bytes (10 GB) copied, 3.72379 s, 2.8 GB/s After: # time dd if=/dev/zero of=/dev/null bs=1M count=10000 10485760000 bytes (10 GB) copied, 0.728318 s, 14.4 GB/s Over 5x faster. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:41 +10:00
Steven Rostedt	b6e3796834	powerpc: Have patch_instruction detect faults For ftrace to use the patch_instruction code, it needs to check for faults on write. Ftrace updates code all over the kernel, and we need to know if code is updated or not due to protections that are placed on some portions of the kernel. If ftrace does not detect a fault, it will error later on, and it will be much more difficult to find the problem. By changing patch_instruction() to detect faults, then ftrace will be able to make use of it too. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-07-03 14:14:38 +10:00
Paul Mackerras	1629372caa	powerpc: Use the new generic strncpy_from_user() and strnlen_user() This is much the same as for SPARC except that we can do the find_zero() function more efficiently using the count-leading-zeroes instructions. Tested on 32-bit and 64-bit PowerPC. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-05-27 21:00:07 -07:00
Anton Blanchard	694caf0255	powerpc: Remove CONFIG_POWER4_ONLY Remove CONFIG_POWER4_ONLY, the option is badly named and only does two things: - It wraps the MMU segment table code. With feature fixups there is little downside to compiling this in. - It uses the newer mtocrf instruction in various assembly functions. Instead of making this a compile option just do it at runtime via a feature fixup. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-04-30 15:37:26 +10:00
David Howells	ae3a197e3d	Disintegrate asm/system.h for PowerPC Disintegrate asm/system.h for PowerPC. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> cc: linuxppc-dev@lists.ozlabs.org	2012-03-28 18:30:02 +01:00
Stephen Rothwell	f5339277eb	powerpc: Remove FW_FEATURE ISERIES from arch code This is no longer selectable, so just remove all the dependent code. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2012-03-21 11:16:11 +11:00
Anton Blanchard	a66086b819	powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX Implement a POWER7 optimised copy_to_user/copy_from_user using VMX. For large aligned copies this new loop is over 10% faster, and for large unaligned copies it is over 200% faster. If we take a fault we fall back to the old version, this keeps things relatively simple and easy to verify. On POWER7 unaligned stores rarely slow down - they only flush when a store crosses a 4KB page boundary. Furthermore this flush is handled completely in hardware and should be 20-30 cycles. Unaligned loads on the other hand flush much more often - whenever crossing a 128 byte cache line, or a 32 byte sector if either sector is an L1 miss. Considering this information we really want to get the loads aligned and not worry about the alignment of the stores. Microbenchmarks confirm that this approach is much faster than the current unaligned copy loop that uses shifts and rotates to ensure both loads and stores are aligned. We also want to try and do the stores in cacheline aligned, cacheline sized chunks. If the store queue is unable to merge an entire cacheline of stores then the L2 cache will have to do a read/modify/write. Even worse, we will serialise this with the stores in the next iteration of the copy loop since both iterations hit the same cacheline. Based on this, the new loop does the following things: 1 - 127 bytes Get the source 8 byte aligned and use 8 byte loads and stores. Pretty boring and similar to how the current loop works. 128 - 4095 bytes Get the source 8 byte aligned and use 8 byte loads and stores, 1 cacheline at a time. We aren't doing the stores in cacheline aligned chunks so we will potentially serialise once per cacheline. Even so it is much better than the loop we have today. 4096 - bytes If both source and destination have the same alignment get them both 16 byte aligned, then get the destination cacheline aligned. Do cacheline sized loads and stores using VMX. If source and destination do not have the same alignment, we get the destination cacheline aligned, and use permute to do aligned loads. In both cases the VMX loop should be optimal - we always do aligned loads and stores and are always doing stores in cacheline aligned, cacheline sized chunks. To be able to use VMX we must be careful about interrupts and sleeping. We don't use the VMX loop when in an interrupt (which should be rare anyway) and we wrap the VMX loop in disable/enable_pagefault and fall back to the existing copy_tofrom_user loop if we do need to sleep. The VMX breakpoint of 4096 bytes was chosen using this microbenchmark: http://ozlabs.org/~anton/junkcode/copy_to_user.c Since we are using VMX and there is a cost to saving and restoring the user VMX state there are two broad cases we need to benchmark: - Best case - userspace never uses VMX - Worst case - userspace always uses VMX In reality a userspace process will sit somewhere between these two extremes. Since we need to test both aligned and unaligned copies we end up with 4 combinations. The point at which the VMX loop begins to win is: 0% VMX aligned 2048 bytes unaligned 2048 bytes 100% VMX aligned 16384 bytes unaligned 8192 bytes Considering this is a microbenchmark, the data is hot in cache and the VMX loop has better store queue merging properties we set the breakpoint to 4096 bytes, a little below the unaligned breakpoints. Some future optimisations we can look at: - Looking at the perf data, a significant part of the cost when a task is always using VMX is the extra exception we take to restore the VMX state. As such we should do something similar to the x86 optimisation that restores FPU state for heavy users. ie: /* * If the task has used fpu the last 5 timeslices, just do a full * restore of the math state immediately to avoid the trap; the * chances of needing FPU soon are obviously high now / preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5; and / * fpu_counter contains the number of consecutive context switches * that the FPU is used. If this is over a threshold, the lazy fpu * saving becomes unlazy to save the trap. This is an unsigned char * so that after 256 times the counter wraps and the behavior turns * lazy again; this to deal with bursty apps that only use FPU for * a short time */ - We could create a paca bit to mirror the VMX enabled MSR bit and check that first, avoiding multiple calls to calling enable_kernel_altivec. That should help with iovec based system calls like readv. - We could have two VMX breakpoints, one for when we know the user VMX state is loaded into the registers and one when it isn't. This could be a second bit in the paca so we can calculate the break points quickly. - One suggestion from Ben was to save and restore the VSX registers we use inline instead of using enable_kernel_altivec. [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-12-19 14:40:40 +11:00
Anton Blanchard	d715e433b7	powerpc: Copy down exception vectors after feature fixups kdump fails because we try to execute an HV only instruction. Feature fixups are being applied after we copy the exception vectors down to 0 so they miss out on any updates. We have always had this issue but it only became critical in v3.0 when we added CFAR support (breaks POWER5) and v3.1 when we added POWERNV (breaks everyone). Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> [v3.0+] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-11-16 14:47:54 +11:00
Paul Gortmaker	4b16f8e2d6	powerpc: various straight conversions from module.h --> export.h All these files were including module.h just for the basic EXPORT_SYMBOL infrastructure. We can shift them off to the export.h header which is a way smaller footprint and thus realize some compile time gains. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:44 -04:00
Linus Torvalds	82aff107f8	Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (152 commits) powerpc: Fix hard CPU IDs detection powerpc/pmac: Update via-pmu to new syscore_ops powerpc/kvm: Fix the build for 32-bit Book 3S (classic) processors powerpc/kvm: Fix kvmppc_core_pending_dec powerpc: Remove last piece of GEMINI powerpc: Fix for Pegasos keyboard and mouse powerpc: Make early memory scan more resilient to out of order nodes powerpc/pseries/iommu: Cleanup ddw naming powerpc/pseries/iommu: Find windows after kexec during boot powerpc/pseries/iommu: Remove ddw property when destroying window powerpc/pseries/iommu: Add additional checks when changing iommu mask powerpc/pseries/iommu: Use correct return type in dupe_ddw_if_already_created powerpc: Remove unused/obsolete CONFIG_XICS misc: Add CARMA DATA-FPGA Programmer support misc: Add CARMA DATA-FPGA Access Driver powerpc: Make IRQ_NOREQUEST last to clear, first to set powerpc: Integrated Flash controller device tree bindings powerpc/85xx: Create dts of each core in CAMP mode for P1020RDB powerpc/85xx: Fix PCIe IDSEL for Px020RDB powerpc/85xx: P2020 DTS: re-organize dts files ...	2011-05-20 13:28:01 -07:00
Linus Torvalds	268bb0ce3e	sanitize <linux/prefetch.h> usage Commit `e66eed651f` ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw(' -- '.[ch]') grep -L 'prefetchw(' $(git grep -l 'linux/prefetch.h' -- '.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-20 12:50:29 -07:00
Milton Miller	a56555e573	powerpc: Remove alloc_maybe_bootmem for zalloc version Replace all remaining callers of alloc_maybe_bootmem with zalloc_maybe_bootmem. The callsite in pci_dn is followed with a memset to clear the memory, and not zeroing at the other callsites in the celleb fake pci code could lead to following uninitialized memory as pointers or even freeing said pointers on error paths. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-05-19 15:30:57 +10:00
Anton Blanchard	40f1ce7fb7	powerpc: Remove ioremap_flags We have a confusing number of ioremap functions. Make things just a bit simpler by merging ioremap_flags and ioremap_prot. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-05-19 14:30:43 +10:00
Anton Blanchard	d988f0e3f8	powerpc: Simplify 4k/64k copy_page logic To make it easier to add optimised versions of copy_page, remove the 4kB loop for 64kB pages and just do all the work in copy_page. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-05-19 14:30:42 +10:00
Michael Ellerman	b91e136cdf	powerpc: Use MSR_64BIT in sstep.c, fix kprobes on BOOK3E We check MSR_SF a lot in sstep.c, to decide if we need to emulate the truncation of values when running in 32-bit mode. Factor out that code into a helper, and convert it and the other uses to use MSR_64BIT. This fixes a bug on BOOK3E where kprobes would end up returning to a 32-bit address, because regs->nip was truncated, because (msr & MSR_SF) was false. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-04-27 14:18:46 +10:00
Michael Ellerman	c0337288ab	powerpc: Ensure the else case of feature sections will fit When we create an alternative feature section, the else case must be the same size or smaller than the body. This is because when we patch the else case in we just overwrite the body, so there must be room. Up to now we just did this by inspection, but it's quite easy to enforce it in the assembler, so we should. The only change is to add the ifgt block, but that effects the alignment of the tabs and so the whole macro is modified. Also add a test, but #if 0 it because we don't want to break the build. Anyone who's modifying the feature macros should enable the test. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-01-21 14:08:33 +11:00
Anton Blanchard	b5f9b6665b	powerpc: Hardcode popcnt instructions for old assemblers The popcnt instructions went into binutils relatively recently. As with a number of other instructions, create macros and hardcode them. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-12-09 15:35:30 +11:00
Anton Blanchard	64ff312876	powerpc: Add support for popcnt instructions POWER5 added popcntb, and POWER7 added popcntw and popcntd. As a first step this patch does all the work out of line, but it would be nice to implement them as inlines with an out of line fallback. The performance issue with hweight was noticed when disabling SMT on a large (192 thread) POWER7 box. The patch improves that testcase by about 8%. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-11-29 15:48:17 +11:00
matt mooney	4108d9ba90	powerpc/Makefiles: Change to new flag variables Replace EXTRA_CFLAGS with ccflags-y and EXTRA_AFLAGS with asflags-y. Signed-off-by: matt mooney <mfm@muteddisk.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-10-13 16:19:22 +11:00
Sean MacLennan	cd64d1697c	powerpc: mtmsrd not defined Replace the BOOK3S_64 specific mtmsrd with the generic MTMSRD macro. Only enable ldstfp when CONFIG_PPC_FPU is set. Signed-off-by: Sean MacLennan <smaclennan@pikatech.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-09-02 14:07:34 +10:00
Sean MacLennan	025c0186a0	powerpc: Fix incorrect .stabs entry for copy_32.S Signed-off-by: Sean MacLennan <smaclennan@pikatech.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-09-02 14:07:34 +10:00
Paul Mackerras	8154c5d22d	powerpc: Abstract indexing of lppaca structs Currently we have the lppaca structs as a simple array of NR_CPUS entries, taking up space in the data section of the kernel image. In future we would like to allocate them dynamically, so this abstracts out the accesses to the array, making it easier to change how we locate the lppaca for a given cpu in future. Specifically, lppaca[cpu] changes to lppaca_of(cpu). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-09-02 14:07:31 +10:00
Anton Blanchard	8c77391475	powerpc: Add 64bit csum_and_copy_to_user This adds the equivalent of csum_and_copy_from_user for the receive side so we can copy and checksum in one pass. It is modelled on the generic checksum routine. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-09-02 14:07:30 +10:00
Anton Blanchard	fdd374b62c	powerpc: Optimise 64bit csum_partial_copy_generic and add csum_and_copy_from_user We use the same core loop as the new csum_partial, adding in the stores and exception handling code. To keep things simple we do all the exception fixup in csum_and_copy_from_user. This wrapper function is modelled on the generic checksum code and is careful to always calculate a complete checksum even if we only copied part of the data to userspace. To test this I forced checksumming on over loopback and ran socklib (a simple TCP benchmark). On a POWER6 575 throughput improved by 19% with this patch. If I forced both the sender and receiver onto the same cpu (with the hope of shifting the benchmark from being cache bandwidth limited to cpu limited), adding this patch improved performance by 55% Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-09-02 14:07:30 +10:00
Anton Blanchard	9b83ecb0a3	powerpc: Optimise 64bit csum_partial The main loop of csum_partial runs very slowly on recent POWER CPUs. After some analysis on both POWER6 and POWER7 I came up with routine below. First we get the source aligned to a double word, ignoring any odd alignment to keep things simple. Then we do 64 bytes at a time, with an entry and exit limb of a further 64 bytes. On both POWER6 and POWER7 this should be as fast as we can go since we are limited by the latency of the adde instructions. To test this I forced checksumming on over loopback and ran socklib (a simple TCP benchmark). On a POWER6 575 throughput improved by 11% with this patch. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-09-02 14:07:29 +10:00
Benjamin Herrenschmidt	5f07aa7524	Merge commit 'paulus-perf/master' into next	2010-07-09 11:25:48 +10:00
Stephen Rothwell	3880ecb05b	powerpc: Fix feature-fixup tests for gcc 4.5 The feature-fixup test declare some extern void variables and then take their addresses. Fix this by declaring them as extern u8 instead. Fixes these warnings (treated as errors): CC arch/powerpc/lib/feature-fixups.o cc1: warnings being treated as errors arch/powerpc/lib/feature-fixups.c: In function 'test_cpu_macros': arch/powerpc/lib/feature-fixups.c:293:23: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:294:9: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c: In function 'test_fw_macros': arch/powerpc/lib/feature-fixups.c:306:23: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:307:9: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c: In function 'test_lwsync_macros': arch/powerpc/lib/feature-fixups.c:321:23: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:322:9: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void' arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void' Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-07-08 18:11:41 +10:00
Stephen Rothwell	7fca5dc8aa	powerpc: Fix module building for gcc 4.5 and 64 bit Gcc 4.5 is now generating out of line register save and restore in the function prefix and postfix when we use -Os. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-07-08 18:11:38 +10:00
K.Prasad	5aae8a5370	powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors Implement perf-events based hw-breakpoint interfaces for PowerPC 64-bit server (Book III S) processors. This allows access to a given location to be used as an event that can be counted or profiled by the perf_events subsystem. This is done using the DABR (data breakpoint register), which can also be used for process debugging via ptrace. When perf_event hw_breakpoint support is configured in, the perf_event subsystem manages the DABR and arbitrates access to it, and ptrace then creates a perf_event when it is requested to set a data breakpoint. [Adopted suggestions from Paul Mackerras <paulus@samba.org> to - emulate_step() all system-wide breakpoints and single-step only the per-task breakpoints - perform arch-specific cleanup before unregistration through arch_unregister_hw_breakpoint() ] Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2010-06-22 19:40:50 +10:00
Paul Mackerras	0016a4cf55	powerpc: Emulate most Book I instructions in emulate_step() This extends the emulate_step() function to handle a large proportion of the Book I instructions implemented on current 64-bit server processors. The aim is to handle all the load and store instructions used in the kernel, plus all of the instructions that appear between l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx and the VSX lxv2dx and stxv2dx instructions (implemented in POWER7). The new code can emulate user mode instructions, and checks the effective address for a load or store if the saved state is for user mode. It doesn't handle little-endian mode at present. For floating-point, Altivec/VMX and VSX instructions, it checks that the saved MSR has the enable bit for the relevant facility set, and if so, assumes that the FP/VMX/VSX registers contain valid state, and does loads or stores directly to/from the FP/VMX/VSX registers, using assembly helpers in ldstfp.S. Instructions supported now include: * Loads and stores, including some but not all VMX and VSX instructions, and lmw/stmw * Atomic loads and stores (l[dw]arx, st[dw]cx.) * Arithmetic instructions (add, subtract, multiply, divide, etc.) * Compare instructions * Rotate and mask instructions * Shift instructions * Logical instructions (and, or, xor, etc.) * Condition register logical instructions * mtcrf, cntlz[wd], exts[bhw] * isync, sync, lwsync, ptesync, eieio * Cache operations (dcbf, dcbst, dcbt, dcbtst) The overflow-checking arithmetic instructions are not included, but they appear not to be ever used in C code. This uses decimal values for the minor opcodes in the switch statements because that is what appears in the Power ISA specification, thus it is easier to check that they are correct if they are in decimal. If this is used to single-step an instruction where a data breakpoint interrupt occurred, then there is the possibility that the instruction is a lwarx or ldarx. In that case we have to be careful not to lose the reservation until we get to the matching st[wd]cx., or we'll never make forward progress. One alternative is to try to arrange that we can return from interrupts and handle data breakpoint interrupts without losing the reservation, which means not using any spinlocks, mutexes, or atomic ops (including bitops). That seems rather fragile. The other alternative is to emulate the larx/stcx and all the instructions in between. This is why this commit adds support for a wide range of integer instructions. Signed-off-by: Paul Mackerras <paulus@samba.org>	2010-06-22 19:40:29 +10:00
Andreas Schwab	ca5d0674c3	powerpc: Fix string library functions The powerpc strncmp implementation does not correctly handle a zero length, despite the claim in `0119536cd3` (Add hand-coded assembly strcmp). Additionally, all the length arguments are size_t, not int, so use PPC_LCMPI and eq instead of cmpwi and le throughout. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-05-21 17:31:08 +10:00
Jeff Mahoney	637a99022f	powerpc: Fix handling of strncmp with zero len Commit `0119536c`, which added the assembly version of strncmp to powerpc, mentions that it adds two instructions to the version from boot/string.S to allow it to handle len=0. Unfortunately, it doesn't always return 0 when that is the case. The length is passed in r5, but the return value is passed back in r3. In certain cases, this will happen to work. Otherwise it will pass back the address of the first string as the return value. This patch lifts the len <= 0 handling code from memcpy to handle that case. Reported by: Christian_Sellars@symantec.com Signed-off-by: Jeff Mahoney <jeffm@suse.com> CC: <stable@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-04-07 18:00:39 +10:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Benjamin Herrenschmidt	3d98ffbffb	powerpc: Fix lwsync feature fixup vs. modules on 64-bit Anton's commit enabling the use of the lwsync fixup mechanism on 64-bit breaks modules. The lwsync fixup section uses .long instead of the FTR_ENTRY_OFFSET macro used by other fixups sections, and thus will generate 32-bit relocations that our module loader cannot resolve. This changes it to use the same type as other feature sections. Note however that we might want to consider using 32-bit for all the feature fixup offsets and add support for R_PPC_REL32 to module_64.c instead as that would reduce the size of the kernel image. I'll leave that as an exercise for the reader for now... Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-02-26 18:29:17 +11:00
Anton Blanchard	789c299ca2	powerpc: Improve 64bit copy_tofrom_user Here is a patch from Paul Mackerras that improves the ppc64 copy_tofrom_user. The loop now does 32 bytes at a time and as well as pairing loads and stores. A quick test case that reads 8kB over and over shows the improvement: POWER6: 53% faster POWER7: 51% faster #define _XOPEN_SOURCE 500 #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> #include <sys/stat.h> #define BUFSIZE (8 * 1024) #define ITERATIONS 10000000 int main() { char tmpfile[] = "/tmp/copy_to_user_testXXXXXX"; int fd; char *buf[BUFSIZE]; unsigned long i; fd = mkstemp(tmpfile); if (fd < 0) { perror("open"); exit(1); } if (write(fd, buf, BUFSIZE) != BUFSIZE) { perror("open"); exit(1); } for (i = 0; i < 10000000; i++) { if (pread(fd, buf, BUFSIZE, 0) != BUFSIZE) { perror("pread"); exit(1); } } unlink(tmpfile); return 0; } Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-02-17 14:03:16 +11:00
Anton Blanchard	63e6c5b810	powerpc: Pair loads and stores in copy_4k_page A number of our chips like loads and stores to be paired. A small kernel module testcase shows the improvement of pairing loads and stores in copy_4k_page: POWER6: +9% POWER7: +1.5% #include <linux/module.h> #include <linux/mm.h> #define ITERATIONS 10000000 static int __init copypage_init(void) { struct timespec before, after; unsigned long i; struct page destpage, srcpage; char dest, src; destpage = alloc_page(GFP_KERNEL); srcpage = alloc_page(GFP_KERNEL); dest = page_address(destpage); src = page_address(srcpage); getnstimeofday(&before); for (i = 0; i < ITERATIONS; i++) copy_4K_page(dest, src); getnstimeofday(&after); free_page((unsigned long)dest); free_page((unsigned long)src); printk(KERN_DEBUG "copy_4K_page loop took %lu ns\n", (after.tv_sec - before.tv_sec) * NSEC_PER_SEC + (after.tv_nsec - before.tv_nsec)); return 0; } static void __exit copypage_exit(void) { } module_init(copypage_init) module_exit(copypage_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anton Blanchard"); Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-02-17 14:03:16 +11:00
Anton Blanchard	53eae2281a	powerpc: Fix lwsync patching code on 64bit do_lwsync_fixups doesn't work on 64bit, we end up writing lwsyncs to the wrong addresses: 0:mon> di c0000001000bfacc c0000001000bfacc 7c2004ac lwsync Since the lwsync section has negative offsets we need to use a signed int pointer so we sign extend the value. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-02-17 14:03:15 +11:00
Thomas Gleixner	fb3a6bbc91	locking: Convert raw_rwlock to arch_rwlock Not strictly necessary for -rt as -rt does not have non sleeping rwlocks, but it's odd to not have a consistent naming convention. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	0199c4e68d	locking: Convert __raw_spin* functions to arch_spin* Name space cleanup. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	445c89514b	locking: Convert raw_spinlock to arch_spinlock The raw_spin* namespace was taken by lockdep for the architecture specific implementations. raw_spin_* would be the ideal name space for the spinlocks which are not converted to sleeping locks in preempt-rt. Linus suggested to convert the raw_ to arch_ locks and cleanup the name space instead of using an artifical name like core_spin, atomic_spin or whatever No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Benjamin Herrenschmidt	bcd6acd51f	Merge commit 'origin/master' into next Conflicts: include/linux/kvm.h	2009-12-09 17:14:38 +11:00
Joakim Tjernlund	15d914d72a	powerpc/8xx: Start using dcbX instructions in various copy routines Now that 8xx can fixup dcbX instructions, start using them where possible like every other PowerPc arch do. Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-12-09 17:10:37 +11:00
Anton Blanchard	3cd980dbc1	powerpc: perf_event: Cleanup copy_page output by hiding setup symbol A lot of hits in "setup" doesn't make much sense, so hide this symbol and allow all the hits to end up in copy_4k_page. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-10-28 16:13:05 +11:00
Michael Ellerman	ba55bd7436	powerpc: Add configurable -Werror for arch/powerpc Add the option to build the code under arch/powerpc with -Werror. The intention is to make it harder for people to inadvertantly introduce warnings in the arch/powerpc code. It needs to be configurable so that if a warning is introduced, people can easily work around it while it's being fixed. The option is a negative, ie. don't enable -Werror, so that it will be turned on for allyes and allmodconfig builds. The default is n, in the hope that developers will build with -Werror, that will probably lead to some build breaks, I am prepared to be flamed. It's not enabled for math-emu, which is a steaming pile of warnings. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-06-16 14:15:45 +10:00
Benjamin Herrenschmidt	b16e7766d6	powerpc: Move dma-noncoherent.c from arch/powerpc/lib to arch/powerpc/mm (pre-requisite to make the next patches more palatable) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-05-27 16:32:05 +10:00
Benjamin Herrenschmidt	84532a0fc3	Revert "powerpc: Rework dma-noncoherent to use generic vmalloc layer" This reverts commit `33f00dcedb`. While it was a good idea to try to use the mm/vmalloc.c allocator instead of our own (in fact, ours is itself a dup on an old variant of the vmalloc one), unfortunately, the approach is terminally busted since dma_alloc_coherent() can be called at interrupt time or in atomic contexts and there's little chances we'll make the code in mm/vmalloc.c cope with\ that :-( Until we can get the generic code to forbid that idiocy and fix all drivers abusing it, we pretty much have no choice but revert to our custom virtual space allocator. There's also a problem with SMP safety since freeing such mapping would require an IPI which cannot be done at interrupt time. However, right now, I don't think we support any platform that is both SMP and has non-coherent DMA (don't laugh, I know such things do exist !) so we can sort that out later. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-05-27 13:33:14 +10:00
Benjamin Herrenschmidt	e14eee56c2	Merge commit 'origin/master' into next	2009-03-11 17:10:07 +11:00
Mark Nelson	f72b728bf1	powerpc: Fix 64bit __copy_tofrom_user() regression This fixes a regression introduced by commit `a4e22f02f5` ("powerpc: Update 64bit __copy_tofrom_user() using CPU_FTR_UNALIGNED_LD_STD"). The same bug that existed in the 64bit memcpy() also exists here so fix it here too. The fix is the same as that applied to memcpy() with the addition of fixes for the exception handling code required for __copy_tofrom_user(). This stops us reading beyond the end of the source region we were told to copy. Signed-off-by: Mark Nelson <markn@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-02-26 14:02:54 +11:00
Mark Nelson	e423b9ecd6	powerpc: Fix 64bit memcpy() regression This fixes a regression introduced by commit `25d6e2d7c5` ("powerpc: Update 64bit memcpy() using CPU_FTR_UNALIGNED_LD_STD"). This commit allowed CPUs that have the CPU_FTR_UNALIGNED_LD_STD CPU feature bit present to do the memcpy() with unaligned load doubles. But, along with this came a bug where our final load double would read bytes beyond a page boundary and into the next (unmapped) page. This was caught by enabling CONFIG_DEBUG_PAGEALLOC, The fix was to read only the number of bytes that we need to store rather than reading a full 8-byte doubleword and storing only a portion of that. In order to minimise the amount of existing code touched we use the original do_tail for the src_unaligned case. Below is an example of the regression, as reported by Sachin Sant: Unable to handle kernel paging request for data at address 0xc00000003f380000 Faulting instruction address: 0xc000000000039574 cpu 0x1: Vector: 300 (Data Access) at [c00000003baf3020] pc: c000000000039574: .memcpy+0x74/0x244 lr: d00000000244916c: .ext3_xattr_get+0x288/0x2f4 [ext3] sp: c00000003baf32a0 msr: 8000000000009032 dar: c00000003f380000 dsisr: 40000000 current = 0xc00000003e54b010 paca = 0xc000000000a53680 pid = 1840, comm = readahead enter ? for help [link register ] d00000000244916c .ext3_xattr_get+0x288/0x2f4 [ext3] [c00000003baf32a0] d000000002449104 .ext3_xattr_get+0x220/0x2f4 [ext3] (unreliab le) [c00000003baf3390] d00000000244a6e8 .ext3_xattr_security_get+0x40/0x5c [ext3] [c00000003baf3400] c000000000148154 .generic_getxattr+0x74/0x9c [c00000003baf34a0] c000000000333400 .inode_doinit_with_dentry+0x1c4/0x678 [c00000003baf3560] c00000000032c6b0 .security_d_instantiate+0x50/0x68 [c00000003baf35e0] c00000000013c818 .d_instantiate+0x78/0x9c [c00000003baf3680] c00000000013ced0 .d_splice_alias+0xf0/0x120 [c00000003baf3720] d00000000243e05c .ext3_lookup+0xec/0x134 [ext3] [c00000003baf37c0] c000000000131e74 .do_lookup+0x110/0x260 [c00000003baf3880] c000000000134ed0 .__link_path_walk+0xa98/0x1010 [c00000003baf3970] c0000000001354a0 .path_walk+0x58/0xc4 [c00000003baf3a20] c000000000135720 .do_path_lookup+0x138/0x1e4 [c00000003baf3ad0] c00000000013645c .path_lookup_open+0x6c/0xc8 [c00000003baf3b70] c000000000136780 .do_filp_open+0xcc/0x874 [c00000003baf3d10] c0000000001251e0 .do_sys_open+0x80/0x140 [c00000003baf3dc0] c00000000016aaec .compat_sys_open+0x24/0x38 [c00000003baf3e30] c00000000000855c syscall_exit+0x0/0x40 Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-02-26 14:02:53 +11:00
Ilya Yanok	33f00dcedb	powerpc: Rework dma-noncoherent to use generic vmalloc layer This patch rewrites consistent dma allocations support to use vmalloc layer to allocate virtual memory space from vmalloc pool and get rid of CONFIG_CONSISTENT_{START,SIZE}. This greatly simplifies the code by effectively removing a custom allocator we had for virtual space. Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-02-23 10:48:57 +11:00
Kumar Gala	16c57b3620	powerpc: Unify opcode definitions and support Create a new header that becomes a single location for defining PowerPC opcodes used by code that is either generationg instructions at runtime (fixups, debug, etc.), emulating instructions, or just compiling instructions old assemblers don't know about. We currently don't handle the floating point emulation or alignment decode as both are better handled by the specific decode support they already have. Added support for the new dcbzl, dcbal, msgsnd, tlbilx, & wait instructions since older assemblers don't know about them. Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-02-23 10:48:56 +11:00
Ananth N Mavinakayanahalli	eef336189b	powerpc: Don't emulate mr. instructions Currently emulate_step() emulates mr. instructions without updating cr0 and this can be disastrous. Don't emulate mr. This bug has been around for a while, but I am not sure if its a worthy -stable candidate. I'll leave it to Ben do decide. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2009-02-10 14:39:07 +11:00
Linus Torvalds	3c92ec8ae9	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (144 commits) powerpc/44x: Support 16K/64K base page sizes on 44x powerpc: Force memory size to be a multiple of PAGE_SIZE powerpc/32: Wire up the trampoline code for kdump powerpc/32: Add the ability for a classic ppc kernel to be loaded at 32M powerpc/32: Allow __ioremap on RAM addresses for kdump kernel powerpc/32: Setup OF properties for kdump powerpc/32/kdump: Implement crash_setup_regs() using ppc_save_regs() powerpc: Prepare xmon_save_regs for use with kdump powerpc: Remove default kexec/crash_kernel ops assignments powerpc: Make default kexec/crash_kernel ops implicit powerpc: Setup OF properties for ppc32 kexec powerpc/pseries: Fix cpu hotplug powerpc: Fix KVM build on ppc440 powerpc/cell: add QPACE as a separate Cell platform powerpc/cell: fix build breakage with CONFIG_SPUFS disabled powerpc/mpc5200: fix error paths in PSC UART probe function powerpc/mpc5200: add rts/cts handling in PSC UART driver powerpc/mpc5200: Make PSC UART driver update serial errors counters powerpc/mpc5200: Remove obsolete code from mpc5200 MDIO driver powerpc/mpc5200: Add MDMA/UDMA support to MPC5200 ATA driver ... Fix trivial conflict in drivers/char/Makefile as per Paul's directions	2008-12-28 16:54:33 -08:00
David Howells	8168b5400b	powerpc: Rename struct vm_region to avoid conflict with NOMMU Rename PowerPC's struct vm_region so that I can introduce my own global version for NOMMU. It's feasible that the PowerPC version may wish to use my global one instead. The NOMMU vm_region struct defines areas of the physical memory map that are under mmap. This may include chunks of RAM or regions of memory mapped devices, such as flash. It is also used to retain copies of file content so that shareable private memory mappings of files can be made. As such, it may be compatible with what is described in the banner comment for PowerPC's vm_region struct. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-12-21 14:21:14 +11:00
Ingo Molnar	30cd324e97	Merge branches 'tracing/ftrace', 'tracing/ring-buffer' and 'tracing/urgent' into tracing/core Conflicts: include/linux/ftrace.h	2008-12-19 09:42:40 +01:00
Paul Mackerras	c280266a32	Merge branch 'linux-2.6' into next	2008-12-18 11:06:12 +11:00
Guillaume Knispel	af4d364386	powerpc: Fix corruption error in rh_alloc_fixed() There is an error in rh_alloc_fixed() of the Remote Heap code: If there is at least one free block blk won't be NULL at the end of the search loop, so -ENOMEM won't be returned and the else branch of "if (bs == s \|\| be == e)" will be taken, corrupting the management structures. Signed-off-by: Guillaume Knispel <gknispel@proformatique.com> Acked-by: Timur Tabi <timur@freescale.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>	2008-12-17 10:06:14 -06:00
Steven Rostedt	f1eecf0e4f	powerpc/ppc32: static ftrace fixes for PPC32 Impact: fix for PowerPC 32 code There were some early init code that was not safe for static ftrace to boot on my PowerBook. This code must only use relative addressing, and static mcount performs a compare of the ftrace_trace_function pointer, and gets that with an absolute address. In the early init boot up code, this will cause a fault. This patch removes tracing from the files containing the offending functions. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-28 14:08:07 +01:00
Mark Nelson	a4e22f02f5	powerpc: Update 64bit __copy_tofrom_user() using CPU_FTR_UNALIGNED_LD_STD In exactly the same way that we updated memcpy() with new feature sections in commit `25d6e2d7c5` ("powerpc: Update 64bit memcpy() using CPU_FTR_UNALIGNED_LD_STD"), we do the same thing here for __copy_tofrom_user(). Once again this is purely a performance tweak for Cell and Power6 - this has no effect on all the other 64bit powerpc chips. We can make these same changes to __copy_tofrom_user() because the basic copy algorithm is the same as in memcpy() - this version just has all the exception handling logic needed when copying to or from userspace as well as a special case for copying whole 4K pages that are page aligned. CPU_FTR_UNALIGNED_LD_STD CPU was added in commit `4ec577a289` ("powerpc: Add new CPU feature: CPU_FTR_UNALIGNED_LD_STD"). We also make the same simple one line change from cmpldi r1,... to cmpldi cr1,... for consistency. Signed-off-by: Mark Nelson <markn@au1.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-11-19 16:04:54 +11:00
Hollis Blanchard	7526ff76f8	powerpc: Remove superfluous WARN_ON() from dma-noncoherent.c I can't tell why this WARN_ON exists, and there's no comment explaining it. Whether the pmd is present or not, pte_alloc_kernel() seems to handle both cases. Booting a 440 kernel with 64K PAGE_SIZE triggers the warning, but boot successfully completes and I see no problems beyond that. Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-11-19 16:04:52 +11:00
Mark Nelson	25d6e2d7c5	powerpc: Update 64bit memcpy() using CPU_FTR_UNALIGNED_LD_STD Update memcpy() to add two new feature sections: one for aligning the destination before copying and one for copying using aligned load and store doubles. These new feature sections will only affect Power6 and Cell because the CPU feature bit was only added to these two processors. Power6 gets its best performance in memcpy() when aligning neither the source nor the destination, while Cell gets its best performance when just the destination is aligned. But in order to save on CPU feature bits we can use the previously added CPU_FTR_CP_USE_DCBTZ feature bit to differentiate between Power6 and Cell (because CPU_FTR_CP_USE_DCBTZ was added to Cell but not Power6). The first feature section acts to nop out the branch that takes us to the code that aligns us to an eight byte boundary for the destination. We only want to nop out this branch on Power6. So the ALT_FTR_SECTION_END() for this feature section creates a test mask of the two feature bits ORed together and provides an expected result of just CPU_FTR_UNALIGNED_LD_STD, thus we nop out the branch if we're on a CPU that has CPU_FTR_UNALIGNED_LD_STD set and CPU_FTR_CP_USE_DCBTZ unset. For the second feature section added, if we're on a CPU that has the CPU_FTR_UNALIGNED_LD_STD bit set then we don't want to do the copy with aligned loads and stores (and the appropriate shifting left and right instructions), so we want to nop out the branch to .Lsrc_unaligned. The andi. used for this branch is moved to just above the branch because this allows us to nop out both instructions with just one feature section which gives us better performance and doesn't hurt readability which two separate feature sections did. Moving the andi. to just above the branch doesn't have any noticeable negative effect on the remaining 64bit processors (the ones that didn't have this feature bit added). On Cell this simple modification results in an improvement to measured memcpy() bandwidth of up to 50% in the hot cache case and up to 15% in the cold cache case. On Power6 we get memory bandwidth results that are up to three times faster in the hot cache case and up to 50% faster in the cold cache case. Commit `2a9294369b` ("powerpc: Add new CPU feature: CPU_FTR_CP_USE_DCBTZ") was where CPU_FTR_CP_USE_DCBTZ was added. To say that Cell gets its best performance in memcpy() with just the destination aligned is true but only for the reason that the indirect shift and rotate instructions, sld and srd, are microcoded on Cell. This means that either the destination or the source can be aligned, but not both, and seeing as we get better performance with the destination aligned we choose this option. While we're at it make a one line change from cmpldi r1,... to cmpldi cr1,... for consistency. Signed-off-by: Mark Nelson <markn@au1.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-11-05 22:08:29 +11:00
Benjamin Herrenschmidt	8aa2659009	powerpc: Fix DMA offset for non-coherent DMA After Becky's work we can almost have different DMA offsets between on-chip devices and PCI. Almost because there's a problem with the non-coherent DMA code that basically ignores the programmed offset to use the global one for everything. This fixes it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2008-10-14 10:35:26 +11:00
Mark Nelson	57dda6ef5b	powerpc: New copy_4K_page() This new copy_4K_page() function was originally tuned for the best performance on the Cell processor, but after testing on more 64bit powerpc chips it was found that with a small modification it either matched the performance offered by the current mainline version or bettered it by a small amount. It was found that on a Cell-based QS22 blade the amount of system time measured when compiling a 2.6.26 pseries_defconfig decreased by 4%. Using the same test, a 4-way 970MP machine saw a decrease of 2% in system time. No noticeable change was seen on Power4, Power5 or Power6. The 4096 byte page is copied in thirty-two 128 byte strides. An initial setup loop executes dcbt instructions for the whole source page and dcbz instructions for the whole destination page. To do this, the cache line size is retrieved from ppc64_caches. A new CPU feature bit, CPU_FTR_CP_USE_DCBTZ, (introduced in the previous patch) is used to make the modification to this new copy routine - on Power4, 970 and Cell the feature bit is set so the setup loop is executed, but on all other 64bit chips the setup loop is nop'ed out. Signed-off-by: Mark Nelson <markn@au1.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-09-15 11:07:42 -07:00
Kumar Gala	9c4cb82515	powerpc: Remove use of CONFIG_PPC_MERGE Now that arch/ppc is gone and CONFIG_PPC_MERGE is always set, remove the dead code associated with !CONFIG_PPC_MERGE from arch/powerpc and include/asm-powerpc. Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-08-04 13:18:17 +10:00
Andrea Righi	27ac792ca0	PAGE_ALIGN(): correctly handle 64-bit values on 32-bit architectures On 32-bit architectures PAGE_ALIGN() truncates 64-bit values to the 32-bit boundary. For example: u64 val = PAGE_ALIGN(size); always returns a value < 4GB even if size is greater than 4GB. The problem resides in PAGE_MASK definition (from include/asm-x86/page.h for example): #define PAGE_SHIFT 12 #define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE-1)) ... #define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK) The "~" is performed on a 32-bit value, so everything in "and" with PAGE_MASK greater than 4GB will be truncated to the 32-bit boundary. Using the ALIGN() macro seems to be the right way, because it uses typeof(addr) for the mask. Also move the PAGE_ALIGN() definitions out of include/asm-*/page.h in include/linux/mm.h. See also lkml discussion: http://lkml.org/lkml/2008/6/11/237 [akpm@linux-foundation.org: fix drivers/media/video/uvc/uvc_queue.c] [akpm@linux-foundation.org: fix v850] [akpm@linux-foundation.org: fix powerpc] [akpm@linux-foundation.org: fix arm] [akpm@linux-foundation.org: fix mips] [akpm@linux-foundation.org: fix drivers/media/video/pvrusb2/pvrusb2-dvb.c] [akpm@linux-foundation.org: fix drivers/mtd/maps/uclinux.c] [akpm@linux-foundation.org: fix powerpc] Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:21 -07:00
Michael Ellerman	76bfdcf71c	powerpc: Use PPC_LONG and PPC_LONG_ALIGN in lib/string.S Replace ifdef clutter with the PPC_LONG and PPC_LONG_ALIGN macros for readability. No change to the generated code. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2008-07-22 10:39:35 +10:00
Michael Ellerman	1856c02040	powerpc: Use WARN_ON(1) instead of __WARN() __WARN() is not defined for all configs, use WARN_ON(1) instead. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2008-07-22 10:39:34 +10:00
Kumar Gala	2d1b202762	powerpc: Fixup lwsync at runtime To allow for a single kernel image on e500 v1/v2/mc we need to fixup lwsync at runtime. On e500v1/v2 lwsync causes an illop so we need to patch up the code. We default to 'sync' since that is always safe and if the cpu is capable we will replace 'sync' with 'lwsync'. We introduce CPU_FTR_LWSYNC as a way to determine at runtime if this is needed. This flag could be moved elsewhere since we dont really use it for the normal CPU_FTR purpose. Finally we only store the relative offset in the fixup section to keep it as small as possible rather than using a full fixup_entry. Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-07-03 16:58:10 +10:00
Kumar Gala	5888da1876	powerpc: Fix building of feature-fixup tests on ppc32 We need to use PPC_LCMPI otherwise we get compile errors like: arch/powerpc/lib/feature-fixups-test.S: Assembler messages: arch/powerpc/lib/feature-fixups-test.S:142: Error: Unrecognized opcode: `cmpdi' arch/powerpc/lib/feature-fixups-test.S:149: Error: Unrecognized opcode: `cmpdi' arch/powerpc/lib/feature-fixups-test.S:164: Error: Unrecognized opcode: `cmpdi' Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-07-03 16:58:09 +10:00

1 2 3 4 5

228 Commits