Commit Graph

5865 Commits

Author SHA1 Message Date
Oliver O'Halloran
269e583357 powerpc/eeh: Delete eeh_pe->config_addr
The eeh_pe->config_addr field was supposed to be removed in
commit 35d64734b6 ("powerpc/eeh: Clean up PE addressing") which made it
largely unused. Finish the job.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201007040903.819081-1-oohall@gmail.com
2020-10-07 22:34:47 +11:00
Scott Cheloha
72cdd117c4 pseries/hotplug-memory: hot-add: skip redundant LMB lookup
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().

This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid.  In dlpar_add_lmb() we get this address from the LMB itself.

In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.

If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup.  The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
an invalid nid.

Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.

Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
LMBs on an upatched kernel took ~3.5 hours while a patched kernel
completed the same operation in ~2 hours:

Unpatched (12450 seconds):
Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

Patched (7065 seconds):
Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

It should be noted that the speedup grows more substantial when
hot-adding LMBs at the end of the drconf range.  This is because we
are skipping a linear LMB search.

To see the distinction, consider smaller hot-add test on the same
LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:

Unpatched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
             4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
             2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
   445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
     8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
   300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
   258,091,488,691      instructions              #    0.58  insn per cycle
                                                  #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
    70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
     3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)

           105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )

Patched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
             4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
             2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
   442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
     8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
   299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
   252,731,168,193      instructions              #    0.57  insn per cycle
                                                  #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
    68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
     3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)

           104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )

This is consistent.  An add-by-count hot-add operation adds LMBs
greedily, so LMBs near the start of the drconf range are considered
first.  On an otherwise idle LPAR with so many LMBs we would expect to
find the LMBs we need near the start of the drconf range, hence the
smaller speedup.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
2020-10-06 23:22:27 +11:00
Srikar Dronamraju
e29e9ed665 powerpc/smp: Remove get_physical_package_id
Now that cpu_core_mask has been removed and topology_core_cpumask has
been updated to use cpu_cpu_mask, we no more need
get_physical_package_id.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-4-srikar@linux.vnet.ibm.com
2020-10-06 23:22:26 +11:00
Srikar Dronamraju
4ca234a9cb powerpc/smp: Stop updating cpu_core_mask
Anton Blanchard reported that his 4096 vcpu KVM guest took around 30
minutes to boot. He also analyzed it to the time taken to iterate while
setting the cpu_core_mask.

Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU
would be equal on Power. However updating cpu_core_mask took forever to
update as its a per cpu cpumask variable. Instead cpu_cpu_mask was a per
NODE /per DIE cpumask that was shared by all the respective CPUs.

Also cpu_cpu_mask is needed from a scheduler perspective. However
cpu_core_map is an exported symbol. Hence stop updating cpu_core_map
and make it point to cpu_cpu_mask.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-3-srikar@linux.vnet.ibm.com
2020-10-06 23:22:26 +11:00
Srikar Dronamraju
4bce545903 powerpc/topology: Update topology_core_cpumask
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs.
cpu_cpu_mask is needed by scheduler, hence look at deprecating
cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user
is moved to cpu_cpu_mask.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-2-srikar@linux.vnet.ibm.com
2020-10-06 23:22:25 +11:00
Gustavo Romero
d0ffdee8ff powerpc/tm: Save and restore AMR on treclaim and trechkpt
Althought AMR is stashed in the checkpoint area, currently we don't save
it to the per thread checkpoint struct after a treclaim and so we don't
restore it either from that struct when we trechkpt. As a consequence when
the transaction is later rolled back the kernel space AMR value when the
trechkpt was done appears in userspace.

That commit saves and restores AMR accordingly on treclaim and trechkpt.
Since AMR value is also used in kernel space in other functions, it also
takes care of stashing kernel live AMR into the stack before treclaim and
before trechkpt, restoring it later, just before returning from tm_reclaim
and __tm_recheckpoint.

Is also fixes two nonrelated comments about CR and MSR.

Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200919150025.9609-1-gromero@linux.ibm.com
2020-10-06 23:22:25 +11:00
Oliver O'Halloran
35d64734b6 powerpc/eeh: Clean up PE addressing
When support for EEH on PowerNV was added a lot of pseries specific code
was made "generic" and some of the quirks of pseries EEH came along for the
ride. One of the stranger quirks is eeh_pe containing two types of PE
address: pe->addr and pe->config_addr. There reason for this appears to be
historical baggage rather than any real requirements.

On pseries EEH PEs are manipulated using RTAS calls. Each EEH RTAS call
takes a "PE configuration address" as an input which is used to identify
which EEH PE is being manipulated by the call. When initialising the EEH
state for a device the first thing we need to do is determine the
configuration address for the PE which contains the device so we can enable
EEH on that PE. This process is outlined in PAPR which is the modern
(i.e post-2003) FW specification for pseries. However, EEH support was
first described in the pSeries RISC Platform Architecture (RPA) and
although they are mostly compatible EEH is one of the areas where they are
not.

The major difference is that RPA doesn't actually have the concept of a PE.
On RPA systems the EEH RTAS calls are done on a per-device basis using the
same config_addr that would be passed to the RTAS functions to access PCI
config space (e.g. ibm,read-pci-config). The config_addr is not identical
since the function and config register offsets of the config_addr must be
set to zero. EEH operations being done on a per-device basis doesn't make a
whole lot of sense when you consider how EEH was implemented on legacy PCI
systems.

For legacy PCI(-X) systems EEH was implemented using special PCI-PCI
bridges which contained logic to detect errors and freeze the secondary
bus when one occurred. This means that the EEH enabled state is shared
among all devices behind that EEH bridge. As a result there's no way to
implement the per-device control required for the semantics specified by
RPA. It can be made to work if we assume that a separate EEH bridge exists
for each EEH capable PCI slot and there are no bridges behind those slots.
However, RPA also specifies the ibm,configure-bridge RTAS call for
re-initalising bridges behind EEH capable slots after they are reset due
to an EEH event so that is probably not a valid assumption. This
incoherence was fixed in later PAPR, which succeeded RPA. Unfortunately,
since Linux EEH support seems to have been implemented based on the RPA
spec some of the legacy assumptions were carried over (probably for POWER4
compatibility).

The fix made in PAPR was the introduction of the "PE" concept and
redefining the EEH RTAS calls (set-eeh-option, reset-slot, etc) to operate
on a per-PE basis so all devices behind an EEH bride would share the same
EEH state. The "config_addr" argument to the EEH RTAS calls became the
"PE_config_addr" and the OS was required to use the
ibm,get-config-addr-info RTAS call to find the correct PE address for the
device. When support for the new interfaces was added to Linux it was
implemented using something like:

At probe time:

	pdn->eeh_config_addr = rtas_config_addr(pdn);
	pdn->eeh_pe_config_addr = rtas_get_config_addr_info(pdn);

When performing an RTAS call:

	config_addr = pdn->eeh_config_addr;
	if (pdn->eeh_pe_config_addr)
		config_addr = pdn->eeh_pe_config_addr;

	rtas_call(..., config_addr, ...);

In other words, if the ibm,get-config-addr-info RTAS call is implemented
and returned a valid result we'd use that as the argument to the EEH
RTAS calls. If not, Linux would fall back to using the device's
config_addr. Over time these addresses have moved around going from pci_dn
to eeh_dev and finally into eeh_pe. Today the users look like this:

	config_addr = pe->config_addr;
	if (pe->addr)
		config_addr = pe->addr;

	rtas_call(..., config_addr, ...);

However, considering the EEH core always operates on a per-PE basis and
even on pseries the only per-device operation is the initial call to
ibm,set-eeh-option I'm not sure if any of this actually works on an RPA
system today. It doesn't make much sense to have the fallback address in
a generic structure either since the bulk of the code which reference it
is in pseries anyway.

The EEH core makes a token effort to support looking up a PE using the
config_addr by having two arguments to eeh_pe_get(). However, a survey of
all the callers to eeh_pe_get() shows that all bar one have the config_addr
argument hard-coded to zero.The only caller that doesn't is in
eeh_pe_tree_insert() which has:

	if (!eeh_has_flag(EEH_VALID_PE_ZERO) && !edev->pe_config_addr)
		return -EINVAL;

	pe = eeh_pe_get(hose, edev->pe_config_addr, edev->bdfn);

The third argument (config_addr) is only used if the second (pe->addr)
argument is invalid. The preceding check ensures that the call to
eeh_pe_get() will never happen if edev->pe_config_addr is invalid so there
is no situation where eeh_pe_get() will search for a PE based on the 3rd
argument. The check also means that we'll never insert a PE into the tree
where pe_config_addr is zero since EEH_VALID_PE_ZERO is never set on
pseries. All the users of the fallback address on pseries never actually
use the fallback and all the only caller that supplies something for the
config_addr argument to eeh_pe_get() never use it either. It's all dead
code.

This patch removes the fallback address from eeh_pe since nothing uses it.
Specificly, we do this by:

1) Removing pe->config_addr
2) Removing the EEH_VALID_PE_ZERO flag
3) Removing the fallback address argument to eeh_pe_get().
4) Removing all the checks for pe->addr being zero in the pseries EEH code.

This leaves us with PE's only being identified by what's in their pe->addr
field and the EEH core relying on the platform to ensure that eeh_dev's are
only inserted into the EEH tree if they're actually inside a PE.

No functional changes, I hope.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-9-oohall@gmail.com
2020-10-06 23:22:25 +11:00
Oliver O'Halloran
5d69e46a21 powerpc/eeh: Delete eeh_ops->init
No longer used since the platforms perform their EEH initialisation before
calling eeh_init().

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-4-oohall@gmail.com
2020-10-06 23:22:25 +11:00
Oliver O'Halloran
d125aedb40 powerpc/eeh: Rework EEH initialisation
Drop the EEH register / unregister ops thing and have the platform pass the
ops structure into eeh_init() directly. This takes one initcall out of the
EEH setup path and it means we're only doing EEH setup on the platforms
which actually support it. It's also less code and generally easier to
follow.

No functional changes.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-1-oohall@gmail.com
2020-10-06 23:22:24 +11:00
Nicholas Piggin
012a9a97a8 powerpc/64e: remove PACA_IRQ_EE_EDGE
This is not used anywhere.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200915114650.3980244-3-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
cdb1ea0276 powerpc/pseries: add new branch prediction security bits for link stack
The hypervisor interface has defined branch prediction security bits for
handling the link stack. Wire them up.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200825075612.224656-1-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
05504b4256 powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address
The copy buffer is implemented as a real address in the nest which is
translated from EA by copy, and used for memory access by paste. This
requires that it be invalidated by TLB invalidation.

TLBIE does invalidate the copy buffer, but TLBIEL does not. Add
cp_abort to the tlbiel sequence.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup whitespace and comment formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com
2020-10-06 23:22:23 +11:00
Nicholas Piggin
9983efa83b powerpc: untangle cputable mce include
Having cputable.h include mce.h means it pulls in a bunch of low level
headers (e.g., synch.h) which then can't use CPU_FTR_ definitions.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916030234.4110379-1-npiggin@gmail.com
2020-10-06 23:22:22 +11:00
Cédric Le Goater
ebbfeef0d8 powerpc/32: Declare stack_overflow_exception() prototype
This fixes a compile error with W=1.

CC      arch/powerpc/kernel/traps.o
../arch/powerpc/kernel/traps.c:1663:6: error: no previous prototype for ‘stack_overflow_exception’ [-Werror=missing-prototypes]
 void stack_overflow_exception(struct pt_regs *regs)
      ^~~~~~~~~~~~~~~~~~~~~~~~

Fixes: 3978eb7851 ("powerpc/32: Add early stack overflow detection with VMAP stack.")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914211007.2285999-8-clg@kaod.org
2020-09-18 20:05:25 +10:00
Michael Ellerman
39f8756145 powerpc/smp: Move ppc_md.cpu_die() to smp_ops.cpu_offline_self()
We have smp_ops->cpu_die() and ppc_md.cpu_die(). One of them offlines
the current CPU and one offlines another CPU, can you guess which is
which? Also one is in smp_ops and one is in ppc_md?

So rename ppc_md.cpu_die(), to cpu_offline_self(), because that's what
it does. And move it into smp_ops where it belongs.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-3-mpe@ellerman.id.au
2020-09-18 19:59:43 +10:00
Michael Ellerman
bf3c1464db powerpc/smp: Fold cpu_die() into its only caller
Avoid the eternal confusion between cpu_die() and __cpu_die() by
removing the former, folding it into its only caller.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-2-mpe@ellerman.id.au
2020-09-18 19:59:43 +10:00
Michael Ellerman
0b30191b27 Merge branch 'topic/irqs-off-activate-mm' into next
Merge Nick's series to add ARCH_WANT_IRQS_OFF_ACTIVATE_MM.
2020-09-18 18:14:06 +10:00
Srikar Dronamraju
72730bfc2a powerpc/smp: Create coregroup domain
Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain. Do note this patch is only creating stubs
for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
would be in a subsequent patch.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
2020-09-16 22:13:32 +10:00
Srikar Dronamraju
f9f130ff2e powerpc/numa: Detect support for coregroup
Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups, i.e one sub-group
per DIE.

Currently there are no firmwares that are exposing this grouping. Hence
allow the basis for grouping to be abstract.  Once the firmware starts
using this grouping, code would be added to detect the type of grouping
and adjust the sd domain flags accordingly.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
2020-09-16 22:13:31 +10:00
Srikar Dronamraju
f3232321db powerpc/topology: Override cpu_smt_mask
On Power9, a pair of SMT4 cores can be presented by the firmware as a SMT8
core for backward compatibility reasons, with the fusion of two SMT4 cores.
Powerpc allows LPARs to be live migrated from Power8 to Power9.  Existing
software developed/configured for Power8, expects to see a SMT8 core.

In order to maintain userspace backward compatibility (with Power8 chips in
case of Power9) in enterprise Linux systems, the topology_sibling_cpumask
has to be set to SMT8 core.

cpu_smt_mask() should generally point to the cpu mask of the SMT4 core.
Hence override the default cpu_smt_mask() to be powerpc specific
allowing for better scheduling behaviour on Power.

schbench
(latency measured in usecs, so lesser is better)
Without patch                   With patch
Latency percentiles (usec)	Latency percentiles (usec)
	50.0000th: 34           	50.0000th: 38
	75.0000th: 47           	75.0000th: 52
	90.0000th: 54           	90.0000th: 60
	95.0000th: 57           	95.0000th: 64
	*99.0000th: 62          	*99.0000th: 72
	99.5000th: 65           	99.5000th: 75
	99.9000th: 76           	99.9000th: 3452
	min=0, max=9205         	min=0, max=9344

schbench (With Cede disabled)
Without patch                   With patch
Latency percentiles (usec) 	Latency percentiles (usec)
	50.0000th: 20           	50.0000th: 21
	75.0000th: 28           	75.0000th: 29
	90.0000th: 33           	90.0000th: 34
	95.0000th: 35           	95.0000th: 37
	*99.0000th: 40          	*99.0000th: 40
	99.5000th: 48           	99.5000th: 42
	99.9000th: 94           	99.9000th: 79
	min=0, max=791          	min=0, max=791

perf bench sched pipe
usec/ops : lesser is better
Without patch
  N           Min           Max        Median           Avg        Stddev
101      5.095113      5.595269      5.204842     5.2298776    0.10762713

5.10 - 5.15 : ##################################################   23% (24)
5.15 - 5.20 : #############################################        21% (22)
5.20 - 5.25 : ##################################################   23% (24)
5.25 - 5.30 : #########################                            11% (12)
5.30 - 5.35 : ##########                                            4% (5)
5.35 - 5.40 : ########                                              3% (4)
5.40 - 5.45 : ########                                              3% (4)
5.45 - 5.50 : ####                                                  1% (2)
5.50 - 5.55 : ##                                                    0% (1)
5.55 - 5.60 : ####                                                  1% (2)

With patch
  N           Min           Max        Median           Avg        Stddev
101      5.134675      8.524719      5.207658     5.2780985    0.34911969

5.1 - 5.5 : ##################################################   94% (95)
5.5 - 5.8 : ##                                                    3% (4)
5.8 - 6.2 :                                                       0% (1)
6.2 - 6.5 :
6.5 - 6.8 :
6.8 - 7.2 :
7.2 - 7.5 :
7.5 - 7.8 :
7.8 - 8.2 :
8.2 - 8.5 :

perf bench sched pipe (cede disabled)
usec/ops : lesser is better
Without patch
  N           Min           Max        Median           Avg        Stddev
101      7.884227     12.576538      7.956474     8.0170722    0.46159054

7.9 - 8.4 : ##################################################   99% (100)
8.4 - 8.8 :
8.8 - 9.3 :
9.3 - 9.8 :
9.8 - 10.2 :
10.2 - 10.7 :
10.7 - 11.2 :
11.2 - 11.6 :
11.6 - 12.1 :
12.1 - 12.6 :

With patch
  N           Min           Max        Median           Avg        Stddev
101      7.956021      8.217284      8.015615     8.0283866   0.049844967

7.96 - 7.98 : ######################                               12% (13)
7.98 - 8.01 : ##################################################   28% (29)
8.01 - 8.03 : ####################################                 20% (21)
8.03 - 8.06 : #########################                            14% (15)
8.06 - 8.09 : ######################                               12% (13)
8.09 - 8.11 : ######                                                3% (4)
8.11 - 8.14 : ###                                                   1% (2)
8.14 - 8.17 : ###                                                   1% (2)
8.17 - 8.19 :
8.19 - 8.22 : #                                                     0% (1)

Observations: With the patch, the initial run/iteration takes a slight
longer time. This can be attributed to the fact that now we pick a CPU
from a idle core which could be sleep mode. Once we remove the cede,
state the numbers improve in favour of the patch.

ebizzy:
transactions per second (higher is better)
without patch
  N           Min           Max        Median           Avg        Stddev
100       1018433       1304470       1193208     1182315.7     60018.733

1018433 - 1047037 : ######                                                3% (3)
1047037 - 1075640 : ########                                              4% (4)
1075640 - 1104244 : ########                                              4% (4)
1104244 - 1132848 : ###############                                       7% (7)
1132848 - 1161452 : ####################################                 17% (17)
1161452 - 1190055 : ##########################                           12% (12)
1190055 - 1218659 : #############################################        21% (21)
1218659 - 1247263 : ##################################################   23% (23)
1247263 - 1275866 : ########                                              4% (4)
1275866 - 1304470 : ########                                              4% (4)

with patch
  N           Min           Max        Median           Avg        Stddev
100        967014       1292938       1208819     1185281.8     69815.851

 967014 - 999606  : ##                                                    1% (1)
 999606 - 1032199 : ##                                                    1% (1)
1032199 - 1064791 : ############                                          6% (6)
1064791 - 1097384 : ##########                                            5% (5)
1097384 - 1129976 : ##################                                    9% (9)
1129976 - 1162568 : ####################                                 10% (10)
1162568 - 1195161 : ##########################                           13% (13)
1195161 - 1227753 : ############################################         22% (22)
1227753 - 1260346 : ##################################################   25% (25)
1260346 - 1292938 : ##############                                        7% (7)

Observations: Not much changes, ebizzy is not much impacted.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200807074517.27957-2-srikar@linux.vnet.ibm.com
2020-09-16 22:05:19 +10:00
Nicholas Piggin
a665eec0a2 powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
Commit 0cef77c779 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
only if it hasn't been switched-to by the time the IPI is processed.

This relies on commit 38cf307c1f ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.

Fixes: 0cef77c779 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Depends-on: 38cf307c1f ("mm: fix kthread_use_mm() vs TLB invalidate")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-5-npiggin@gmail.com
2020-09-16 12:24:37 +10:00
Nicholas Piggin
66acd46080 powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-3-npiggin@gmail.com
2020-09-16 12:24:37 +10:00
Cédric Le Goater
3a3181e16f powerpc/pci: unmap legacy INTx interrupts when a PHB is removed
When a passthrough IO adapter is removed from a pseries machine using
hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
guest OS to clear all page table entries related to the adapter. If
some are still present, the RTAS call which isolates the PCI slot
returns error 9001 "valid outstanding translations" and the removal of
the IO adapter fails. This is because when the PHBs are scanned, Linux
maps automatically the INTx interrupts in the Linux interrupt number
space but these are never removed.

To solve this problem, we introduce a PPC platform specific
pcibios_remove_bus() routine which clears all interrupt mappings when
the bus is removed. This also clears the associated page table entries
of the ESB pages when using XIVE.

For this purpose, we record the logical interrupt numbers of the
mapped interrupt under the PHB structure and let pcibios_remove_bus()
do the clean up.

Since some PCI adapters, like GPUs, use the "interrupt-map" property
to describe interrupt mappings other than the legacy INTx interrupts,
we can not restrict the size of the mapping array to PCI_NUM_INTX. The
number of interrupt mappings is computed from the "interrupt-map"
property and the mapping array is allocated accordingly.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200807101854.844619-1-clg@kaod.org
2020-09-15 22:13:39 +10:00
Nicholas Piggin
ffd2961bb4 powerpc/powernv/idle: add a basic stop 0-3 driver for POWER10
This driver does not restore stop > 3 state, so it limits itself
to states which do not lose full state or TB.

The POWER10 SPRs are sufficiently different from P9 that it seems
easier to split out the P10 code. The POWER10 deep sleep code
(e.g., the BHRB restore) has been taken out, but it can be re-added
when stop > 3 support is added.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Pratik Rajesh Sampat<psampat@linux.ibm.com>
Tested-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Reviewed-by: Pratik Rajesh Sampat<psampat@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819094700.493399-1-npiggin@gmail.com
2020-09-15 22:13:38 +10:00
Christophe Leroy
532ed1900d powerpc/process: Remove useless #ifdef CONFIG_SPE
cpu_has_feature(CPU_FTR_SPE) returns false when CONFIG_SPE is
not set.

There is no need to enclose the test in an #ifdef CONFIG_SPE.
Remove it.

CPU_FTR_SPE only exists on 32 bits. Define it as 0 on 64 bits.

We have a couple of places like:

 #ifdef CONFIG_SPE
	if (cpu_has_feature(CPU_FTR_SPE)) {
		do_something_that_requires_CONFIG_SPE
	} else {
		return -EINVAL;
	}
 #else
	return -EINVAL;
 #endif

Replace them by a cleaner version:

	if (cpu_has_feature(CPU_FTR_SPE)) {
 #ifdef CONFIG_SPE
		do_something_that_requires_CONFIG_SPE
 #endif
	} else {
		return -EINVAL;
	}

When CONFIG_SPE is not set, this resolves to an unconditional
return of -EINVAL

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/698df8387555765b70ea42e4a7fa48141c309c1f.1597643221.git.christophe.leroy@csgroup.eu
2020-09-15 22:13:36 +10:00
Christophe Leroy
7fdf966bed powerpc/uaccess: Remove __put_user_asm() and __put_user_asm2()
__put_user_asm() and __put_user_asm2() are not used anymore.

Remove them.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d66c4a372738d2fbd81f433ca86e4295871ace6a.1599216721.git.christophe.leroy@csgroup.eu
2020-09-15 22:13:32 +10:00
Christophe Leroy
ee0a49a687 powerpc/uaccess: Switch __put_user_size_allowed() to __put_user_asm_goto()
__put_user_asm_goto() provides more flexibility to GCC and avoids using
a local variable to tell if the write succeeded or not.
GCC can then avoid implementing a cmp in the fast path.

See the difference for a small function like the PPC64 version of
save_general_regs() in arch/powerpc/kernel/signal_32.c:

Before the patch (unreachable nop removed):

0000000000000c10 <.save_general_regs>:
     c10:	39 20 00 2c 	li      r9,44
     c14:	39 40 00 00 	li      r10,0
     c18:	7d 29 03 a6 	mtctr   r9
     c1c:	38 c0 00 00 	li      r6,0
     c20:	48 00 00 14 	b       c34 <.save_general_regs+0x24>
     c30:	42 40 00 40 	bdz     c70 <.save_general_regs+0x60>
     c34:	28 2a 00 27 	cmpldi  r10,39
     c38:	7c c8 33 78 	mr      r8,r6
     c3c:	79 47 1f 24 	rldicr  r7,r10,3,60
     c40:	39 20 00 01 	li      r9,1
     c44:	41 82 00 0c 	beq     c50 <.save_general_regs+0x40>
     c48:	7d 23 38 2a 	ldx     r9,r3,r7
     c4c:	79 29 00 20 	clrldi  r9,r9,32
     c50:	91 24 00 00 	stw     r9,0(r4)
     c54:	2c 28 00 00 	cmpdi   r8,0
     c58:	39 4a 00 01 	addi    r10,r10,1
     c5c:	38 84 00 04 	addi    r4,r4,4
     c60:	41 82 ff d0 	beq     c30 <.save_general_regs+0x20>
     c64:	38 60 ff f2 	li      r3,-14
     c68:	4e 80 00 20 	blr
     c70:	38 60 00 00 	li      r3,0
     c74:	4e 80 00 20 	blr

0000000000000000 <.fixup>:
  cc:	39 00 ff f2 	li      r8,-14
  d0:	48 00 00 00 	b       d0 <.fixup+0xd0>
			d0: R_PPC64_REL24	.text+0xc54

After the patch:

0000000000001490 <.save_general_regs>:
    1490:	39 20 00 2c 	li      r9,44
    1494:	39 40 00 00 	li      r10,0
    1498:	7d 29 03 a6 	mtctr   r9
    149c:	60 00 00 00 	nop
    14a0:	28 2a 00 27 	cmpldi  r10,39
    14a4:	79 48 1f 24 	rldicr  r8,r10,3,60
    14a8:	39 20 00 01 	li      r9,1
    14ac:	41 82 00 0c 	beq     14b8 <.save_general_regs+0x28>
    14b0:	7d 23 40 2a 	ldx     r9,r3,r8
    14b4:	79 29 00 20 	clrldi  r9,r9,32
    14b8:	91 24 00 00 	stw     r9,0(r4)
    14bc:	39 4a 00 01 	addi    r10,r10,1
    14c0:	38 84 00 04 	addi    r4,r4,4
    14c4:	42 00 ff dc 	bdnz    14a0 <.save_general_regs+0x10>
    14c8:	38 60 00 00 	li      r3,0
    14cc:	4e 80 00 20 	blr
    14d0:	38 60 ff f2 	li      r3,-14
    14d4:	4e 80 00 20 	blr

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/94ba5a5138f99522e1562dbcdb38d31aa790dc89.1599216721.git.christophe.leroy@csgroup.eu
2020-09-15 22:13:32 +10:00
Christophe Leroy
fcf1f26895 powerpc/uaccess: Add pre-update addressing to __put_user_asm_goto()
Enable pre-update addressing mode in __put_user_asm_goto()

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/346f65d677adb11865f7762c25a1ca3c64404ba5.1599216023.git.christophe.leroy@csgroup.eu
2020-09-15 22:13:31 +10:00
Christophe Leroy
e47168f3d1 powerpc/8xx: Support 16k hugepages with 4k pages
The 8xx has 4 page sizes: 4k, 16k, 512k and 8M

4k and 16k can be selected at build time as standard page sizes,
and 512k and 8M are hugepages.

When 4k standard pages are selected, 16k pages are not available.

Allow 16k pages as hugepages when 4k pages are used.

To allow that, implement arch_make_huge_pte() which receives
the necessary arguments to allow setting the PTE in accordance
with the page size:
- 512 k pages must have _PAGE_HUGE and _PAGE_SPS. They are set
by pte_mkhuge(). arch_make_huge_pte() does nothing.
- 16 k pages must have only _PAGE_SPS. arch_make_huge_pte() clears
_PAGE_HUGE.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a518abc29266a708dfbccc8fce9ae6694fe4c2c6.1598862623.git.christophe.leroy@csgroup.eu
2020-09-15 22:13:31 +10:00
Christophe Leroy
175a999915 powerpc/8xx: Refactor calculation of number of entries per PTE in page tables
On 8xx, the number of entries occupied by a PTE in the page tables
depends on the size of the page. At the time being, this calculation
is done in two places: in pte_update() and in set_huge_pte_at()

Refactor this calculation into a helper called
number_of_cells_per_pte(). For the time being, the val param is
unused. It will be used by following patch.

Instead of opencoding is_hugepd(), use hugepd_ok() with a forward
declaration.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f6ea2483c2c389567b007945948f704d18cfaeea.1598862623.git.christophe.leroy@csgroup.eu
2020-09-15 22:13:31 +10:00
Finn Thain
66943005cc powerpc/tau: Use appropriate temperature sample interval
According to the MPC750 Users Manual, the SITV value in Thermal
Management Register 3 is 13 bits long. The present code calculates the
SITV value as 60 * 500 cycles. This would overflow to give 10 us on
a 500 MHz CPU rather than the intended 60 us. (But according to the
Microprocessor Datasheet, there is also a factor of 266 that has to be
applied to this value on certain parts i.e. speed sort above 266 MHz.)
Always use the maximum cycle count, as recommended by the Datasheet.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/896f542e5f0f1d6cf8218524c2b67d79f3d69b3c.1599260540.git.fthain@telegraphics.com.au
2020-09-15 22:13:24 +10:00
Aneesh Kumar K.V
b32d5d7e92 powerpc/mm/book3s: Split radix and hash MAX_PHYSMEM limit
MAX_PHYSMEM #define is used along with sparsemem to determine the SECTION_SHIFT
value. Powerpc also uses the same value to limit the max memory enabled on the
system. With 4K PAGE_SIZE and hash translation mode, we want to limit the max
memory enabled to 64TB due to page table size restrictions. However, with
radix translation, we don't have these restrictions. Hence split the radix
and hash MA_PHYSMEM limit and use different limit for each of them.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200608070904.387440-4-aneesh.kumar@linux.ibm.com
2020-09-15 22:13:22 +10:00
Aneesh Kumar K.V
7746406baa powerpc/book3s64/hash/4k: Support large linear mapping range with 4K
With commit: 0034d395f8 ("powerpc/mm/hash64: Map all the kernel
regions in the same 0xc range"), we now split the 64TB address range
into 4 contexts each of 16TB. That implies we can do only 16TB linear
mapping.

On some systems, eg. Power9, memory attached to nodes > 0 will appear
above 16TB in the linear mapping. This resulted in kernel crash when
we boot such systems in hash translation mode with 4K PAGE_SIZE.

This patch updates the kernel mapping such that we now start supporting upto
61TB of memory with 4K. The kernel mapping now looks like below 4K PAGE_SIZE
and hash translation.

    vmalloc start     = 0xc0003d0000000000
    IO start          = 0xc0003e0000000000
    vmemmap start     = 0xc0003f0000000000

Our MAX_PHYSMEM_BITS for 4K is still 64TB even though we can only map 61TB.
We prevent bolt mapping anything outside 61TB range by checking against
H_VMALLOC_START.

Fixes: 0034d395f8 ("powerpc/mm/hash64: Map all the kernel regions in the same 0xc range")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200608070904.387440-3-aneesh.kumar@linux.ibm.com
2020-09-15 22:13:22 +10:00
Ravi Bangoria
fa725cc53d powerpc/watchpoint/ptrace: Introduce PPC_DEBUG_FEATURE_DATA_BP_ARCH_31
PPC_DEBUG_FEATURE_DATA_BP_ARCH_31 can be used to determine whether
we are running on an ISA 3.1 compliant machine. Which is needed to
determine DAR behaviour, 512 byte boundary limit etc. This was
requested by Pedro Miraglia Franco de Carvalho for extending
watchpoint features in gdb. Note that availability of 2nd DAWR is
independent of this flag and should be checked using
ppc_debug_info->num_data_bps.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200902042945.129369-8-ravi.bangoria@linux.ibm.com
2020-09-15 22:13:20 +10:00
Ravi Bangoria
5b905d7798 powerpc/watchpoint: Fix exception handling for CONFIG_HAVE_HW_BREAKPOINT=N
On powerpc, ptrace watchpoint works in one-shot mode. i.e. kernel
disables event every time it fires and user has to re-enable it.
Also, in case of ptrace watchpoint, kernel notifies ptrace user
before executing instruction.

With CONFIG_HAVE_HW_BREAKPOINT=N, kernel is missing to disable
ptrace event and thus it's causing infinite loop of exceptions.
This is especially harmful when user watches on a data which is
also read/written by kernel, eg syscall parameters. In such case,
infinite exceptions happens in kernel mode which causes soft-lockup.

Fixes: 9422de3e95 ("powerpc: Hardware breakpoints rewrite to handle non DABR breakpoint registers")
Reported-by: Pedro Miraglia Franco de Carvalho <pedromfc@linux.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200902042945.129369-6-ravi.bangoria@linux.ibm.com
2020-09-15 22:13:20 +10:00
Ravi Bangoria
edc8dd99b2 powerpc/watchpoint: Move DAWR detection logic outside of hw_breakpoint.c
Power10 hw has multiple DAWRs but hw doesn't tell which DAWR caused
the exception. So we have a sw logic to detect that in hw_breakpoint.c.
But hw_breakpoint.c gets compiled only with CONFIG_HAVE_HW_BREAKPOINT=Y.
Move DAWR detection logic outside of hw_breakpoint.c so that it can be
reused when CONFIG_HAVE_HW_BREAKPOINT is not set.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200902042945.129369-5-ravi.bangoria@linux.ibm.com
2020-09-15 22:13:19 +10:00
Ravi Bangoria
4759c11ed2 powerpc/watchpoint: Fix quadword instruction handling on p10 predecessors
On p10 predecessors, watchpoint with quadword access is compared at
quadword length. If the watch range is doubleword or less than that
in a first half of quadword aligned 16 bytes, and if there is any
unaligned quadword access which will access only the 2nd half, the
handler should consider it as extraneous and emulate/single-step it
before continuing.

Fixes: 74c6881019 ("powerpc/watchpoint: Prepare handler to handle more than one watchpoint")
Reported-by: Pedro Miraglia Franco de Carvalho <pedromfc@linux.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200902042945.129369-2-ravi.bangoria@linux.ibm.com
2020-09-15 22:12:25 +10:00
Thiago Jung Bauermann
eae9eec476 powerpc/pseries/svm: Allocate SWIOTLB buffer anywhere in memory
POWER secure guests (i.e., guests which use the Protected Execution
Facility) need to use SWIOTLB to be able to do I/O with the
hypervisor, but they don't need the SWIOTLB memory to be in low
addresses since the hypervisor doesn't have any addressing limitation.

This solves a SWIOTLB initialization problem we are seeing in secure
guests with 128 GB of RAM: they are configured with 4 GB of
crashkernel reserved memory, which leaves no space for SWIOTLB in low
addresses.

To do this, we use mostly the same code as swiotlb_init(), but
allocate the buffer using memblock_alloc() instead of
memblock_alloc_low().

Fixes: 2efbc58f15 ("powerpc/pseries/svm: Force SWIOTLB for secure guests")
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818221126.391073-1-bauerman@linux.ibm.com
2020-09-14 23:07:14 +10:00
Michael Ellerman
960e370813 Merge branch 'fixes' into next
Bring in our fixes branch for this cycle which avoids some small
conflicts with upcoming commits.
2020-09-14 22:57:18 +10:00
Michael Ellerman
529d2bd56a powerpc/64: Remove unused generic_secondary_thread_init()
The last caller was removed in 2014 in commit fb5a515704 ("powerpc:
Remove platforms/wsp and associated pieces").

As Jordan noticed even though there are no callers, the code above in
fsl_secondary_thread_init() falls through into
generic_secondary_thread_init(). So we can remove the _GLOBAL but not
the body of the function.

However because fsl_secondary_thread_init() is inside #ifdef
CONFIG_PPC_BOOK3E, we can never reach the body of
generic_secondary_thread_init() unless CONFIG_PPC_BOOK3E is enabled,
so we can wrap the whole thing in a single #ifdef.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015704.1976364-1-mpe@ellerman.id.au
2020-09-08 22:24:17 +10:00
Christophe Leroy
2f279eeb68 powerpc/uaccess: Add pre-update addressing to __get_user_asm() and __put_user_asm()
Enable pre-update addressing mode in __get_user_asm() and __put_user_asm()

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/13041c7df39e89ddf574ea0cdc6dedfdd9734140.1597235091.git.christophe.leroy@csgroup.eu
2020-09-08 22:23:22 +10:00
Christophe Leroy
c20beffeec powerpc/uaccess: Use flexible addressing with __put_user()/__get_user()
At the time being, __put_user()/__get_user() and friends only use
D-form addressing, with 0 offset. Ex:

	lwz	reg1, 0(reg2)

Give the compiler the opportunity to use other adressing modes
whenever possible, to get more optimised code.

Hereunder is a small exemple:

struct test {
	u32 item1;
	u16 item2;
	u8 item3;
	u64 item4;
};

int set_test_user(struct test __user *from, struct test __user *to)
{
	int err;
	u32 item1;
	u16 item2;
	u8 item3;
	u64 item4;

	err = __get_user(item1, &from->item1);
	err |= __get_user(item2, &from->item2);
	err |= __get_user(item3, &from->item3);
	err |= __get_user(item4, &from->item4);

	err |= __put_user(item1, &to->item1);
	err |= __put_user(item2, &to->item2);
	err |= __put_user(item3, &to->item3);
	err |= __put_user(item4, &to->item4);

	return err;
}

Before the patch:

00000df0 <set_test_user>:
 df0:	94 21 ff f0 	stwu    r1,-16(r1)
 df4:	39 40 00 00 	li      r10,0
 df8:	93 c1 00 08 	stw     r30,8(r1)
 dfc:	93 e1 00 0c 	stw     r31,12(r1)
 e00:	7d 49 53 78 	mr      r9,r10
 e04:	80 a3 00 00 	lwz     r5,0(r3)
 e08:	38 e3 00 04 	addi    r7,r3,4
 e0c:	7d 46 53 78 	mr      r6,r10
 e10:	a0 e7 00 00 	lhz     r7,0(r7)
 e14:	7d 29 33 78 	or      r9,r9,r6
 e18:	39 03 00 06 	addi    r8,r3,6
 e1c:	7d 46 53 78 	mr      r6,r10
 e20:	89 08 00 00 	lbz     r8,0(r8)
 e24:	7d 29 33 78 	or      r9,r9,r6
 e28:	38 63 00 08 	addi    r3,r3,8
 e2c:	7d 46 53 78 	mr      r6,r10
 e30:	83 c3 00 00 	lwz     r30,0(r3)
 e34:	83 e3 00 04 	lwz     r31,4(r3)
 e38:	7d 29 33 78 	or      r9,r9,r6
 e3c:	7d 43 53 78 	mr      r3,r10
 e40:	90 a4 00 00 	stw     r5,0(r4)
 e44:	7d 29 1b 78 	or      r9,r9,r3
 e48:	38 c4 00 04 	addi    r6,r4,4
 e4c:	7d 43 53 78 	mr      r3,r10
 e50:	b0 e6 00 00 	sth     r7,0(r6)
 e54:	7d 29 1b 78 	or      r9,r9,r3
 e58:	38 e4 00 06 	addi    r7,r4,6
 e5c:	7d 43 53 78 	mr      r3,r10
 e60:	99 07 00 00 	stb     r8,0(r7)
 e64:	7d 23 1b 78 	or      r3,r9,r3
 e68:	38 84 00 08 	addi    r4,r4,8
 e6c:	93 c4 00 00 	stw     r30,0(r4)
 e70:	93 e4 00 04 	stw     r31,4(r4)
 e74:	7c 63 53 78 	or      r3,r3,r10
 e78:	83 c1 00 08 	lwz     r30,8(r1)
 e7c:	83 e1 00 0c 	lwz     r31,12(r1)
 e80:	38 21 00 10 	addi    r1,r1,16
 e84:	4e 80 00 20 	blr

After the patch:

00000dbc <set_test_user>:
 dbc:	39 40 00 00 	li      r10,0
 dc0:	7d 49 53 78 	mr      r9,r10
 dc4:	80 03 00 00 	lwz     r0,0(r3)
 dc8:	7d 48 53 78 	mr      r8,r10
 dcc:	a1 63 00 04 	lhz     r11,4(r3)
 dd0:	7d 29 43 78 	or      r9,r9,r8
 dd4:	7d 48 53 78 	mr      r8,r10
 dd8:	88 a3 00 06 	lbz     r5,6(r3)
 ddc:	7d 29 43 78 	or      r9,r9,r8
 de0:	7d 48 53 78 	mr      r8,r10
 de4:	80 c3 00 08 	lwz     r6,8(r3)
 de8:	80 e3 00 0c 	lwz     r7,12(r3)
 dec:	7d 29 43 78 	or      r9,r9,r8
 df0:	7d 43 53 78 	mr      r3,r10
 df4:	90 04 00 00 	stw     r0,0(r4)
 df8:	7d 29 1b 78 	or      r9,r9,r3
 dfc:	7d 43 53 78 	mr      r3,r10
 e00:	b1 64 00 04 	sth     r11,4(r4)
 e04:	7d 29 1b 78 	or      r9,r9,r3
 e08:	7d 43 53 78 	mr      r3,r10
 e0c:	98 a4 00 06 	stb     r5,6(r4)
 e10:	7d 23 1b 78 	or      r3,r9,r3
 e14:	90 c4 00 08 	stw     r6,8(r4)
 e18:	90 e4 00 0c 	stw     r7,12(r4)
 e1c:	7c 63 53 78 	or      r3,r3,r10
 e20:	4e 80 00 20 	blr

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c27bc4e598daf3bbb225de7a1f5c52121cf1e279.1597235091.git.christophe.leroy@csgroup.eu
2020-09-02 11:00:23 +10:00
Scott Cheloha
e5e179aa3a pseries/drmem: don't cache node id in drmem_lmb struct
At memory hot-remove time we can retrieve an LMB's nid from its
corresponding memory_block.  There is no need to store the nid
in multiple locations.

Note that lmb_to_memblock() uses find_memory_block() to get the
corresponding memory_block.  As find_memory_block() runs in sub-linear
time this approach is negligibly slower than what we do at present.

In exchange for this lookup at hot-remove time we no longer need to
call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
spares us an O(n^2) initialization during boot.

On systems with many LMBs that initialization overhead is palpable and
disruptive.  For example, on a box with 249854 LMBs we're seeing
drmem_init() take upwards of 30 seconds to complete:

[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
[   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
[   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
[   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
[   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
[   80.604431] GPR12: 0000000000000000 c00000001e8ec200
[   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
[   89.047390] drmem: 249854 LMB(s)

With a patched kernel on the same machine we're no longer seeing the
soft lockup.  drmem_init() now completes in negligible time, even when
the LMB count is large.

Fixes: b2d3b5ee66 ("powerpc/pseries: Track LMB nid instead of using device tree")
Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200811015115.63677-1-cheloha@linux.ibm.com
2020-09-02 11:00:21 +10:00
Christophe Leroy
de39b19452 powerpc: Rewrite 4xx flush_cache_instruction() in C
Nothing prevents flush_cache_instruction() from being writen in C.

Do it to improve readability and maintainability.

This function is very small and isn't called from assembly,
make it static inline in asm/cacheflush.h

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/93d93fc69b4b3ad3ceba2fc0756333c0c0245bb7.1597384512.git.christophe.leroy@csgroup.eu
2020-09-02 11:00:21 +10:00
Christophe Leroy
f663f33120 powerpc: Move flush_instruction_cache() prototype in asm/cacheflush.h
flush_instruction_cache() belongs to the cache flushing function
family.

Move its prototype in asm/cacheflush.h

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/993445b5227e8ca2f0e38bcc9ea3dfea6e865920.1597384512.git.christophe.leroy@csgroup.eu
2020-09-02 11:00:21 +10:00
Nathan Lynch
9d6792ffe1 powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and
unfortunately lookups take the form of linear searches. As long as
this is the case, traversals have the potential to monopolize the CPU
and provoke lockup reports, workqueue stalls, and the like unless
they explicitly yield.

Rather than placing cond_resched() calls within various
for_each_drmem_lmb() loop blocks in the code, put it in the iteration
expression of the loop macro itself so users can't omit it.

Introduce a drmem_lmb_next() iteration helper function which calls
cond_resched() at a regular interval during array traversal. Each
iteration of the loop in DLPAR code paths can involve around ten RTAS
calls which can each take up to 250us, so this ensures the check is
performed at worst every few milliseconds.

Fixes: 6c6ea53725 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200813151131.2070161-1-nathanl@linux.ibm.com
2020-09-02 11:00:20 +10:00
Christophe Leroy
e53281bc21 powerpc: Drop _nmask_and_or_msr()
_nmask_and_or_msr() is only used at two places to set MSR_IP.

The SYNC is unnecessary as the users are not PowerPC 601.

Can be easily writen in C.

Do it, and drop _nmask_and_or_msr()

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c2d2b8dfb8dd677026b26dffc8d31070c38a6b89.1597388079.git.christophe.leroy@csgroup.eu
2020-09-02 11:00:20 +10:00
Scott Cheloha
59562b5c33 powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are
useful to modules outside of perf/, so move them into asm/hvcall.h to live
alongside the other powerpc hypercall structs.

Leave the perf-specific GPCI stuff in perf/hv-gpci.h.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727184605.2945095-1-cheloha@linux.ibm.com
2020-09-02 11:00:20 +10:00
Christophe Leroy
82eb179242 powerpc: drop hard_reset_now() and poweroff_now() declaration
Those function have never existed. Drop their declaration.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/edcdd72a36495d25213c0256c8022367458e0d19.1596716418.git.christophe.leroy@csgroup.eu
2020-09-02 11:00:20 +10:00
Christophe Leroy
63442de430 powerpc/fpu: Drop cvt_fd() and cvt_df()
Those two functions have been unused since commit identified below.
Drop them.

Fixes: 31bfdb036f ("powerpc: Use instruction emulation infrastructure to handle alignment faults")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d5641ada199b8dd2af16ad00a66084cf974f2704.1596716418.git.christophe.leroy@csgroup.eu
2020-09-02 11:00:19 +10:00